Notes for Wednesday, May 2 (to prepare you for May the Fourth)
- Many visual presentations include models -- lines or clusterings or some other visual element that is derived from the data and summarizes it but is not itself given in the data file. Lines can be very powerful. They draw your eye to trends that might not be easily visible in a noisy scatter plot.
- But how do we know that these are meaningful? Often we hear about "p values" or "t tests", but in my experience even students who have studied these concepts have difficulty describing what they mean or why they are important. In this example we'll see that the p-value for a linear regression slope gives almost exactly the same result as a simple computational procedure.
- A simple x/y linear model argues that knowing the x value for an observation should tell you something about the y value for that observation. If that's true, we can estimate the strength of that relationship by calculating the slope of the linear regression between x and y from all the x,y pairs in the data set. The problem is, if it's not true, we can still calculate a slope, and with finite, noisy data, it's very unlikely that the slope will be exactly zero.
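The point above can be seen directly in code. This is a minimal sketch (assuming NumPy and SciPy are available; none of these variable names come from the course materials): fit a least-squares line to data where y has no relationship to x at all, and notice that the slope still comes out nonzero.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = rng.normal(0, 1, size=50)   # pure noise: y has NO relationship to x

fit = linregress(x, y)
# The slope is small, but with finite, noisy data it is almost never
# exactly zero -- so a nonzero slope by itself proves nothing.
print(fit.slope)
```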
- What we can do is a "permutation test" that simulates what a similar dataset would look like if there were no relationship between x and y. We keep all the same x's and all the same y's, but shuffle which y value is paired with which x. Doing this repeatedly and recording the linear regression line we get each time gives us a visual representation of what slopes we might expect for a data set with the same x and y values but no relationship.
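The shuffling step above can be sketched as follows (a hedged example, not the course's actual code; the variable names are my own):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=40)
y = 0.5 * x + rng.normal(0, 2, size=40)   # a weak true relationship plus noise

original_slope = linregress(x, y).slope

permuted_slopes = []
for _ in range(1000):
    y_shuffled = rng.permutation(y)   # same y values, pairing with x destroyed
    permuted_slopes.append(linregress(x, y_shuffled).slope)
# Plotting each permuted line (red) behind the original line (blue) shows
# whether the original slope stands out from what shuffling alone produces.
```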
- If the red permuted linear model lines cover up the original blue model line, then the original model falls within the range that we would expect to see by random chance, given the variability of the dataset. The result may still be interesting, and even statistically significant, but it is less convincing. If the original model sits far outside the range of the replicated lines, the result is statistically significant.
- This brings us back to p-values. The function at the bottom of the page calculates the p-value for a given dataset, in a way that would take several weeks of grad-level math statistics to explain. But as you add more and more permutations of the data, the proportion of slopes that are steeper (up OR down) than the original model's slope should converge to something very close to the p-value.
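The convergence claim above can be checked numerically. In this sketch (my own illustration, assuming SciPy's `linregress`, which reports a two-sided p-value for the slope), we count how often a permuted slope is at least as steep, in either direction, as the original:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=30)
y = 0.3 * x + rng.normal(0, 3, size=30)

fit = linregress(x, y)
n_perm = 5000
# Count permuted slopes at least as steep (up OR down) as the original.
steeper = sum(
    abs(linregress(x, rng.permutation(y)).slope) >= abs(fit.slope)
    for _ in range(n_perm)
)
print(f"analytic p-value:    {fit.pvalue:.4f}")
print(f"permutation p-value: {steeper / n_perm:.4f}")
```

As `n_perm` grows, the two numbers should agree more and more closely.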
- Try different numbers of samples, change the true slope, and add more or less noise. How do the p-value and the appearance of the permutation tests change?
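One way to run the exercise above systematically is a small generator like this (`make_dataset` is a hypothetical helper of my own, not part of the course code) that takes the three knobs -- sample size, true slope, noise -- and reports the regression p-value for each setting:

```python
import numpy as np
from scipy.stats import linregress

def make_dataset(n_samples, true_slope, noise_sd, seed=0):
    """Generate x,y pairs with a known true slope plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, size=n_samples)
    y = true_slope * x + rng.normal(0, noise_sd, size=n_samples)
    return x, y

# More noise raises the p-value; more samples lower it again.
for n, slope, noise in [(20, 0.5, 1.0), (20, 0.5, 5.0), (200, 0.5, 5.0)]:
    x, y = make_dataset(n, slope, noise)
    print(n, slope, noise, round(linregress(x, y).pvalue, 4))
```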