Notes for Friday, April 14
There was a bug in the confidence intervals! I even fixed it last year. Try a few different generating functions to get a sense of what the permutation-test CIs should look like.
- Look at the job growth numbers. Before 2011, we see the economic apocalypse of the Great Recession. Half a million more people per month were losing jobs than gaining them. But after 2011, things have been extremely stable. There's no obvious pattern of ups and downs from month to month.
- Sir Ronald Fisher defined a test for pattern early in the last century. Apply a function to a dataset — it could be anything, but here it's the slope of a 2D linear model — and you get a value. The test asks whether that value is unlikely to have arisen by random chance. We simulate that by actually imposing randomness on the data.
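A minimal sketch of the "function applied to a dataset" idea, using made-up data (the variable names and numbers are mine, not from the notes) and the slope of a simple linear fit as the test statistic:

```python
import numpy as np

# Hypothetical data: x = month index, y = some noisy measurement with a
# built-in upward trend of 0.5 per month (both invented for illustration).
rng = np.random.default_rng(0)
x = np.arange(60)
y = 0.5 * x + rng.normal(0, 5, size=60)

# The "function applied to the dataset": the slope of a degree-1 fit.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope)  # should land near the true trend of 0.5
```

Any scalar summary would work here; the slope is just a convenient choice when the question is "is there a linear trend?"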
- How can we estimate the strength of a correlation? If there are two variables, x and y, then the strongest (positive) correlation would be if the lowest x value is paired with the lowest y value, and so on until the highest x value is paired with the highest y value. There's a 1 in N! chance that that will happen by random chance if there's no real connection between variables. That would be strong evidence, but we usually don't get such a clear indication.
- The trick of the permutation test is to sample a number of random datasets with the same x values and the same y values, but in different combinations. Do we see values that look like our actual value, or are they very different? In this experiment we'll shuffle the data many times, and each time record what the slope of the fitted line turns out to be.
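The shuffling step could look like the following sketch (the dataset, the 0.3 trend, and the 2000-shuffle count are all my own assumptions for illustration): shuffling y breaks any real x–y pairing while keeping both marginal distributions intact, which is exactly the "same values, different combinations" idea above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data with a modest real association (invented for this sketch).
x = np.arange(50, dtype=float)
y = 0.3 * x + rng.normal(0, 4, size=50)

observed_slope = np.polyfit(x, y, deg=1)[0]

# Permutation test: shuffle y against x, refit, record the slope each time.
null_slopes = []
for _ in range(2000):
    y_shuffled = rng.permutation(y)
    null_slopes.append(np.polyfit(x, y_shuffled, deg=1)[0])
null_slopes = np.array(null_slopes)

# The shuffled slopes should scatter around zero, since shuffling
# destroys any genuine relationship.
print(null_slopes.mean(), null_slopes.std())
```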
- When do we decide that there's a pattern? Look at the range of linear regression lines we're getting from the fake, shuffled data. Does our real value look plausible within that range? If it doesn't, we have good evidence that we have an association that's unlikely to be just bad luck in our data collection. If it does, we shouldn't necessarily throw out our result, but we shouldn't bet the company on it.
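One common way to make the "does our real value look plausible in that range?" judgment concrete is an empirical p-value: the fraction of shuffled datasets whose slope is at least as extreme as the observed one. A self-contained sketch, again on invented data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with a real trend of 0.3 (both invented for this sketch).
x = np.arange(50, dtype=float)
y = 0.3 * x + rng.normal(0, 4, size=50)

observed = np.polyfit(x, y, deg=1)[0]
null_slopes = np.array(
    [np.polyfit(x, rng.permutation(y), deg=1)[0] for _ in range(2000)]
)

# Two-sided empirical p-value: how often does pure shuffling produce a
# slope as far from zero as the one we actually observed?
p = np.mean(np.abs(null_slopes) >= abs(observed))
print(p)
```

A tiny p means the real slope sits far outside the range of shuffled slopes — good evidence of an association; a large p means it blends right in, and we shouldn't bet the company on it.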