Notes for Monday, March 20

Our goal is to interactively optimize a linear regression that predicts the 2015 PISA math results from four country-level statistics: per capita GDP from the IMF, income inequality from the World Bank, the Human Development Index from the UN, and education spending as a percentage of GDP from the World Bank. I collected a tab-delimited data file for each of these statistics by copying and pasting HTML tables from Wikipedia into text files.
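As a rough sketch of how tab-delimited files like these might be read and lined up by country, here is a small Python example. The column layout, the country names, and every number in it are placeholders invented for illustration, not values from the actual files, and the helper names read_table and join_on_country are hypothetical.

```python
import csv
import io

# Placeholder tab-delimited text standing in for two of the pasted tables.
# The column layout and every value here are made up for illustration.
GDP_TSV = "Country\tGDP per capita\nAlphaland\t40000\nBetaland\t12000\nGammaland\t25000\n"
PISA_TSV = "Country\tMath score\nAlphaland\t510\nGammaland\t480\nDeltaland\t430\n"


def read_table(text):
    """Parse two-column tab-delimited text into a {country: value} dict."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    next(rows)  # skip the header row
    return {name.strip(): float(value) for name, value in rows}


def join_on_country(*tables):
    """Keep only the countries that appear in every table (an inner join)."""
    shared = set(tables[0]).intersection(*tables[1:])
    return {country: [t[country] for t in tables] for country in sorted(shared)}


gdp = read_table(GDP_TSV)
pisa = read_table(PISA_TSV)
merged = join_on_country(gdp, pisa)
print(merged)
```

With real files, io.StringIO(text) would be replaced by an opened file, and the join would take all five tables. Countries missing from any one source drop out of the merged result, so it is worth checking how many rows survive.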

Use the sliders to set the regression parameters. The best I've been able to get is a squared error of around 84,000.
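To make the number being minimized concrete, here is a minimal sketch of how a squared error could be computed for one setting of the sliders. It assumes the error is the sum of squared differences between predicted and actual scores (the notes do not say whether it is summed or averaged), and it uses made-up placeholder rows rather than the real country data. The function names are hypothetical.

```python
def predict(intercept, weights, features):
    """Linear model: intercept plus the weighted sum of the features."""
    return intercept + sum(w * x for w, x in zip(weights, features))


def squared_error(intercept, weights, rows, targets):
    """Sum of squared residuals over all rows (assumed definition)."""
    return sum(
        (predict(intercept, weights, features) - target) ** 2
        for features, target in zip(rows, targets)
    )


# Placeholder data: three rows with four features each, mirroring the
# four statistics in the notes. These values are invented for illustration.
rows = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 2.0, 0.0, 0.0],
]
targets = [6.0, 5.0, 3.0]

# One "slider setting": an intercept and one weight per feature.
error = squared_error(0.0, [1.0, 1.0, 1.0, 1.0], rows, targets)
print(error)
```

Moving a slider corresponds to changing the intercept or one entry of the weights and recomputing this value; a lower result means a better fit.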

We encountered several problems while integrating these data sources, all of them typical of the curation issues that come up with real data. Here's a summary: