Notes for Monday, March 20
Our goal is to interactively optimize a linear regression that predicts the
2015 PISA math results
based on a set of four country-level statistics:
per capita GDP from the IMF,
income inequality from the World Bank,
the Human Development Index from the UN,
and education spending as a percentage of GDP from the World Bank.
I collected tab-delimited data files for each of these statistics by copying and pasting HTML tables from Wikipedia into text files.
Use the sliders to set regression parameters. The best I've been able to get is around 84k squared error.
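As a rough sketch of what the sliders are optimizing: each slider sets one regression parameter, and the score is the squared difference between predicted and actual results. The names `predict`, `squaredError`, and `sampleRows` are hypothetical, the numbers are made up for illustration, and summing the errors (rather than averaging them) is an assumption about how the page computes its score.

```javascript
// Made-up rows for illustration only; these are not values from the data files.
const sampleRows = [
  { features: [1.0, 2.0], target: 6.0 },
  { features: [2.0, 0.0], target: 3.0 },
  { features: [0.0, 1.0], target: 3.0 },
];

// Linear model: intercept plus the weighted sum of the features.
function predict(weights, intercept, features) {
  return features.reduce((sum, x, i) => sum + weights[i] * x, intercept);
}

// Sum of squared residuals over all rows (assumed scoring rule).
function squaredError(weights, intercept, data) {
  return data.reduce((total, row) => {
    const residual = predict(weights, intercept, row.features) - row.target;
    return total + residual * residual;
  }, 0);
}

// Moving a "slider" means changing one weight or the intercept and rescoring.
console.log(squaredError([0, 0], 0, sampleRows)); // all parameters at zero
console.log(squaredError([1, 2], 1, sampleRows)); // parameters that fit exactly
```

In the real setting there would be four weights, one per country-level statistic, plus an intercept.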
We encountered several problems while integrating these data sources, all of them typical of real-world data curation. Here's a summary:
- Identifiers: the country name field has a leading space in many cases. Calling .trim() on strings is a good idea.
- Identifiers: The GDP file listed South Korea as "Korea, South". We decided to fix this by editing the data file. Macedonia also appeared in different forms.
- Missing values: Kosovo and Montenegro were missing from some files. Macau and Taiwan were also incomplete. We decided to drop these.
- Number formats: The GDP file included commas for human readability. We used a regular expression to remove them.
- Scale: per capita GDP has large variance and is on a completely different scale from all other variables. We substituted the log of this value.
- Data provenance: Do we really trust these variables?
- Politics: How accurate are these tests? Are test-takers a representative sample?
- Sample size: outliers tend to be small islands (Singapore, the Dominican Republic). If we picked comparably sized regions from larger countries (e.g., Massachusetts), would we get more variability?
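The mechanical fixes from the list above (trimming names, reconciling name variants, stripping commas, dropping incomplete countries, and taking the log of GDP) could be sketched roughly as follows. This is not the code we used in class: all function names, the `ALIASES` table, and the sample values are hypothetical. In class we fixed the South Korea name by editing the data file, so the alias table is a swapped-in alternative. Using the natural log is also an assumption.

```javascript
// Hypothetical alias table: maps variant spellings to one canonical name.
const ALIASES = new Map([["Korea, South", "South Korea"]]);

// Trim stray whitespace, then map known variants to the canonical name.
function cleanName(raw) {
  const name = raw.trim();
  return ALIASES.get(name) || name;
}

// Remove thousands separators with a regular expression before parsing.
function parseNumber(raw) {
  return Number(raw.trim().replace(/,/g, ""));
}

// Parse tab-delimited "country<TAB>value" lines into a Map of name -> number.
function parseTable(text) {
  const table = new Map();
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const [name, value] = line.split("\t");
    table.set(cleanName(name), parseNumber(value));
  }
  return table;
}

// Keep only countries that appear in every table; the rest are dropped.
function joinTables(tables) {
  const [first, ...rest] = tables;
  const joined = [];
  for (const name of first.keys()) {
    if (rest.every((t) => t.has(name))) {
      joined.push({ name, values: tables.map((t) => t.get(name)) });
    }
  }
  return joined;
}

// Made-up sample values, for illustration only.
const gdp = parseTable(" Korea, South\t30,000\nKosovo\t4,000\n");
const scores = parseTable("South Korea\t500\n");

const cleaned = joinTables([gdp, scores]).map((r) => ({
  name: r.name,
  logGdp: Math.log(r.values[0]), // natural log compresses the GDP scale
  score: r.values[1],
}));

console.log(cleaned);
```

In this sample, Kosovo is dropped because it is missing from the score table, and the South Korea row survives because its leading space and variant name are normalized before the join.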