Run the scripts in the /scripts directory, in numerical order. Those scripts, plus the data files in this repository (originally from Kaggle; you can clone the repo, download them via the browser, or get them from Kaggle directly), will get you up and running with predictions. Below are explanations of the different scripts and the reasoning behind them.
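If you'd rather kick everything off from a single R session, a loop like this works (a minimal sketch; it assumes the numeric filename prefixes sort the scripts into the right order):

```r
# Source every script in /scripts in numerical order.
# Assumes the file names sort correctly (01_..., 02_..., etc.).
scripts <- sort(list.files("scripts", pattern = "\\.R$", full.names = TRUE))
invisible(lapply(scripts, source))
```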
There's a lot of released data on Kaggle that you could use. I kept my parameters simple. I included:
- Ken Pomeroy's team rankings - a well-regarded rating system of team strength
- Homefield advantage - teams do better at home, and while tournament games are all technically held on neutral ground, most regular season games (which make up the training set) have a host and a visitor
- Preseason ranking - some article I read (NY Times?) convinced me that, independent of a team's ranking at tournament time (Pomeroy), its preseason ranking has some predictive value for tournament success. Also, at least in 2016 this data wasn't supplied by Kaggle, nor was it available in a flat file on the web. I hoped this unique data source would differentiate my model's predictions. Luck is a huge factor in this tournament, so some differentiation would set me up to get lucky.
Unfortunately, the preseason rankings have not made a big difference in my predictions.
There is a leakage issue with the Pomeroy ratings: the historical ratings are end-of-season ratings, so they incorporate the results of the very tournament games I'm using them to predict. I don't think this is a big deal, and I do not address it.
Scripts 01 and 02
I pseudo-scraped Ken Pomeroy's men's basketball ratings data, going back to 2002: I mirrored his data in a Google Sheets document, then used the googlesheets package to extract and tidy it. My short googlesheets-accessing script pulls the ratings for each year and combines them into a single tidy table.
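The extraction looks roughly like the sketch below. This is not the actual script: the sheet title, the one-worksheet-per-season layout, and the year range are all assumptions.

```r
library(googlesheets)
library(dplyr)
library(purrr)

# Register the mirrored sheet by title (the title here is hypothetical)
pomeroy_ss <- gs_title("pomeroy_ratings_mirror")

# Read one worksheet per season and stack them into a single tidy table
pomeroy <- map_dfr(2002:2017, function(yr) {
  gs_read(pomeroy_ss, ws = as.character(yr)) %>%
    mutate(season = yr)
})
```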
I used the rvest package to scrape the preseason rankings from the College Poll Archive. Here are the preseason rankings for 2018, conveniently in a scrapeable HTML table - so let's scrape them for each year.
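A per-year scrape might look like this sketch; the URL pattern and the table's position on the page are assumptions you'd verify against the live site.

```r
library(rvest)
library(dplyr)
library(purrr)

# Scrape one season's preseason poll from the College Poll Archive.
# The query-string format below is hypothetical - check the real URLs.
get_preseason <- function(yr) {
  url <- paste0("http://collegepollarchive.com/mbasketball/ap/seasons.cfm?seasonid=", yr)
  read_html(url) %>%
    html_node("table") %>%        # assumes the rankings are the page's first table
    html_table(fill = TRUE) %>%
    mutate(season = yr)
}

preseason <- map_dfr(2002:2018, get_preseason)
```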
Script 03
There are lots of data sources to unite and transform. This script takes the various inputs (Kaggle's data on 150k+ past game outcomes, the scraped data, a names crosswalk) and transforms them into tidy data for modeling.
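In spirit, the joins look something like this; every column name below is an illustrative placeholder, not the repo's actual schema.

```r
library(dplyr)

# Attach each rating source to the game results via the names crosswalk.
# games, crosswalk, pomeroy, and preseason are placeholder data frames.
games_tidy <- games %>%
  left_join(crosswalk, by = "team_name") %>%             # map names to IDs
  left_join(pomeroy,   by = c("team_id", "season")) %>%  # Pomeroy ratings
  left_join(preseason, by = c("team_id", "season"))      # preseason ranks
```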
Script 04
I know a lot about tidying dirty data. I don't know much about machine learning, which means this is the part where I learn the most :) In script 04, we train various models on the input data, test the results, and select the model we'll use to make our predictions.
You may wish to tweak this part, and I'd particularly welcome feedback and ideas (and pull requests?) here.
It's worth noting that Michael Lopez and Gregory Matthews won this competition in 2014 with a logistic regression model (I think) using Ken Pomeroy's ratings data. Theirs was probably a bit better than this one, but it's directionally close.
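For a feel of that baseline, here is a hedged sketch: a logistic regression on the rating features, scored with log loss (the competition's metric). The column names are placeholders for whatever script 03 produces.

```r
# Baseline model: logistic regression of game outcome on rating features.
# win, team_rating, opp_rating, and home are placeholder column names.
fit <- glm(win ~ team_rating + opp_rating + home,
           data = train_games, family = binomial)

# Score held-out games with log loss, the contest's evaluation metric
pred <- predict(fit, newdata = test_games, type = "response")
log_loss <- -mean(test_games$win * log(pred) +
                    (1 - test_games$win) * log(1 - pred))
```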
Script 05
Pretty simple: once you have your model trained, you'll need to download the Kaggle file of games to predict. In Part 1, it's tourney games from 2014-2017; in the real contest, Part 2, it's every possible game this year. This script takes the Kaggle file and makes predictions in a ready-to-submit format. (You'll need to have run parts of script 03_tidy_raw_data.R first so that its functions and a data.frame are available.)
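Mechanically, that amounts to filling in the Pred column of Kaggle's sample submission, roughly as below. The file path and predict_game() are assumptions; the latter stands in for whatever prediction helper the earlier scripts define.

```r
library(dplyr)
library(readr)
library(tidyr)

# Kaggle IDs look like "2017_1234_1235": season, then the two team IDs
submission <- read_csv("data/sample_submission.csv") %>%   # path is hypothetical
  separate(ID, into = c("season", "team1", "team2"),
           sep = "_", remove = FALSE, convert = TRUE) %>%
  mutate(Pred = predict_game(season, team1, team2)) %>%    # hypothetical helper
  select(ID, Pred)

write_csv(submission, "submission.csv")
```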