Spend time in Kaggle competitions more efficiently

Posted on 11th Dec, 2015

Question for those who, like me, enjoy participating in Kaggle competitions: Where do you spend most of your time?

For me a lot of time goes into plumbing code. Code that is required to tie all the interesting bits of logic together. Code that does not add anything to the actual solution, but is required to have a working solution.

The first few competitions I started like this:

  1. Start a Python script that reads in the data set from the Kaggle competition.
  2. Write some code to extract interesting features from the data.
  3. Set up cross validation to estimate performance on unseen data.
  4. Train a model with general parameters.
  5. Test performance using cross validation.
  6. Repeat with different models, parameters and features.

In practice this meant that there was a "master script" that had the most current implementation, and a couple of copies with variations. This workflow becomes messy really fast. Imagine a folder with 30 scripts where some have the "new" features and some don't. Some are still a work in progress from some time ago, but forgotten.

There must be a better solution for this! A more organized way of working. A different way to tackle the problem. It came together when I started thinking in terms of flexible and fixed.

What is flexible in the steps above? How the raw data is read in and processed into an interesting feature vector. Which model is trained, and with which parameters. What is fixed? The fact that there are features and that there is a model. The error metric that defines performance. The use of cross validation to estimate performance on unseen data.

So I wrote a Python class that accepts two parameters: the path to a NumPy .npz file holding the feature and target vectors, and a model that behaves like a scikit-learn model (it has fit() and predict() methods). The class reads in the feature and target vectors, splits them into folds, trains the model on the training folds, tests it, and returns the error on the test folds. This saves me a lot of time. Time I can spend on thinking about new features and testing different models.
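A class along these lines could look roughly like the sketch below. It is an illustration under assumptions, not the original code: the names CrossValidator, MeanModel and rmse, the array keys "X" and "y" inside the .npz file, the extra n_folds and seed parameters, and the choice of RMSE as the error metric are all made up for the example. The folds are built with plain NumPy so the sketch has no dependency beyond it.

```python
import copy
import os
import tempfile

import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error; a stand-in for the competition's metric."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


class CrossValidator:
    """Fixed part of the workflow: load, split into folds, train, score."""

    def __init__(self, data_path, model, n_folds=5, seed=0):
        self.data_path = data_path  # .npz file with arrays "X" and "y" (assumed keys)
        self.model = model          # anything with fit(X, y) and predict(X)
        self.n_folds = n_folds      # assumed to be at least 2
        self.seed = seed

    def run(self):
        # Read both arrays into memory, then close the file.
        with np.load(self.data_path) as data:
            X, y = data["X"], data["y"]

        # Shuffle the row indices once and cut them into n_folds parts.
        rng = np.random.default_rng(self.seed)
        folds = np.array_split(rng.permutation(len(y)), self.n_folds)

        errors = []
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate(
                [fold for j, fold in enumerate(folds) if j != i]
            )
            # Train a fresh copy per fold so no state leaks between folds.
            model = copy.deepcopy(self.model)
            model.fit(X[train_idx], y[train_idx])
            errors.append(rmse(y[test_idx], model.predict(X[test_idx])))
        return errors


class MeanModel:
    """Trivial baseline that always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


if __name__ == "__main__":
    # Flexible part: build features however you like, then save them.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "features.npz")
        np.savez(path, X=X, y=y)
        errors = CrossValidator(path, MeanModel()).run()

    print("RMSE per fold:", np.round(errors, 3))
    print("Mean RMSE:", round(float(np.mean(errors)), 3))
```

With this split, trying a new idea means either writing a new .npz file (new features) or passing a different model object, for example a scikit-learn estimator in place of MeanModel. The loading, fold splitting and scoring stay in one place instead of being copied into every script.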

Sometimes the simple ideas are the ones that help you the most.