Using Luigi by Spotify to manage dependencies between batch jobs

Posted on 18th Jan, 2015

When building pipelines that play with data, for instance to update your daily forecast or to transform your normalized tables into something a cube likes, you will find yourself trying to keep multiple batch jobs in sync. The Luigi framework by Spotify helps with dependencies between jobs, moves data around, and gives you a nice web interface to see the status of your system.

Unfortunately, the documentation of the framework is not very clear on how to fully install it, so let's start with that. Luigi can be installed with pip or built from source (for the latter you'll need git, gcc and the Python header files; also, it seems Python 2.7 is the way to go):

sudo apt-get install git gcc python-dev
git clone https://github.com/spotify/luigi.git
cd luigi
python setup.py install

This is not all: you need the central scheduler running for full functionality, such as the web interface. (To run tasks without the central scheduler, pass the option --local-scheduler on the command line when calling the task.) To install the libraries the central scheduler needs:

sudo pip install docutils python-daemon

To get the central scheduler running, the Luigi documentation recommends running the following (remove the line break):

PYTHONPATH=. python bin/luigid --background --pidfile <PATH_TO_PIDFILE> 
--logdir <PATH_TO_LOGDIR> --state-path <PATH_TO_STATEFILE>

To just test the central scheduler you can also simply run the command luigid. Open your web browser and browse to localhost:8082 if you installed Luigi on your own computer, or replace localhost with the server address. You should see a screen similar to the following screenshot:
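If you want to check from a script whether the scheduler is reachable, a small standard-library helper like the following does the job. This is my own sketch, not part of Luigi; the URL default is just the scheduler address mentioned above:

```python
from urllib.request import urlopen
from urllib.error import URLError


def scheduler_up(url="http://localhost:8082", timeout=2.0):
    """Return True if something answers with HTTP 200 at `url`.

    Intended for the Luigi central scheduler's web interface, but it is
    really just a generic reachability check.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, ... -> not reachable.
        return False
```

Handy as a sanity check in deployment scripts before kicking off a batch run.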

An empty task list

Now for the fun part: let's test two dependent tasks! I have an example Python script here, and an example dataset here. Change the example_dir in the script to match the location where you put the files, and run it with python luigi_example.py SummarizeTask. If all went fine, you should have two new files in your directory: output.txt, a copy of testdata.csv, and output2.txt, containing the average value of the first column of the dataset, a count of the rows, and the directory the files are in.

In the web interface, the dependency graph looks like this while the tasks are running:

Running and pending task

And like this when the tasks have finished:

All tasks finished

I plan to play more with this framework, so expect more complex examples soon.