Using Luigi by Spotify to manage dependencies between batch jobs
When building pipelines that process data, for instance to update your daily forecast or to transform your normalized tables into something a cube likes, you will find yourself trying to keep multiple batch jobs in sync. The Luigi framework by Spotify helps with dependencies between jobs, moves data around, and gives you a nice web interface to see the status of your system.
Unfortunately, the documentation of the framework is not very clear on how to fully install it. So let's start with that. Luigi can be installed with pip, or by building it from source (you'll need git, gcc and the Python header files; also, it seems Python 2.7 is the way to go):
sudo apt-get install git gcc python-dev
git clone https://github.com/spotify/luigi.git
cd luigi
python setup.py install
This is not all: the central scheduler, which you need for full functionality such as the web interface, requires some extra libraries. (To run tasks without the central scheduler, pass the option --local-scheduler at the command line when calling the task.) To install the libraries the central scheduler needs:
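To make the difference concrete: running a task with or without the central scheduler only differs by one flag. Assuming a task file called luigi_example.py containing a task named SummarizeTask (the names used later in this post), the two invocations would look like this:

```shell
# With the central scheduler (a luigid process must be running):
python luigi_example.py SummarizeTask

# Without the central scheduler (no luigid needed, handy for quick tests):
python luigi_example.py SummarizeTask --local-scheduler
```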
sudo pip install docutils python-daemon
To get the central scheduler running, the Luigi documentation recommends running the following (remove the line break):
PYTHONPATH=. python bin/luigid --background --pidfile <PATH_TO_PIDFILE> --logdir <PATH_TO_LOGDIR> --state-path <PATH_TO_STATEFILE>
To just test the central scheduler, you can also simply run the command luigid. Open your web browser and browse to localhost:8082 if you installed Luigi on your own computer, or replace localhost with the server address. You should see a screen similar to the following screenshot:
Now for the fun part: let's test two dependent tasks! I have an example Python script here, and an example dataset here. Change the example_dir in the script to match the location where you put the files, and run it with python luigi_example.py SummarizeTask. If all went well, you should have two new files in your directory: output.txt, a copy of testdata.csv, and output2.txt, containing the average value of the first column of the dataset, a count of rows, and the directory the files are in.
In the web interface, the dependency graph looks like this when the tasks are running:
And like this when the tasks have finished:
I plan to play more with this framework, so expect more complex examples soon.