spotify / luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

License: Apache License 2.0

Python 90.42% JavaScript 7.41% HTML 1.81% Shell 0.12% CSS 0.23%
python luigi orchestration-framework scheduling hadoop

luigi's Introduction


Luigi is a Python (3.6, 3.7, 3.8, 3.9, 3.10, 3.11 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Getting Started

Run pip install luigi to install the latest stable version from PyPI. Documentation for the latest release is hosted on readthedocs.

Run pip install luigi[toml] to install Luigi with TOML-based configs support.

For the bleeding edge code, pip install git+https://github.com/spotify/luigi.git. Bleeding edge documentation is also available.

Background

The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else.

There are other software packages that focus on lower level aspects of data processing, like Hive, Pig, or Cascading. Luigi is not a framework to replace these. Instead it helps you stitch many tasks together, where each task can be a Hive query, a Hadoop job in Java, a Spark job in Scala or Python, a Python snippet, dumping a table from a database, or anything else. It's easy to build up long-running pipelines that comprise thousands of tasks and take days or weeks to complete. Luigi takes care of a lot of the workflow management so that you can focus on the tasks themselves and their dependencies.

You can build pretty much any task you want, but Luigi also comes with a toolbox of several common task templates that you can use. It includes support for running Python mapreduce jobs in Hadoop, as well as Hive and Pig jobs. It also comes with file system abstractions for HDFS and local files that ensure all file system operations are atomic. This is important because it means your data pipeline will not crash in a state containing partial data.
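
To make this concrete, here is a minimal, hedged sketch of a two-step pipeline (not taken from the Luigi docs; the task names and file paths are made up for illustration):

import luigi

class Extract(luigi.Task):
    """Write some raw data to a local file."""

    def output(self):
        return luigi.LocalTarget('/tmp/raw.txt')        # illustrative path

    def run(self):
        # opening a target for writing goes through a temporary file,
        # so the output only appears once it is complete
        with self.output().open('w') as f:
            f.write('hello\nworld\n')

class CountLines(luigi.Task):
    """Count the lines produced by Extract."""

    def requires(self):
        return Extract()                                # declares the dependency

    def output(self):
        return luigi.LocalTarget('/tmp/line_count.txt')

    def run(self):
        with self.input().open('r') as f:
            n = sum(1 for _ in f)
        with self.output().open('w') as f:
            f.write('%d\n' % n)

if __name__ == '__main__':
    luigi.run()   # e.g. python pipeline.py CountLines --local-scheduler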

Visualiser page

The Luigi server comes with a web interface too, so you can search and filter among all your tasks.

Dependency graph example

Just to give you an idea of what Luigi does, this is a screen shot from something we are running in production. Using Luigi's visualiser, we get a nice visual overview of the dependency graph of the workflow. Each node represents a task which has to be run. Green tasks are already completed whereas yellow tasks are yet to be run. Most of these tasks are Hadoop jobs, but there are also some things that run locally and build up data files.

Philosophy

Conceptually, Luigi is similar to GNU Make where you have certain tasks and these tasks in turn may have dependencies on other tasks. There are also some similarities to Oozie and Azkaban. One major difference is that Luigi is not just built specifically for Hadoop, and it's easy to extend it with other kinds of tasks.

Everything in Luigi is in Python. Instead of XML configuration or similar external data files, the dependency graph is specified within Python. This makes it easy to build up complex dependency graphs of tasks, where the dependencies can involve date algebra or recursive references to other versions of the same task. However, the workflow can trigger things not in Python, such as running Pig scripts or scp'ing files.
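
As a hedged illustration of what "date algebra or recursive references to other versions of the same task" can look like, this sketch (not from the docs; names, dates, and paths are illustrative) defines a daily task that requires the previous day's instance of itself:

import datetime

import luigi

class DailyReport(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # date algebra: each day's report depends on the previous day's,
        # stopping at an arbitrary start date
        if self.date > datetime.date(2013, 1, 1):
            return DailyReport(self.date - datetime.timedelta(days=1))
        return []

    def output(self):
        return luigi.LocalTarget(self.date.strftime('/tmp/report-%Y-%m-%d.txt'))

    def run(self):
        with self.output().open('w') as f:
            f.write('report for %s\n' % self.date)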

Who uses Luigi?

We use Luigi internally at Spotify to run thousands of tasks every day, organized in complex dependency graphs. Most of these tasks are Hadoop jobs. Luigi provides an infrastructure that powers all kinds of stuff including recommendations, toplists, A/B test analysis, external reports, internal dashboards, etc.

Since Luigi is open source and without any registration walls, the exact number of Luigi users is unknown. But based on the number of unique contributors, we expect hundreds of enterprises to use it. Some users have written blog posts or held presentations about Luigi:

Some more companies are using Luigi but haven't had a chance yet to write about it:

We're more than happy to have your company added here. Just send a PR on GitHub.

External links

Authors

Luigi was built at Spotify, mainly by Erik Bernhardsson and Elias Freider. Many other people have contributed since open sourcing in late 2012. Arash Rouhani was the chief maintainer from 2015 to 2019, and now Spotify's Data Team maintains Luigi.


luigi's Issues

Make logging configurable

Currently the configuration of the logger is done by interface.setup_interface_logging using a hard-coded format. It would be nice if an option was added so that luigi didn't configure the logger and left it up to the user, or accepted a configuration file instead of using a hard-coded format. It seems like the easiest thing to do would be to add a flag that could be set before the workflow runs, but it would be nicer if it all came from a config file.
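
A minimal sketch of the flag idea, assuming a hypothetical module-level switch (the names below are illustrative, not luigi's actual interface):

import logging
import logging.config

configure_logging = True   # a user could set this to False before running the workflow

def setup_interface_logging(conf_file=None):
    if not configure_logging:
        return                                    # leave logging entirely to the user
    if conf_file:
        logging.config.fileConfig(conf_file)      # honour a user-supplied logging config file
    else:
        logging.basicConfig(
            level=logging.DEBUG,
            format='%(levelname)s: %(message)s')  # stand-in for the current hard-coded format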

HiveQueryTask doesn't use output/isn't atomic

HiveQueryTask assumes that the user will enter an output path/partition/table directly in the query. If you specify an output() anyway (which you still want to do, to help the scheduler) and then reference the same output in the Hive query, the run operation will not be atomic, since Hive will write directly to that directory. Since we want to keep as many operations atomic as possible, I think HiveQueryTask should by default add a temporary output path to the query and then move the result into the final destination once the run has completed successfully.
That way we can get much cleaner queries as well, especially if we do output-type-aware insertions, i.e. INSERT OVERWRITE DIRECTORY if the output is an HdfsTarget, INSERT OVERWRITE LOCAL DIRECTORY if it's a LocalTarget, and table/partition creation if the output is a HiveTableTarget or HivePartitionTarget. I think that would create a much smoother workflow for scheduled Hive queries.
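
A hedged sketch of the write-then-promote pattern being proposed, demonstrated with a plain local directory and os.rename (atomic within one filesystem); the real fix would do the equivalent INSERT into a temporary location followed by an HDFS rename, and run_query below is just a placeholder:

import os
import random

def run_query_atomically(final_path, run_query):
    """run_query(tmp_path) must write its results under tmp_path."""
    tmp_path = '%s-luigi-tmp-%09d' % (final_path, random.randrange(0, 10 ** 9))
    # 1. write results to a temporary location first...
    run_query(tmp_path)              # e.g. INSERT OVERWRITE DIRECTORY '<tmp_path>' <query>
    # 2. ...and only promote them to the final destination on success
    os.rename(tmp_path, final_path)

if __name__ == '__main__':
    def fake_query(tmp_path):        # stand-in for actually running Hive
        os.makedirs(tmp_path)
        with open(os.path.join(tmp_path, 'part-00000'), 'w') as f:
            f.write('result rows\n')

    run_query_atomically('/tmp/hive-output', fake_query)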

support for per-task emails

We'd like to have per-task emails, so that in addition to the global notification list, we can directly email individuals that "own" a particular task. We can do this by subclassing or by baking it into the base Task and the workers. The latter seems nicer, but what do you think?
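
As a rough sketch of the subclassing option, a base class could carry an owner list and hook into the task failure callback; the exact send_email signature below is an assumption, so treat this as an outline rather than working luigi code:

import traceback

import luigi
from luigi import notifications

class OwnedTask(luigi.Task):
    """Subclasses set owner_emails to notify individual owners directly."""

    owner_emails = ()   # e.g. ('someone@example.com',) -- illustrative

    def on_failure(self, exception):
        msg = ''.join(traceback.format_exception_only(type(exception), exception))
        for addr in self.owner_emails:
            # assumed argument order: subject, message, sender, recipients
            notifications.send_email(
                'Luigi task failed: %r' % self, msg, 'luigi@localhost', (addr,))
        return msg       # still included in the global notification email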

KeyError with new visualizer

I don't fully understand the cause, but I think that the internal graph state of the luigi scheduler can get into a bad state (where certain tasks are unknown). Here's an example stack trace:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/tornado/web.py", line 987, in _execute
    getattr(self, self.request.method.lower())(*args, **kwargs)
  File "/data/.../luigi/luigi/server.py", line 51, in get
    result = getattr(self._api, method)(**arguments)
  File "/data/.../luigi/luigi/rpc.py", line 117, in dep_graph
    return self._scheduler.dep_graph(task_id)
  File "/data/.../luigi/luigi/scheduler.py", line 279, in dep_graph
    self._recurse_deps(task_id, serialized)
  File "/data/.../luigi/luigi/scheduler.py", line 273, in _recurse_deps
    self._recurse_deps(dep, serialized)
  File "/data/.../luigi/luigi/scheduler.py", line 273, in _recurse_deps
    self._recurse_deps(dep, serialized)
  File "/data/.../luigi/luigi/scheduler.py", line 269, in _recurse_deps
    task = self._tasks[task_id]
KeyError: u'TablePlusPartition...

In addition to the error above, the visualizer doesn't give you any feedback that there was an internal server error; the page just appears to never load.

luigi Task class and abc

We have a bunch of abstract base classes for Tasks that wrap up various repeated logic, but need concrete subclasses. Currently, for methods that are "abstract" we just raise a NotImplementedError. But for a number of reasons, it'd be much nicer if we could use ABCMeta.

Could we provide a variant of Register that extends ABCMeta? Or just have Register extend it in the first place? I'm a little out of my element here with Python metaclasses, so I'm wondering if you have some guidance.
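
A hedged sketch of the Register-plus-ABCMeta idea (this assumes the Task metaclass is luigi.task.Register and uses Python 3 class syntax; it is an experiment, not a tested recipe):

import abc

import luigi
import luigi.task

class AbstractRegister(luigi.task.Register, abc.ABCMeta):
    """A Register variant that also understands @abstractmethod."""

class AbstractTask(luigi.Task, metaclass=AbstractRegister):
    @abc.abstractmethod
    def transform(self, record):
        """Concrete subclasses must implement this."""

class ConcreteTask(AbstractTask):
    def transform(self, record):
        return record.upper()

# Instantiating AbstractTask() should now raise TypeError, while
# ConcreteTask() works as a normal task.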

top_artists example inconsistent with observed output

I need to write a Task which reads a hdfs.HdfsTarget and writes a LocalTarget. The top_artists.py example seems to do this, but when I try to run it an exception is raised which tells me that the HdfsTarget is a directory. When I look at it in hdfs, it is in fact a directory containing part files (e.g. part-00000) and a file called _SUCCESS.

Checking downstream "dependencies".

I have played around with the Luigi framework and see that it is great at checking "upstream" dependencies: if I run job A, Luigi will check to make sure all the other jobs that A depends on are current. The visualization tool also displays this dependency graph.

Is there a method or visualization command where I can check for downstream dependencies? That is, for task A, which other tasks downstream depend on task A? This is useful for determining the effect and impact of a task on other tasks.

Enable overriding of retry_delay when running stuff manually

Currently if a task fails you need to wait 15 minutes (the default retry_delay) before the central scheduler will let the same or another worker run the task. This is to prevent triggering too many tasks that are failing anyway. However, when running things manually this can be quite annoying, and you can't always run with --local-scheduler (if, for example, you have multiple overlapping pipelines).

I suggest adding a --force-retry flag or similar that allows you to override the retry_delay from the client side (but it should only override failed retry, not run the task if some other worker is currently running it, which is already possible through --central-scheduler).

Additionally, I suggest that tasks that are killed by user interrupts (KeyboardInterrupt) should send some kind of message to the scheduler when shutting down, notifying it to not set a retry_delay, since it's probably undesirable to have aborted manual runs prevent scheduled ones.

Can exception handling be extensible?

When an exception is thrown by a task, the Worker class catches it and calls the send_email function in the notifications module. It would be great if the notification action were extensible so that a user could add their own implementation of what to do when an exception is thrown. That way other services such as Airbrake and/or PagerDuty could be supported without adding this code to the base project. In the current design I don't see a nice way to do this, though, so I was hoping someone else might have some thoughts on how it could be implemented.
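
One hedged way to sketch this is a small registry of failure handlers that the worker could call instead of (or in addition to) send_email; everything below is hypothetical and not part of luigi:

_failure_handlers = []

def register_failure_handler(fn):
    """Register a callable(task, exception) to run when a task fails."""
    _failure_handlers.append(fn)
    return fn

def notify_failure(task, exception):
    """What the worker would call when a task raises an exception."""
    for handler in _failure_handlers:
        try:
            handler(task, exception)
        except Exception:
            pass   # a broken notifier shouldn't mask the original failure

@register_failure_handler
def page_oncall(task, exception):
    # e.g. call out to PagerDuty/Airbrake here
    print('would page on-call about %r: %s' % (task, exception))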

Deprecate global parameters?

Is anyone using them? I'm not sure it was a good design decision to have them, and they have led to a lot of headaches for me at least. Is it OK if we deprecate them with the goal of removing them down the road? I wanted to hear what you think.

"Interface" tasks

I have a lot of examples of tasks being used as helper classes, or "interfaces", or "online" classes, whatever you want to call it.

Basically a Task class that also implements methods that other tasks can query. This works great for some stuff.

import luigi

class MyTask(luigi.Task):
    def run(self):
        self.read_some_super_big_file_and_create_internal_datastructure()
        self.complete = lambda: True

    def complete(self):
        return False

    def lookup(self, x):
        return self.super_big_datastructure[x]

class OtherTask(luigi.Task):
    def requires(self):
        return MyTask()

    def run(self):
        self.requires().lookup(...)

Anyway, a couple of issues:

  • You need to manually set the task to completed
  • These tasks are process-local so they have to re-run in each process

I'm not really sure how to make these work. I was thinking about a stupid helper subclass:

import os

class HelperTask(luigi.Task):
    pid = luigi.Parameter(default=os.getpid())

    def complete(self):
        pass  # do something magical here
But I'm not really sure that solves all problems (it wouldn't work with --workers, for instance, because Luigi forks after the tasks are scheduled).

Any ideas about how this could be achieved? It's very useful for a lot of the stuff I'm working with, but I haven't figured out a clean solution. Right now I just run it with --local-scheduler.

Central scheduler crashes on startup if state file broken

If the state (pickle) file somehow gets broken (e.g. the process gets killed while writing it), the scheduler will crash on the next startup while trying to read it. We should catch such errors and remove the broken state so that whatever process manager is supervising the scheduler can auto-restart it in those cases.
This has only ever happened to me once, so it doesn't seem to be a common error.

task complete with no inputs

The definition for Task.complete looks like:

    def complete(self):
        """
            If the task has any outputs, return true if all outputs exists.
            Otherwise, return whether or not the task has run or not
        """
        outputs = flatten(self.output())
        if len(outputs) == 0:
            # TODO: unclear if tasks without outputs should always run or never run
            warnings.warn("Task %r without outputs has no custom complete() method" % self)
            return False

        for output in outputs:
            if not output.exists():
                return False
        else:
            return True

The docstring doesn't quite match the implementation. We have several tasks that would be useful to run once per day, only if not yet run (e.g. hadoop fsck, cleanup jobs to gc old files, etc.). It might be harder to track, but what do you think about adding that to luigi, or should I save state in HDFS or something like that?
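
For what it's worth, one way to approximate "run once per day" with current luigi is to give the task a date parameter and a marker file as its output (a hedged sketch; paths are illustrative):

import datetime

import luigi

class DailyCleanup(luigi.Task):
    date = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        # the marker file doubles as the completeness flag
        return luigi.LocalTarget(self.date.strftime('/tmp/markers/cleanup-%Y-%m-%d'))

    def run(self):
        # ... do the actual fsck / cleanup work here ...
        with self.output().open('w') as f:
            f.write('done\n')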

Scheduler not working for jobs with many dependencies

The scheduler appears to not be working. I have a task with many dependencies (a little more than two years' worth of data for one Task). The task has been part of the cron pipeline and was working fine. However, in the past week, the scheduler will schedule the base task, successfully run it, and then stop. For example, I get the following:

INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 20
INFO: [pid 6448] Running
INFO: 13/07/31 11:59:27 INFO streaming.StreamJob: map 0% reduce 0%
INFO: 13/07/31 12:00:09 INFO streaming.StreamJob: map 7% reduce 0%
...
INFO: 13/07/31 12:19:21 INFO streaming.StreamJob: map 100% reduce 100%
INFO: 13/07/31 12:19:55 INFO streaming.StreamJob: Job complete
INFO: [pid 6448] Done
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time

Scheduler colors are bad for colorblindness

It's hard to tell the difference between pending, failed, and complete tasks, especially with a bunch of them on top of each other.

With deuteranopia (common): (screenshot omitted)

and protanopia (rarer): (screenshot omitted)

Maybe add check marks to done, dashes to pending, and x's to failed?

Cannot pass BooleanParameter in as argument from commandline

With a very simple class defined as:

import luigi
class TestBooleanParameter(luigi.Task):
    switch = luigi.BooleanParameter()

    def run(self):
        print self.switch

And run from the commandline as either

python test.py TestBooleanParameter --switch=True

or

python test.py TestBooleanParameter --switch=1

Fails with:

test.py TestBooleanParameter: error: argument --switch: ignored explicit argument 'True'

or, alternatively

test.py TestBooleanParameter: error: argument --switch: ignored explicit argument '1'

Scalding jobs fail with object has no attribute 'reducer'

Runtime error:
Traceback (most recent call last):
File "/usr/lib/python2.6/dist-packages/luigi/worker.py", line 229, in _run_task
task.run()
File "/usr/lib/python2.6/dist-packages/luigi/hadoop.py", line 507, in run
self.job_runner().run_job(self)
File "/usr/lib/python2.6/dist-packages/luigi/scalding.py", line 180, in run_job
arglist += ['-D%s' % c for c in job.jobconfs()]
File "/usr/lib/python2.6/dist-packages/luigi/hadoop.py", line 478, in jobconfs
if self.reducer == NotImplemented:
AttributeError: 'UserVectorArtistAggregate' object has no attribute 'reducer'

Don't use hard-coded tmp_dir, use tempfile.mkdtemp

For your consideration:

Running luigi-based Hadoop jobs on a system with more than one user inevitably causes problems. Instead of using a configurable location for temporary directories, it is best practice to use the built-in facilities; in this case, that means using tempfile.mkdtemp.

diff --git a/luigi/hadoop.py b/luigi/hadoop.py
index de654cd..6b58afc 100644
--- a/luigi/hadoop.py
+++ b/luigi/hadoop.py
@@ -297,10 +297,8 @@ class HadoopJobRunner(JobRunner):
         if runner_path.endswith("pyc"):
             runner_path = runner_path[:-3] + "py"

-        base_tmp_dir = configuration.get_config().get('core', 'tmp-dir', '/tmp/luigi')
-        self.tmp_dir = os.path.join(base_tmp_dir, 'hadoop_job_%016x' % random.getrandbits(64))
+        self.tmp_dir = tempfile.mkdtemp()
         logger.debug("Tmp dir: %s", self.tmp_dir)
-        os.makedirs(self.tmp_dir)

         # build arguments
         map_cmd = 'python mrrunner.py map'
@@ -381,6 +379,7 @@ class HadoopJobRunner(JobRunner):
         self.finish()

     def finish(self):
+        # FIXME: check for isdir?
         if self.tmp_dir and os.path.exists(self.tmp_dir):
             logger.debug('Removing directory %s', self.tmp_dir)
             shutil.rmtree(self.tmp_dir)

'maximum recursion depth exceeded' when running unit tests

I ran unit tests with python test/test.py, and the tests seem to get into an infinite loop:

.mkdir: cannot create directory /tmp: File exists
EException RuntimeError: 'maximum recursion depth exceeded in __subclasscheck__' in <type 'exceptions.AttributeError'> ignored
Exception AttributeError: "'HdfsAtomicWritePipe' object has no attribute '_process'" in <bound method HdfsAtomicWritePipe.__del__ of <luigi.hdfs.HdfsAtomicWritePipe object at 0x26427d0>> ignored

It output the last two messages over and over again until I control-c'd, which showed:

^CException KeyboardInterrupt in <bound method HdfsAtomicWritePipe.__del__ of <luigi.hdfs.HdfsAtomicWritePipe object at 0x26427d0>> ignored

I'm using the CDH hadoop distro (3u3) and python2.6. Are these supported?

BaseHadoopJobTask always turns off reducers for hadoop jar tasks

BaseHadoopJobTask.jobconfs sets "mapred.reduce.tasks=0" if reducer == NotImplemented. Subclasses of HadoopJarJobTask would usually never override reducer, since the reducer is defined in the jar that it executes.

A hacky fix for this would be to set reducer = None in HadoopJarJobTask. Probably better would be to split up BaseHadoopJobTask, with one class providing generic jobconfs and the like, and PythonHadoopJobTask and JavaHadoopJobTask classes that extend it.

Thoughts?
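
A rough sketch of the proposed split, using the class names from this issue (illustrative only, not actual luigi code):

import luigi

class BaseHadoopJobTask(luigi.Task):
    """Generic jobconf handling shared by all Hadoop tasks."""

    def jobconfs(self):
        return []    # only generic confs here, no reducer logic

class PythonHadoopJobTask(BaseHadoopJobTask):
    """Streaming jobs whose mapper/reducer are defined on the class."""

    reducer = NotImplemented

    def jobconfs(self):
        jcs = super(PythonHadoopJobTask, self).jobconfs()
        if self.reducer == NotImplemented:
            jcs.append('mapred.reduce.tasks=0')   # map-only job
        return jcs

class JavaHadoopJobTask(BaseHadoopJobTask):
    """Jar-based jobs: the reducer lives in the jar, so never force it off."""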

python 2.7 incompatibilities

I can run the tests with python 2.6 but not python 2.7. With python 2.7, several tests in _hdfs_tests.py fail:

======================================================================
ERROR: test_with_noclose (_hdfs_test.AtomicHdfsOutputPipeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/joe/Code/luigi/test/_hdfs_test.py", line 53, in test_with_noclose
    self.assertRaises(TestException, foo)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 471, in assertRaises
    callableObj(*args, **kwargs)
  File "/Users/joe/Code/luigi/test/_hdfs_test.py", line 50, in foo
    with hdfs.HdfsAtomicWritePipe(testpath) as fobj:
AttributeError: __enter__

======================================================================
ERROR: test_glob_exists (_hdfs_test.HdfsTargetTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/joe/Code/luigi/test/_hdfs_test.py", line 190, in test_glob_exists
    with t1.open('w') as f:
AttributeError: __enter__

That's just a small sample, but it gives you an idea of the stack trace. My Python fu isn't really that great, so I'm not sure why it works on Python 2.6 at all. Is fixing the 2.7 compatibility just a matter of adding __enter__ and __exit__?

Unable to run hadoop map reduce job using luigi

Unable to get lines from the HDFS input file. Output:

-------------
To: ('[email protected]',)
From: [email protected]
Subject: Luigi: Ngrams(source=/user/root/input.txt, destination=/user/root/test, n_reduce_tasks=10) FAILED
Message:
Hadoop job failed with message: Streaming job failed with exit code 1. Output from tasks below:
---------- http://localhost:50060/tasklog?attemptid=attempt_201310101312_0005_m_000000_0&start=-100000:
Traceback (most recent call last):
  File "mrrunner.py", line 78, in main
    Runner().run(kind, stdin=stdin, stdout=stdout)
  File "mrrunner.py", line 43, in run
    self.job._run_combiner(stdin, stdout)
  File "luigi/hadoop.py", line 727, in _run_combiner
    self.internal_writer(outputs, stdout)
  File "luigi/hadoop.py", line 737, in internal_writer
    for output in outputs:
  File "luigi/hadoop.py", line 698, in _reduce_input
    for key, values in groupby(inputs, itemgetter(0)):
  File "luigi/hadoop.py", line 733, in internal_reader
    yield map(eval, input.split("\t"))
  File "<string>", line 1
    Input File : hdfs://infotrellis:8020/user/root/input.txt
             ^
SyntaxError: invalid syntax

(the same traceback is repeated for the remaining seven task attempts)


    stdout:
    packageJobJar: [/usr/local/lib/python2.7/dist-packages/luigi-1.0.8-py2.7.egg/luigi/mrrunner.py, /tmp/luigi/hadoop_job_2892438f6b46992f/packages.tar, /tmp/luigi/hadoop_job_2892438f6b46992f/job-instance.pickle, /tmp/hadoop-root-tmp/hadoop-unjar424667304590478725/] [] /tmp/streamjob6677409785224790392.jar tmpDir=null



    stderr:
    13/10/10 16:28:53 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/10 16:28:53 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/10 16:28:54 INFO streaming.StreamJob: getLocalDirs(): [/data/1/mapred/local, /data/2/mapred/local, /data/3/mapred/local]
13/10/10 16:28:54 INFO streaming.StreamJob: Running job: job_201310101312_0005
13/10/10 16:28:54 INFO streaming.StreamJob: To kill this job, run:
13/10/10 16:28:54 INFO streaming.StreamJob: /usr/lib/hadoop-0.20-mapreduce//bin/hadoop job  -Dmapred.job.tracker=infotrellis:8021 -kill job_201310101312_0005
13/10/10 16:28:54 INFO streaming.StreamJob: Tracking URL: http://infotrellis:50030/jobdetails.jsp?jobid=job_201310101312_0005
13/10/10 16:28:55 INFO streaming.StreamJob:  map 0%  reduce 0%
13/10/10 16:30:41 INFO streaming.StreamJob:  map 1%  reduce 0%
13/10/10 16:31:11 INFO streaming.StreamJob:  map 2%  reduce 0%
13/10/10 16:31:26 INFO streaming.StreamJob:  map 1%  reduce 0%
13/10/10 16:31:28 INFO streaming.StreamJob:  map 0%  reduce 0%
13/10/10 16:31:47 INFO streaming.StreamJob:  map 1%  reduce 0%
13/10/10 16:31:50 INFO streaming.StreamJob:  map 2%  reduce 0%
13/10/10 16:31:55 INFO streaming.StreamJob:  map 0%  reduce 0%
13/10/10 16:32:13 INFO streaming.StreamJob:  map 1%  reduce 0%
13/10/10 16:32:16 INFO streaming.StreamJob:  map 2%  reduce 0%
13/10/10 16:32:20 INFO streaming.StreamJob:  map 0%  reduce 0%
13/10/10 16:32:41 INFO streaming.StreamJob:  map 1%  reduce 0%
13/10/10 16:32:44 INFO streaming.StreamJob:  map 2%  reduce 0%
13/10/10 16:32:50 INFO streaming.StreamJob:  map 1%  reduce 0%
13/10/10 16:33:00 INFO streaming.StreamJob:  map 100%  reduce 100%
13/10/10 16:33:00 INFO streaming.StreamJob: To kill this job, run:
13/10/10 16:33:00 INFO streaming.StreamJob: /usr/lib/hadoop-0.20-mapreduce//bin/hadoop job  -Dmapred.job.tracker=infotrellis:8021 -kill job_201310101312_0005
13/10/10 16:33:00 INFO streaming.StreamJob: Tracking URL: http://infotrellis:50030/jobdetails.jsp?jobid=job_201310101312_0005
13/10/10 16:33:00 ERROR streaming.StreamJob: Job not successful. Error: NA
13/10/10 16:33:00 INFO streaming.StreamJob: killJob...
Streaming Command Failed!


-------------
INFO: Not sending email when running from a tty or in debug mode
DEBUG: Removing directory /tmp/luigi/hadoop_job_2892438f6b46992f
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread

I don't understand why it is trying to read the line and split it on '\t', and I am always seeing "INFO: Not sending email when running from a tty or in debug mode".

Workers quit before all pending tasks are done

... and as a result some tasks are never run and eventually expired.

Consider the following set of tasks:

import time

import luigi

class Block(luigi.Task):
    def run(self):
        time.sleep(20)                  # sleep for 20 seconds
        self._finished = True

    def complete(self):
        # return true only after it has slept for 20 seconds
        return getattr(self, '_finished', False)

class Dep(luigi.Task):
    def requires(self):
        return Block()

    def run(self):
        # do something, such as creating a file
        with open('/tmp/dep_output', 'w') as f:
            f.write('done\n')

Submit Block, then in another terminal submit Dep while Block is running.

$> ./block_task.py Block
$> ./block_task.py Dep
INFO: There are no more tasks to run at this time
INFO: Block() is currently run by worker worker-428203729
INFO: Worker was stopped. Shutting down Keep-Alive thread

Now visit the scheduler's web UI at http://localhost:8082. We see that Dep is still pending, but it never gets run because all workers have exited.


I believe this is related to the following comment in luigi/worker.py.

# TODO: sleep for a bit and query server again if there are
# pending tasks in the future we might be able to run

No timestamps in server logs

There are no timestamps in the server logs (/var/log/luigi), and there is a single output file instead of hourly/daily rotated log files, which makes it extremely difficult to dig through the logs for specific events.

There should be a way to specify jobs which could affect other jobs

Imagine you have a job where, if it has run on date A, it doesn't need to be run for any previous date. If you schedule a bunch of these, however, they will all be scheduled and run. You can manually short-circuit the run with a check, but it would be nice to have a cleaner way to deal with this sort of situation.
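
A hedged sketch of the manual short-circuit mentioned above: complete() treats the task as done if it has already run for this date or any later date (paths and names are made up for illustration):

import datetime
import glob

import luigi

class LatestOnlyTask(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(self.date.strftime('/tmp/latest-only/%Y-%m-%d'))

    def complete(self):
        # a successful run for any date >= self.date makes this instance redundant
        for path in glob.glob('/tmp/latest-only/*'):
            done = datetime.datetime.strptime(path.split('/')[-1], '%Y-%m-%d').date()
            if done >= self.date:
                return True
        return False

    def run(self):
        with self.output().open('w') as f:
            f.write('ran\n')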

Passing lists in on the command-line doesn't work properly

If you have a parameter that's declared as a list, e.g.

import luigi
class ListTest(luigi.Task):
    my_list = luigi.Parameter([1,2,3], is_list=True)
    def run(self):
        print "\n"*5
        print self.my_list
        print "\n"*5

As long as you don't touch the defaults, or you pass in lists in Python, everything works fine.

python list_test.py ListTest --local-scheduler

prints (without the extra scheduling stuff)

(1, 2, 3)

But when you try to pass in the list in the command-line, the list is never parsed properly.

python list_test.py ListTest --my-list=[4,5,6] --local-scheduler

('[4,5,6]',)

I've tried all brands of character-escaping, but the list always comes in as a string. A solution I'm using as a stopgap is this:

import ast
class TupleParameter(luigi.Parameter):
    def parse(self, x):
        return tuple(ast.literal_eval(x))


class ListTestWorking(luigi.Task):
    my_list = TupleParameter([1,2,3])
    def run(self):
        print "\n"
        print self.my_list
        print "\n"

python list_test.py ListTestWorking --my-list=[4,5,6] --local-scheduler

now works properly

(4, 5, 6)

ast.literal_eval is supposed to be a "safe" version of eval, meaning it'll only work on valid Python literal structures.

A wrapper for a wrapper checks all dependencies twice

In the following setup:

class Wrapper1(luigi.WrapperTask):
    def requires(self):
        return Wrapper2()

class Wrapper2(luigi.WrapperTask):
    def requires(self):
        return [a for a in a_bunch_of_stuff()]

Dependencies of Wrapper2 will be checked twice, once when evaluating Wrapper1, once when evaluating Wrapper2

Example log output:

Checking if Wrapper1 is complete
Checking for flag at /a/b/c/_SUCCESS # dependency of Wrapper2
... # more checks
Scheduled Wrapper1
Checking if Wrapper2 is complete
Checking for flag at /a/b/c/_SUCCESS # same check again!
... # repeated checks
Scheduled Wrapper2

This isn't a bug per se, but more a behavior that can really slow things down if dependency checking takes a while (like when talking to Hive).

HiveQueryTask doesn't create parent dirs

We get "Failed with exception Unable to rename:" if the parent directory of the output directory doesn't exist when running Hive queries through HiveQueryTask. HiveQueryTask should probably create the directory before running the query if it doesn't exist.

Fix up tornado import guards

The current tornado import guards have untyped exception catching around large blocks of code. This is quite ugly and was a quick fix to make MR jobs work on Hadoop clusters without tornado installed. We should clean it up by re-arranging imports so that tornado isn't imported or used just because somebody imports the top-level luigi module. Other ideas are also appreciated.

Deadlock in unit tests

There is a race condition somewhere in the code (probably to do with the ping thread) that causes luigi to deadlock sometimes. This happens occasionally when running the unit tests through nosetests (noticed as failed travis tests on GH where tests take more than 10 min to complete).

I have occasionally seen this locally on my machine as well:

INFO: [pid 54810] Running DummyTask(id=18)
INFO: [pid 54810] Done DummyTask(id=18)
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 54810] Running DummyTask(id=19)
INFO: [pid 54810] Done DummyTask(id=19)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
.DEBUG: Checking if DummyTask(id=0) is complete
DEBUG: Checking if DummyTask(id=1) is complete
DEBUG: Checking if DummyTask(id=2) is complete
DEBUG: Checking if DummyTask(id=3) is complete
DEBUG: Checking if DummyTask(id=4) is complete
DEBUG: Checking if DummyTask(id=5) is complete
DEBUG: Checking if DummyTask(id=6) is complete
DEBUG: Checking if DummyTask(id=7) is complete
DEBUG: Checking if DummyTask(id=8) is complete
DEBUG: Checking if DummyTask(id=9) is complete
DEBUG: Checking if DummyTask(id=10) is complete
DEBUG: Checking if DummyTask(id=11) is complete
DEBUG: Checking if DummyTask(id=12) is complete
DEBUG: Checking if DummyTask(id=13) is complete
DEBUG: Checking if DummyTask(id=14) is complete
DEBUG: Checking if DummyTask(id=15) is complete
DEBUG: Checking if DummyTask(id=16) is complete
DEBUG: Checking if DummyTask(id=17) is complete
DEBUG: Checking if DummyTask(id=18) is complete
DEBUG: Checking if DummyTask(id=19) is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread ...
DEADLOCK HERE (the three dots above are written by nosetests, i.e. completed tests)

hadoop mapper should be able to return None to filter input rows

The current implementation of the map/reduce framework requires the mapper to output something. If the mapper yields None, _map_input will yield None, which then percolates up to either internal_writer or writer, which then writes out repr(None) (when internal_writer is used) or raises an Exception (when writer is used). Yielding None from the mapper is a convenient way to filter data -- not every input row needs to produce an output row.

In my opinion, _map_input should probably do something like this:

...
for output in self.mapper(*record):
    if output:
        yield output

The same thing is true for _reduce_input.
What do you think?
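
With a change like that in place, a filtering mapper could look like the sketch below (illustrative only; luigi.hadoop is the module name used in this era of luigi, newer versions move it under luigi.contrib.hadoop):

import luigi
import luigi.hadoop

class ErrorsOnly(luigi.hadoop.JobTask):
    def mapper(self, line):
        parts = line.split('\t')
        if parts[0] != 'ERROR':
            yield None               # dropped by _map_input instead of crashing the writer
        else:
            yield parts[1], 1

    def reducer(self, key, values):
        yield key, sum(values)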

luigi-grep

Apart from the awesome web interface that will be created at some point, I think it would be great (and easy to implement) to have a CLI tool for searching the luigi graph for tasks matching a pattern.

I was thinking something like this:
$ luigi-grep "Aggregate.*date=2013"
AggregateLogs(date=2013-01-05) RUNNING
AggregateLogs(date=2013-01-04) PENDING
AggregateLogs(date=2013-01-03) PENDING
AggregateLogs(date=2013-01-02) DONE

Adding something like an optional "blocked by" column in the output would also be great:
AggregateLogs(date=2013-01-03) PENDING blocked by (RUNNING|PENDING|FAILED) FetchSyslogs(date=2013-01-03)

Doing this via the luigi rest interface /api/graph call and some simple graph traversal should be almost trivial...
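
A hedged sketch of what such a tool could look like, built on the /api/graph call mentioned above; the exact URL layout and response format are assumptions here and would need to be checked against the scheduler:

import json
import re
import sys
from urllib.request import urlopen

def luigi_grep(pattern, scheduler='http://localhost:8082'):
    data = json.load(urlopen(scheduler + '/api/graph'))   # assumed endpoint
    tasks = data.get('response', {})                      # assumed response envelope
    regex = re.compile(pattern)
    for task_id, info in sorted(tasks.items()):
        if regex.search(task_id):
            print('%s %s' % (task_id, info.get('status', 'UNKNOWN')))

if __name__ == '__main__':
    luigi_grep(sys.argv[1])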

setup borks trying to open missing README.md

The luigi 1.0.1 package on PyPI doesn't include README.md, which setup.py tries to read.

$ pip install luigi
Unpacking /tmp/luigi-1.0.1
  Running setup.py egg_info for package from file:///tmp/luigi-1.0.1
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
      File "/tmp/pip-M7TD_D-build/setup.py", line 21, in <module>
        for line in open('README.md'):
    IOError: [Errno 2] No such file or directory: 'README.md'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 16, in <module>

  File "/tmp/pip-M7TD_D-build/setup.py", line 21, in <module>

    for line in open('README.md'):

IOError: [Errno 2] No such file or directory: 'README.md'

RFC: Luigi Task History

Problem

The luigi scheduler's main purpose is to act as a locking service for running jobs. Its secondary function is to provide a web interface for viewing the luigi dependency graph. There are a few pieces of important information that it doesn't provide, such as:

  1. A historic list of runs. In fact, the scheduler aggressively prunes finished tasks in order to keep the dependency graph reasonable.
  2. Per-task run times, or per-task run histories (time, status, which host it ran on, etc).
  3. A link from a running task to the JobTracker details page for that task.
  4. A mechanism to investigate the worker logs for a task.

Data Recording

There are a few potential mechanisms to get at this data:

Job History

The Hadoop JobTracker History has much of this data. Luigi sets mapred.job.name (unless the sub-task overrides) to the same name as the job itself. The job names could be indexed, which would give us a mechanism to investigate 1, 2, and 3. This solution doesn't work if the user overrides mapred.job.name or in the case of a job that spawns multiple mapreduce jobs (i.e. hive or pig).

Scheduler recording

The scheduler could record much of this data, to a local database or file. Unfortunately, the scheduler doesn't currently have access to information like which mapreduce jobs were generated or the task output. This data could potentially be uploaded from the workers to the scheduler on completion or failure.

Worker recording

The workers have all the information about a run. In addition, the workers already sort of have an event system -- at least we have on_complete() and on_failure(). So we could plug some (or all) of this logic into those methods, although it might also be nice to have updates while Tasks are running (particularly to get links to the jobs that they're running and current job logs).

Data Storage

This data would fit well in either a relational database or a column-family database like hbase or cassandra.

Relational

This is a strawman proposal of a relational database schema
Task(
name String,
id int auto increment primary key,
host String,
)

TaskParameters(
taskId foreign key Task.id,
name string,
value string)

TaskEvents(
taskId foreign key Task.id,
eventName String, -- READY, START, END, FAIL
ts Timestamp
)

TaskHadoopJobs(
taskId foreign key Task.id,
id String,
url String,
startTime ts,
endTime ts,
status String)

TaskLogs(
taskId foreign key Task.id,
logType String, // stderr, sdout
uri String // uri in HDFS or somewhere else.
)

Column-family

Task table with 'details' (Task), 'parameters' (TaskParameters), 'events' (TaskEvents), 'hadoopJobs' (TaskHadoopJobs) and 'logs' (TaskLogs) column families.

UI

I'm envisioning a table-based UI that ignores the dependency graph and only shows the status per task. For example, we have something similar for workflows in oozie: http://cl.ly/image/343k3h2y2u0L (the table would look slightly different due to the differences between oozie and luigi).

This could live separately from the luigi scheduler or be deployed alongside it a la the api server.

Possible extensions

  1. A "predicted" or "estimated" time based upon historical data.
  2. ...

Visualizer does a bad job rendering large graphs

For graphs with > 100 vertices, the visualizer produces a very messy layout that is hard to follow (depending on how many edges you have).

For graphs with > 1000 vertices, it's painfully slow to render the page.

We have been thinking about solving this for a very long time at Spotify :)

Some ideas include:

  • Only draw currently running tasks + failed tasks + all nodes within 1 step
  • Show traceback of failed tasks
  • Include timestamps for done tasks (and in the future also ETA for pending tasks)
  • Only show info for each worker, with links to see more stuff
  • Not using GraphViz (but instead some JS graph library?)
  • ...

LocalTarget(is_tmp=True) always reports as complete

Perhaps I'm doing something wrong, but it seems that tasks with an output of LocalTarget(is_tmp=True) are always considered complete, which is of course a problem because it means they never run. For example, consider running the following:

from luigi import Task, LocalTarget, run

class A(Task):

    def run(self):
        f = self.output().open('w')
        f.write('message in a bottle')
        f.close()

    def output(self):
        return LocalTarget(is_tmp=True)

class B(Task):

    def requires(self):
        return A()

    def run(self):
        f = self.input().open()
        print f.read() # will print an empty line
        f.close()

if __name__ == '__main__':
    run()

or, from an ipython session, type

from luigi import LocalTarget
LocalTarget(is_tmp=True).exists() # True

I think this problem exists because a temp file is created when the LocalTarget is instantiated, and consequently it exists... and the task is considered complete... you get the idea. I'm not sure yet what the best fix would be.

Is there any way to circumvent this? Is anybody using the is_tmp argument with any success?

Failed dependency reported only once

A failed dependency is reported only once, even if it is a dependency of several different parent jobs.

class TestDepFail(luigi.ExternalTask):
    def complete(self): return False
class TestJob(luigi.Task):
    param = luigi.IntParameter()
    def complete(self): return False
    def requires(self): return TestDepFail()
class TestWrapJob(luigi.WrapperTask):
    def requires(self): return [TestJob(11111), TestJob(22222)]

output is:

DEBUG: Checking if TestWrapJob() is complete
INFO: Scheduled TestWrapJob()
DEBUG: Checking if TestJob(param=22222) is complete
INFO: Scheduled TestJob(param=22222)
DEBUG: Checking if TestDepFail() is complete
WARNING: Task TestDepFail() is not complete and run() is not implemented. Probably a missing external dependency.
DEBUG: Checking if TestJob(param=11111) is complete
INFO: Scheduled TestJob(param=11111)
INFO: Done scheduling tasks

In my case there were 4 wrapper tasks, each with 90+ dependencies, with one failing. Just by looking at the messages for the last wrapper job, it is impossible to identify the problem for that job.
