ngr-t / luigi_for_data_science

An extension of luigi.Task more suitable for reproducible data analysis workflows

License: MIT License

luigi_for_data_science's Introduction

Luigi for reproducible data analysis workflows

luigi is a powerful, highly extensible DAG workflow manager. It is useful for building data analysis pipelines, but part of its default behavior works against reproducibility and consistency. By default, luigi.Task checks only whether its output exists, so a task is considered complete even when its inputs have changed, as long as the output from an earlier run is still present.
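
To make the problem concrete, here is a minimal sketch that uses only the standard library. The function `complete_by_existence` is a hypothetical stand-in that mimics an existence-only completeness check; it is not luigi code, and the file names are made up for illustration.

```python
import os
import tempfile


def complete_by_existence(output_path):
    """Hypothetical stand-in: 'complete' means only that the output exists."""
    return os.path.exists(output_path)


with tempfile.TemporaryDirectory() as workdir:
    input_path = os.path.join(workdir, "input.txt")
    output_path = os.path.join(workdir, "output.txt")

    # First run: derive the output from the input.
    with open(input_path, "w") as f:
        f.write("version 1")
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.write(src.read().upper())

    # The input changes afterwards, but the old output is still on disk.
    with open(input_path, "w") as f:
        f.write("version 2")

    # The existence-only check still reports the task as complete,
    # although the output was derived from the old input.
    still_complete = complete_by_existence(output_path)
    with open(output_path) as f:
        stale_output = f.read()
```

Under these assumptions, `still_complete` is `True` while `stale_output` still reflects the first version of the input, which is the inconsistency described above.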

Here I present an extension of luigi.Task that is more suitable for reproducible data analysis workflows. It overrides the complete method of luigi.Task so that the hash values of the current inputs are compared with those recorded in the previous run.

Thanks to the luigi team.

How to use?

  1. Make your tasks inherit from hash_checking_tasks.TaskWithCheckingInputHash.
  2. Make each task's output and all of its inputs inherit from hash_checking_tasks.HashableTarget.
  3. Run.

How does it work?

TaskWithCheckingInputHash is an extension of luigi.Task that behaves as follows:

  • In the complete() method, it checks that the tasks it depends on are complete.
  • It checks whether the inputs of the previous run are equal to those of the current run.
  • If the run is successful, it stores information about the task.
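
The three steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's implementation: `SketchTask`, `hash_bytes`, and the in-memory attributes are hypothetical names, and SHA-256 is an assumed choice of hash function.

```python
import hashlib


def hash_bytes(data):
    """Assumed hash function for this sketch: SHA-256 of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


class SketchTask:
    """Hypothetical in-memory stand-in for a task; not the repository's class."""

    def __init__(self, name, inputs, upstream=()):
        self.name = name
        self.inputs = inputs            # dict: input name -> bytes content
        self.upstream = list(upstream)  # tasks this task depends on
        self.output = None              # bytes once the task has run
        self.stored_input_hash = None   # recorded after a successful run

    def current_input_hash(self):
        return {key: hash_bytes(value) for key, value in self.inputs.items()}

    def complete(self):
        # Step 1: every task this one depends on must itself be complete.
        if not all(task.complete() for task in self.upstream):
            return False
        # An output must exist at all.
        if self.output is None:
            return False
        # Step 2: the inputs of the previous run must equal the current ones.
        return self.stored_input_hash == self.current_input_hash()

    def run(self):
        self.output = b"".join(self.inputs[key] for key in sorted(self.inputs))
        # Step 3: record the input hashes only after the run has succeeded.
        self.stored_input_hash = self.current_input_hash()


clean = SketchTask("clean", {"raw": b"a,b\n1,2\n"})
report = SketchTask("report", {"note": b"summary"}, upstream=[clean])

before_run = clean.complete()
clean.run()
report.run()
after_run = clean.complete() and report.complete()

# Changing an upstream input invalidates both tasks.
clean.inputs["raw"] = b"a,b\n3,4\n"
after_change = (clean.complete(), report.complete())
```

In this sketch the downstream task becomes incomplete as soon as an upstream input changes, because its complete() first asks its dependencies.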

TaskWithCheckingInputHash relies on HashableTarget, which provides the following:

  • hash_content() returns a hash of the target's content, so the content of targets can be checked for equality by comparing these values.
  • get_current_input_hash() retrieves the information about the task that produced the current output, if it exists.
  • store_input_hash() stores the information about the task that produced the output.
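
As a rough illustration of this interface, the sketch below implements the three method names for a local file. Only the method names come from this README; the class name `SketchFileTarget`, the JSON sidecar file, the method signatures, and the use of SHA-256 are all assumptions made for the example and may differ from the actual HashableTarget.

```python
import hashlib
import json
import os
import tempfile


class SketchFileTarget:
    """Hypothetical file-backed target; method bodies are assumptions."""

    def __init__(self, path):
        self.path = path
        # Assumed location for the stored input hashes: a JSON sidecar file.
        self._meta_path = path + ".input_hash.json"

    def hash_content(self):
        # Hash of the file content, used to compare targets for equality.
        with open(self.path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def get_current_input_hash(self):
        # Return the recorded input hashes, or None if nothing was stored yet.
        if not os.path.exists(self._meta_path):
            return None
        with open(self._meta_path) as f:
            return json.load(f)

    def store_input_hash(self, input_hash):
        # Record the hashes of the inputs that produced this output.
        with open(self._meta_path, "w") as f:
            json.dump(input_hash, f, sort_keys=True)


with tempfile.TemporaryDirectory() as workdir:
    source = SketchFileTarget(os.path.join(workdir, "source.csv"))
    result = SketchFileTarget(os.path.join(workdir, "result.csv"))

    with open(source.path, "w") as f:
        f.write("a,b\n1,2\n")
    with open(result.path, "w") as f:
        f.write("3\n")

    no_record_yet = result.get_current_input_hash()
    result.store_input_hash({"source": source.hash_content()})
    unchanged = result.get_current_input_hash() == {"source": source.hash_content()}

    # After the source changes, the stored hash no longer matches.
    with open(source.path, "w") as f:
        f.write("a,b\n5,6\n")
    changed = result.get_current_input_hash() != {"source": source.hash_content()}
```

Keeping the stored hashes next to the output is one possible design; it lets a later run decide whether the output is stale without rerunning anything.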

TODO:

  • Docstrings for all public methods.
