Drake

Drake is a simple-to-use, extensible, text-based data workflow tool that organizes command execution around data and its dependencies. You define data processing steps along with their inputs and outputs, and Drake automatically resolves their dependencies and calculates:

  • which commands to execute (based on file timestamps)
  • in what order to execute the commands (based on dependencies)
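The two decisions above can be illustrated with standard shell tools. This is only a sketch of the idea, not Drake's implementation: the file names are hypothetical, tsort stands in for dependency ordering, and the -nt file test stands in for the timestamp comparison.

```shell
#!/bin/bash
# Sketch only: hypothetical file names, standard tools standing in for Drake.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Ordering: each "A B" pair says A must exist before B can be built.
# tsort prints one order that respects every pair.
build_order=$(printf '%s\n' \
  'contracts.csv evergreens.csv' \
  'evergreens.csv report.txt' | tsort)
echo "$build_order"

# Staleness: rebuild when the output is missing or older than its input.
needs_rebuild() {
  [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

check() {
  if needs_rebuild "$1" "$2"; then echo stale; else echo fresh; fi
}

touch -t 202001010000 contracts.csv
missing_output=$(check contracts.csv evergreens.csv)   # output does not exist yet

touch -t 202001020000 evergreens.csv
newer_output=$(check contracts.csv evergreens.csv)     # output is newer than its input

touch -t 202001030000 contracts.csv
updated_input=$(check contracts.csv evergreens.csv)    # input changed after the build

echo "$missing_output $newer_output $updated_input"
```

The last line prints "stale fresh stale": a step runs when its output is missing or when its input has changed since the output was written.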

Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.

Drake walk-through

If you like screencasts, check out this Drake walk-through video recorded by Artem Boytsov, Drake's primary designer.

Installation

You will need to have a JVM installed. Drake has been tested under Linux, Mac OS X and Windows 8. We've not tested it on other operating systems.

Download or build the uberjar

You can build Drake from source, which is the preferred way to run the most up-to-date version, or you can download a prebuilt uberjar, which may not be the most recent version of Drake.

The following instructions build Drake from source. Drake is a Clojure project, so you will need Leiningen.

Clone the project:

$ git clone git@github.com:Factual/drake.git
$ cd drake

Build the uberjar:

$ lein uberjar

Run Drake from the uberjar

Once you've built or downloaded the uberjar, you can run Drake like this:

$ java -jar drake.jar

You can pass in arguments and options to Drake by putting them at the end of the above command, e.g.:

$ java -jar drake.jar --version

A nicer way to run Drake

We recommend you "install" Drake in your environment so that you can run it by just typing "drake". Here's a convenience script you can put on your path:

#!/bin/bash
java -cp "$(dirname "$0")/drake.jar" drake.core "$@"

Save that as drake, make it executable with chmod 755 drake, and move the uberjar into the same directory. Now you can just type drake to run Drake from anywhere.
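Those steps might be scripted as follows. This is a sketch under assumptions: install_drake is a hypothetical helper that is not part of Drake, and the directory you pass it is assumed to be on your PATH already.

```shell
#!/bin/bash
# install_drake is a hypothetical helper, not part of Drake.
# Usage: install_drake <directory-on-PATH> <path-to-uberjar>
install_drake() {
  install_dir=$1
  jar_path=$2

  if [ ! -f "$jar_path" ]; then
    echo "uberjar not found: $jar_path" >&2
    return 1
  fi

  mkdir -p "$install_dir" || return 1

  # Write the wrapper; the quoted 'EOF' keeps $0 and $@ literal,
  # and the inner quotes let the script live in a path with spaces.
  cat > "$install_dir/drake" <<'EOF'
#!/bin/bash
java -cp "$(dirname "$0")/drake.jar" drake.core "$@"
EOF
  chmod 755 "$install_dir/drake"

  # Copy (rather than move) the uberjar next to the wrapper.
  cp "$jar_path" "$install_dir/drake.jar"
}
```

For example, install_drake "$HOME/bin" ./drake.jar would set things up in $HOME/bin, provided that directory is on your PATH and the uberjar is in the current directory.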

Faster startup time

The JVM startup time can be a nuisance. To reduce startup time, we recommend using the way cool Drip. Please see the Drake with Drip wiki page.

Basic Usage

Drake documentation refers to running Drake as "drake". If you are instead running the uberjar, just replace "drake" with "java -jar drake.jar" in the examples.

The wiki is the home for Drake's documentation, but here are simple notes on usage:

To build a specific target (and any out-of-date dependencies, if necessary):

$ drake mytarget

To build a target and everything that depends on it (a.k.a. "down-tree" mode):

$ drake ^mytarget

To build a specific target only, without any dependencies, up or down the tree:

$ drake =mytarget

To force build a target:

$ drake +mytarget

To force build a target and everything down-tree of it:

$ drake +^mytarget

To force build the entire workflow:

$ drake +...

To exclude targets:

$ drake ... -sometarget -anothertarget

By default, Drake will look for ./Drakefile. The simplest way to run your workflow is to name your workflow file Drakefile, and make sure you're in the same directory. Then, simply:

$ drake
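For orientation, here is a minimal sketch of what a Drakefile might contain, written out from the shell. The file names and commands are hypothetical. The step form (an "output <- input" line followed by indented shell commands that read $INPUT and write $OUTPUT, with ";" starting a comment) follows the Drake documentation as best we can summarize it here; consult the wiki for the authoritative syntax.

```shell
#!/bin/bash
# Hypothetical two-step workflow in a scratch directory.
workflow_dir=$(mktemp -d)
cd "$workflow_dir" || exit 1

# Sample input data (made up for this sketch).
printf 'id,type\n1,Evergreen\n2,Fixed\n' > contracts.csv

# Quoted 'EOF' so $INPUT and $OUTPUT reach the Drakefile unexpanded.
cat > Drakefile <<'EOF'
; Keep only the evergreen contracts
evergreens.csv <- contracts.csv
  grep Evergreen $INPUT > $OUTPUT

; Count the rows that remain
count.txt <- evergreens.csv
  wc -l < $INPUT > $OUTPUT
EOF

# With Drake installed, run from this directory:
#   drake              # build everything that is out of date
#   drake count.txt    # build one target and any out-of-date dependencies
```

The script only writes the files; it does not invoke Drake, so it is safe to run before Drake is installed.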

To specify the workflow file explicitly, use -w or --workflow. E.g.:

$ drake -w /myworkflow/my-workflow.drake

Use drake --help for the full list of options.

Documentation, etc.

The wiki is the home for Drake's documentation.

A lot of work went into designing and specifying Drake. To prove it, here's the 60-page specification and user manual. It's stored in Google Docs, and we encourage everyone to use its superb commenting feature to provide feedback. Just select the text you want to comment on, and click Insert -> Comment (Ctrl + Alt + M on Windows, Cmd + Option + M on Mac). It can also be downloaded as a PDF.

There are annotated workflow examples in the demos directory.

There's a Google Group for Drake where you can ask questions. If you find a bug or want to submit a feature request, go to Drake's GitHub issues page.

Asynchronous Execution of Steps

Please see the wiki page on async.

Plugins

Drake has a plugin mechanism, allowing developers to publish and use custom plugins that extend Drake. See the Plugin wiki page for details.

HDFS Compatibility

Drake provides HDFS support by allowing you to specify inputs and outputs like hdfs:/my/big_file.txt.

If you plan to use Drake with HDFS, please see the wiki page on HDFS Compatibility.

Amazon S3 Compatibility

Thanks to Chris Howe, Drake now has basic compatibility with Amazon S3 by allowing you to specify inputs and outputs like s3://bucket/path/to/object.

If you plan to use Drake with S3, please see the wiki doc on S3 Compatibility.

Drake on the REPL

You can use Drake from your Clojure REPL, via drake.core/run-workflow. Please see the Drake on the REPL wiki page for more details.

Stuff outside this repo

Thanks to Lars Yencken, we now have Vim syntax support for Drake.

Also thanks to Lars Yencken, there are utilities for making life easier in Python with Drake workflows.

Courtesy of @daguar, an alternative approach to installing Drake on Mac OS X.

License

Source Copyright © 2012-2013 Factual, Inc.

Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.
