fbrubacher / lemur

This project forked from mlimotte/lemur

License: Apache License 2.0

lemur's Introduction

Overview

Lemur is a tool to launch hadoop jobs locally or on EMR, based on a configuration file, referred to as a jobdef. The jobdef file describes your EMR cluster, local environment, pre- and post-actions (aka hooks) and zero or more "steps". A step is Amazon's name for a task or job submitted to the cluster. Lemur reads your jobdef; at the end of the jobdef, you call (fire! ...) to make things happen. Keep in mind that the jobdef is an interpreted Clojure (.clj) file, so you can insert arbitrary Clojure code to be executed anywhere in the file (but see HOOKS below for a better way).

Features
  • Launch EMR cluster and submit step(s); or run against local hadoop (usually hadoop standalone for dev and testing)
  • Basic configuration options include:
      • Bootstrap actions
      • Hadoop config
      • Uploads (files to transfer to S3, or local)
      • Cluster details (num instances, master instance type, etc)
      • Output paths to use for data, logs, main jar, etc.
      • Support for spot market instances
  • Profile support provides packages of options and functionality that can be enabled or disabled (e.g. you can have a :test profile or a :live profile)
  • Validation for your command line options and environment before launching EMR and your job
  • Override configured options via command line
  • Hooks for actions that should be triggered before or after job launch. For example, one hook in use at Climate Corporation diffs the results of a local run as a full integration test; another posts a detailed message to chat (IRC or HipChat) when a new job is started
  • Optionally wait for an EMR job to complete
  • A dry-run feature, so you can check the final cluster configuration, arguments that will be sent to your hadoop main, etc.
  • All the details from dry-run (cluster/step config, etc) are persisted with each job run
  • All settings can be literal values, interpolated strings (e.g. set the S3 bucket as "com.your-co.${env}.hadoop"), or functions for ultimate flexibility
  • Import common options, functionality and behavior to avoid duplication (i.e. DRY principle)
  • Pass-through command-line options allow you to specify extra args on the command line that are meaningful to your hadoop main function but unknown to lemur or your jobdef
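
The interpolated-string setting in the feature list ("com.your-co.${env}.hadoop") uses the same ${name} placeholder syntax as shell parameter expansion. The sketch below is only an illustration in plain shell, not lemur's own interpolation: it previews what such a string looks like once a value is substituted, and shows that single quotes are needed if a placeholder has to reach a program unexpanded. The value "test" is a hypothetical example.

```shell
# Illustration only: this is the shell expanding ${env}, not lemur.
env="test"                               # hypothetical value
bucket="com.your-co.${env}.hadoop"       # double quotes: the shell substitutes
echo "$bucket"                           # prints com.your-co.test.hadoop

# Single quotes keep the placeholder literal, so the text would reach
# another program exactly as written.
literal='com.your-co.${env}.hadoop'
echo "$literal"
```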

A Note About the Ruby elastic-mapreduce CLI tool

Lemur does not try to replace elastic-mapreduce. While there is some overlap, lemur is focused on launching. It provides no replacement for many common activities that elastic-mapreduce supports, for example "elastic-mapreduce --list". I recommend that you install elastic-mapreduce alongside lemur (or rely on the AWS Console for those activities).

Installation

  1. Download the tar-gzip from the GitHub Downloads link
  2. Expand into some install location
  3. Set LEMUR_HOME to the top of the install path
  4. Set LEMUR_EXTRA_CLASSPATH to any classpath entries (colon separated) that you want lemur to include when it runs your jobdef. For example, the classpath that includes your base files, or other functions or libraries used by your jobdefs.
  5. [optional] Set AWS_CREDENTIAL_FILE to a file with AWS credentials (see AWS Credentials below).
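
As a minimal sketch, the environment setup from the steps above might look like this in a shell profile. Every path here is a hypothetical placeholder; substitute your own install location and classpath entries. The final PATH line is an assumption for convenience, not a documented installation step.

```shell
# Hypothetical install location: wherever you expanded the tarball.
export LEMUR_HOME="$HOME/opt/lemur"

# Hypothetical classpath entries, colon separated, for code your jobdefs use.
export LEMUR_EXTRA_CLASSPATH="$HOME/work/jobdefs/src:$HOME/work/jobdefs/lib/helpers.jar"

# Optional: a file holding AWS credentials (hypothetical path).
export AWS_CREDENTIAL_FILE="$HOME/credentials.json"

# Assumption, not a documented step: put bin/ on PATH so that a bare
# "lemur" command resolves to $LEMUR_HOME/bin/lemur.
export PATH="$LEMUR_HOME/bin:$PATH"
```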

AWS Credentials

Interestingly, the various AWS services' supporting command-line tools all have different methods for getting access-key and secret-key.

elastic-mapreduce uses a JSON file. CloudWatch and CloudSearch use a properties file identified by AWS_CREDENTIAL_FILE (although the key names are different in each case), and s3cmd looks for yet another properties file in ~/.s3cfg.

You can explicitly set the credentials with the environment variables LEMUR_AWS_ACCESS_KEY and LEMUR_AWS_SECRET_KEY.
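
A minimal sketch of the environment-variable approach follows. The two values are the README's own sample strings, not real credentials. The two guard lines at the end are an optional addition (standard shell, not a lemur feature) that stop a script early if either variable is empty.

```shell
# Placeholder values; replace with your own access key and secret key.
export LEMUR_AWS_ACCESS_KEY="EXAMPLEDV82HJBSHFAKE"
export LEMUR_AWS_SECRET_KEY="Sample/GudsbGjjJuz0gf6asdgvxasdasdv521gd"

# Optional guard: abort with a message if either variable is unset or empty.
: "${LEMUR_AWS_ACCESS_KEY:?LEMUR_AWS_ACCESS_KEY is not set}"
: "${LEMUR_AWS_SECRET_KEY:?LEMUR_AWS_SECRET_KEY is not set}"
```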

Alternatively, Lemur will accept credentials in any of the file formats above. You can set the AWS_CREDENTIAL_FILE environment variable to the path of one of those files. Otherwise, it looks for credentials.json either in the current working directory (PWD) or alongside the elastic-mapreduce executable (the location reported by "which elastic-mapreduce"). If you want more detail, see com.climate.services.aws.common/aws-credential-discovery in this package.

For reference, the JSON format is:

{"access_id": "EXAMPLEDV82HJBSHFAKE",
 "private_key": "Sample/GudsbGjjJuz0gf6asdgvxasdasdv521gd"}
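
One way to create such a file, sketched under the assumption that you want it in the current working directory (one of the locations lemur is described as checking). The key values are the sample placeholders from above; the chmod step is a general precaution for secret files, not something lemur is documented to require.

```shell
# Write the JSON credentials file into the current directory.
cat > credentials.json <<'EOF'
{"access_id": "EXAMPLEDV82HJBSHFAKE",
 "private_key": "Sample/GudsbGjjJuz0gf6asdgvxasdasdv521gd"}
EOF

# Precaution: make the file readable and writable by the owner only.
chmod 600 credentials.json

# Optional: point lemur at the file explicitly instead of relying on PWD.
export AWS_CREDENTIAL_FILE="$PWD/credentials.json"
```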

Compatibility

Lemur version    Clojure version
v0.9.7           1.2
v1.0.1+          1.3

I've used lemur on Mac OS X and Linux. It MAY work on Windows (if you use cygwin). If you try it on Windows, I would be interested in hearing about your experience (patches welcome).

Usage

The general command line format is:

bin/lemur <command> <jobdef-file> [options] [remaining]

bin/lemur help                    - display this help text
bin/lemur run ./jobdef.clj        - Run a job on EMR
bin/lemur dry-run ./jobdef.clj    - Dry-run, i.e. just print out what would be done
bin/lemur start ./jobdef.clj      - Start an EMR cluster, but don't run the steps (jobs)
bin/lemur local ./jobdef.clj      - Run the job using local hadoop (e.g. standalone mode)

Examples

lemur run clj/wb-clj/scripts/launch/hrap-jobdef.clj --dataset ahps --num-days 10
lemur start clj/wb-clj/src/weatherbill/lemur/sample-jobdef.clj
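
The dry-run and run commands above can be combined so that a job is only launched after its dry-run succeeds. The helper below is a hypothetical sketch, not part of lemur: the function name lemur_checked_run and the LEMUR_BIN variable are invented for illustration, and it assumes that "lemur dry-run" exits with a non-zero status when something is wrong, which the README does not confirm.

```shell
# Hypothetical helper (not part of lemur): dry-run first, then run.
# Assumes dry-run returns a non-zero exit status on failure.
lemur_checked_run() {
  jobdef="$1"
  shift
  # LEMUR_BIN is an invented override; the default matches the usage text.
  lemur_bin="${LEMUR_BIN:-bin/lemur}"
  # Pass the same options to both commands so the dry-run reflects the real run.
  "$lemur_bin" dry-run "$jobdef" "$@" || return 1
  "$lemur_bin" run "$jobdef" "$@"
}
```

Usage would then mirror the examples above, e.g. lemur_checked_run ./jobdef.clj --dataset ahps --num-days 10.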

Help

Feedback and feature requests are welcome!

Contributors

acrao, bzimmer, mlimotte, monodeldiablo, ndimiduk