This project forked from stuartsierra/clojure-hadoop

Library to aid writing Hadoop jobs in Clojure.

License: Eclipse Public License 1.0

clojure-hadoop

A library to assist in writing Hadoop MapReduce jobs in Clojure.

Originally written by Stuart Sierra (http://stuartsierra.com/).
Extended by Roman Scherer, Christopher Miles, Ian Eslick,
Dave Lambert, Alex Ott, and others.

Stable releases are available via http://clojars.org

For more information:
  on Clojure, http://clojure.org/
  on Hadoop, http://hadoop.apache.org/

Also see Stuart's presentation about this library at
http://vimeo.com/7669741

An introduction to working with the library is available at
http://alexott.net/en/clojure/ClojureHadoop.html

Copyright (c) Stuart Sierra, 2009. All rights reserved.  The use and
distribution terms for this software are covered by the Eclipse Public
License 1.0 (http://opensource.org/licenses/eclipse-1.0.php) which can
be found in the file LICENSE.html at the root of this distribution.
By using this software in any fashion, you are agreeing to be bound by
the terms of this license.  You must not remove this notice, or any
other, from this software.


DEPENDENCIES

This library requires the Java 6 JDK, http://java.sun.com/

Building from source requires Leiningen, http://github.com/technomancy/leiningen


BUILDING

If you downloaded the library distribution as a .zip or .tar file,
everything is pre-built and there is nothing you need to do.

If you downloaded the sources from Git, then you need to run the build
with Leiningen. In the top-level directory of this project, run:

    lein jar

This compiles and builds the JAR file.


RUNNING THE EXAMPLES & TESTS

After building, copy the file

    clojure-hadoop-${VERSION}.jar

to a file with a short name, like "examples.jar".  Each of the *.clj files in
the test/clojure_hadoop/examples directory contains instructions for
running that example.

The wordcount examples can also be run via the "lein test" command.


USING THE LIBRARY IN HADOOP

After building, include the "clojure-hadoop-${VERSION}.jar" file
in the lib/ directory of the JAR you submit as your Hadoop job.


DEPENDING ON THE LIBRARY WITH MAVEN

You can depend on clojure-hadoop in your Maven 2 projects by adding
the following lines to your pom.xml:

    <dependencies>
      ...

      <dependency>
        <groupId>clojure-hadoop</groupId>
        <artifactId>clojure-hadoop</artifactId>
        <version>${VERSION}</version>
      </dependency>

      ...
    </dependencies>
    ...
    <repositories>
      ...

      <repository>
        <id>clojars</id>
        <url>http://clojars.org/repo</url>
      </repository>

      ...
    </repositories>
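
For projects built with Leiningen rather than Maven, the same
coordinates can be declared in project.clj. The fragment below is a
minimal sketch: the project name "my-hadoop-job" and its version are
hypothetical, and VERSION is a placeholder for a released
clojure-hadoop version, as in the pom.xml above. Leiningen searches
Clojars by default, so no extra repository entry should be needed.

```clojure
;; project.clj -- hypothetical project name and version, for illustration only
(defproject my-hadoop-job "0.1.0-SNAPSHOT"
  ;; Replace VERSION with a released clojure-hadoop version from Clojars.
  :dependencies [[clojure-hadoop "VERSION"]])
```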


USING THE LIBRARY

This library provides different layers of abstraction away from the
raw Hadoop API.

Layer 1: clojure-hadoop.imports

    Provides convenience functions for importing the many classes and
    interfaces in the Hadoop API.

Layer 2: clojure-hadoop.gen

    Provides gen-class macros to generate the multiple classes needed
    for a MapReduce job.  See the example file "wordcount1.clj" for a
    demonstration of these macros.

Layer 3: clojure-hadoop.wrap

    Provides wrapper functions that automatically
    convert between Hadoop Text objects and Clojure data structures.
    See the example file "wordcount2.clj" for a demonstration of these
    wrappers.

Layer 4: clojure-hadoop.job

    Provides a complete implementation of a Hadoop MapReduce job that
    can be dynamically configured to use any Clojure functions in the
    map and reduce phases.  See the example file "wordcount3.clj" for
    a demonstration of this usage.

Layer 5: clojure-hadoop.defjob

    A convenient macro to configure MapReduce jobs with Clojure code.
    See the example files "wordcount4.clj" and "wordcount5.clj" for
    demonstrations of this macro.

Layer 6: clojure-hadoop.defjob - Specifying JobConf parameters

    It is often necessary to specify parameters in the job's
    configuration in order to enable dynamic map/reduce jobs.
    Hadoop natively enables this through the -D<key>=<value>
    command-line option.

    Using the convenient defjob macro, "wordcount6.clj" demonstrates
    how to set job configuration (JobConf) parameters either via
    the command line or as part of the defjob definition within the file.

Layer 7: clojure-hadoop.config - Adding files and archives
to the DistributedCache

    Example file "wordcount7.clj" demonstrates how to specify files
    and archives for distribution across nodes via the
    DistributedCache, as well as how to access the files
    during the mapper-setup or reducer-setup phases.
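
The higher layers (4 and 5) let ordinary Clojure functions serve as
the map and reduce phases. The sketch below shows what such functions
might look like for a word count, using plain Clojure only. The names
my-map, my-reduce and word-counts are hypothetical, and the calling
convention (a key and a value in, a sequence of [key value] pairs out)
is an assumption for illustration; the wordcount example files show
the conventions the library actually expects. The word-counts helper
only simulates the map, group and reduce steps in memory and does not
involve Hadoop.

```clojure
(require '[clojure.string :as str])

;; Hypothetical mapper: ignores the input key and emits a [word 1]
;; pair for every whitespace-separated word in the line.
(defn my-map [_key line]
  (for [word (str/split (str/trim line) #"\s+")
        :when (seq word)]            ; skip the empty string from blank lines
    [word 1]))

;; Hypothetical reducer: sums the counts collected for one word.
;; Assumption: counts arrives as a plain sequence of numbers.
(defn my-reduce [word counts]
  [[word (reduce + counts)]])

;; In-memory stand-in for the shuffle: map every line, group the
;; emitted pairs by word, then reduce each group.
(defn word-counts [lines]
  (->> lines
       (mapcat #(my-map nil %))
       (group-by first)
       (mapcat (fn [[word pairs]] (my-reduce word (map second pairs))))
       (into {})))
```

Because the functions are pure, they can be tried at the REPL, for
example with (word-counts ["a b" "a"]), before being wired into a job.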

NOTES

* README.txt changed to reflect the Leiningen build process (Roman Scherer).

clojure-hadoop's People

Contributors

alexott, cdorrat, cmiles74, davelambert, eslick, methylene, mtnygard, r0man, valyagolev

clojure-hadoop's Issues

Nicer way to pass arguments to jobs

It would be nice to have a more functional, idiomatic Clojure way to parameterise jobs with config parameters. At the moment, if I understand correctly, I need to specify a :map-setup which rebinds some global state based on config arguments specified on the command line.

Would it be possible to avoid the need for this rebinding, or at least to hide it behind a nicer defjob macro along the lines of:

(defn my-map
  [ngram-order key value]
  ;; ...
  )

(defjob foo
  [config]
  {:map (partial my-map (:ngram-order config))
   ; ... etc ...
  })

I realise there are probably inherent limitations here in the way Hadoop instantiates these things, but I thought the API feedback might be useful :)

How to do a MultipleTextOutputFormat

Sorry for filing an issue for this, but I was not sure where best to ask this question. Is there a way to define a MultipleTextOutputFormat? Or do I need to do

(defjob/defjob job
...
:output-format my-class
...
)

and do some gen-class exercises to define my-class? I tried to just create the function and reference it from :output-format, but had no luck, as it wants a class.

I appreciate any help (and the library :))

Set Kerberos credential before submitting HBase job

When submitting a map-reduce job that accesses HBase, we have to check User.isSecurityEnabled(), as in the following Java snippet:

if (User.isSecurityEnabled()) {
    try {
        User.getCurrent().obtainAuthTokenForJob(job.getConfiguration(), job);
    } catch (IOException ioe) {
        LOG.error(job.getJobName()+ ": Failed to obtain current user.");
    } catch (InterruptedException ie) {
        LOG.info(job.getJobName()+ ": Interrupted obtaining user authentication token");
        Thread.interrupted();
    }
}

I have written a patch, but I am still debugging some problems with it.