
impatient's Introduction

Cascading for the Impatient

Welcome to Cascading for the Impatient, a tutorial for Cascading 3.1.x to get you started. Quickly. Like, yesterday.

This set of progressive coding examples starts with a simple file copy and builds up to a MapReduce implementation of the TF-IDF algorithm.

You can read the full series here: http://docs.cascading.org/impatient/

If you have a question or run into any problems, send an email to the cascading-user-list.

Part 1

  • Implements simplest Cascading app possible
  • Copies each TSV line from source tap to sink tap
  • Takes about a dozen lines of code
  • Physical plan: 1 Mapper

Part 2

  • Implements a simple example of WordCount
  • Uses a regex to split the input text lines into a token stream
  • Generates a DOT file, to show the Cascading flow graphically
  • Physical plan: 1 Mapper, 1 Reducer
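The regex-split-then-count idea can be sketched in plain Java, without any Cascading classes. This is only an illustration of the technique, not the tutorial's code: the class name `WordCountSketch`, the delimiter set, and the sample lines are all assumptions.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class WordCountSketch {
    // Assumed delimiter set: whitespace, brackets, parentheses, commas, periods.
    private static final Pattern DELIMS = Pattern.compile("[\\s\\[\\](),.]+");

    // Splits every line into tokens and counts how often each token occurs.
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMS.split(line)) {
                if (token.isEmpty()) {
                    continue; // a line that starts with a delimiter yields an empty first token
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(Arrays.asList("a rain shadow", "rain, rain (again)"));
        // Print tab-separated token/count pairs, one per line.
        counts.forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

In a MapReduce setting the split step would run in the mapper and the per-token summation in the reducer, which matches the 1 Mapper, 1 Reducer plan listed above.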

Part 3

  • Uses a custom Function to scrub the token stream
  • Discusses when to use standard Operations vs. creating custom ones
  • Physical plan: 1 Mapper, 1 Reducer

Part 4

  • Shows how to use a HashJoin on two pipes
  • Filters a list of stop words out of the token stream
  • Physical plan: 1 Mapper, 1 Reducer
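As a rough plain-Java analogy, filtering stop words amounts to checking each token against a small in-memory set, which is the role the stop-word side plays in a hash join. The sketch below does not use the Cascading API; `StopWordFilterSketch` and the sample words are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilterSketch {
    // Keeps only the tokens that do not appear in the stop-word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a"));
        List<String> tokens = Arrays.asList("the", "rain", "a", "shadow");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```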

Part 5

  • Calculates TF-IDF using an ExpressionFunction
  • Shows how to use a CountBy, SumBy, and a CoGroup
  • Physical plan: 10 Mappers, 8 Reducers
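For orientation, here is a minimal plain-Java sketch of one common TF-IDF formulation: the term count multiplied by the log of the document count over one plus the document frequency. The exact expression used in Part 5 may differ. The names `tfCount`, `dfCount`, and `nDocs` are assumptions chosen for this example.

```java
class TfIdfSketch {
    // tfCount: occurrences of the term in one document
    // dfCount: number of documents that contain the term
    // nDocs:   total number of documents
    static double tfidf(long tfCount, long dfCount, long nDocs) {
        // The 1.0 in the denominator avoids division by zero when dfCount is 0.
        return (double) tfCount * Math.log((double) nDocs / (1.0 + dfCount));
    }

    public static void main(String[] args) {
        // A term seen twice in a document, present in 1 of 4 documents.
        System.out.println(tfidf(2, 1, 4));
    }
}
```

In a pipeline, the three inputs would come from separate counting steps that are then joined per term before the formula is applied.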

Part 6

  • Includes unit tests in the build
  • Shows how to use other TDD features: checkpoints, assertions, traps, debug
  • Physical plan: 11 Mappers, 8 Reducers

Part 7

This example is currently not implemented.

Part 8

  • Scalding equivalents of previous examples in Cascading

impatient's People

Contributors

ceteri, cwensel, dvryaboy, fs111, rdesmond, stmcpherson, supreetoberoi, zac-hopkinson

impatient's Issues

error when run impatient example in IDEA

My Hadoop version is hadoop-2.5.0-cdh5.2.0. Everything works fine when I:

  1. start HDFS and YARN
  2. follow the tutorial and run: hadoop jar ./build/libs/impatient.jar data/rain.txt output/rain

Then I tried to run the demo in IDEA.
I use a Maven project, and my pom.xml looks like this:
<hadoop.version>2.5.0-cdh5.2.0</hadoop.version>
<cascading.version>2.6.1</cascading.version>

Dependencies (groupId:artifactId:version):

  • driven:driven-plugin:1.2-eap-5
  • cascading:cascading-core:${cascading.version}
  • cascading:cascading-local:${cascading.version}
  • cascading:cascading-hadoop:${cascading.version}
  • cascading:cascading-hadoop2-mr1:${cascading.version}
  • cascading:cascading-xml:${cascading.version}
  • cascading:cascading-platform:${cascading.version} (test scope)

Then I copied Main.java and ran it. Since I set inPath and outPath to the local filesystem, I didn't start Hadoop.

Error:

Exception in thread "main" cascading.flow.FlowException: unhandled exception
at cascading.flow.BaseFlow.complete(BaseFlow.java:918)
at com.zqh.cascading.impatient.Copy.main(Copy.java:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.mapred.LocalJobRunner.<init>(Lorg/apache/hadoop/conf/Configuration;)V
at org.apache.hadoop.mapred.LocalClientProtocolProvider.create(LocalClientProtocolProvider.java:42)
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:472)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:450)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:107)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:207)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:150)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

There are also errors when I start Hadoop and run it.
Does this error mean I am using the wrong Hadoop version, since my cascading-hadoop artifact is cascading-hadoop2-mr1? Does Cascading not support MR2 or YARN?
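A NoSuchMethodError generally means the class found at runtime differs from the one the calling code was compiled against, so one possible first step is to check which jar a class was actually loaded from. The helper below is a hypothetical diagnostic sketch using only the JDK. `WhichJar` is an invented name, and the default class name is taken from the stack trace above.

```java
import java.security.CodeSource;

class WhichJar {
    // Returns the jar or directory a class was loaded from, or a placeholder
    // when the JVM does not expose a code source (e.g. for core JDK classes).
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, or fall back to the class from the stack trace.
        String name = args.length > 0 ? args[0] : "org.apache.hadoop.mapred.LocalJobRunner";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

Run with the same classpath as the failing application. If the reported jar belongs to a different Hadoop line than expected, the dependency list may be pulling in conflicting Hadoop artifacts.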

Build fails with Java 7, Gradle 2.0, and Hadoop 2.4.1 on Ubuntu 14.04

When I try to build:

me@computer:~/github/Impatient$ ~/dev/gradle-2.0/bin/gradle build.gradle    

FAILURE: Build failed with an exception.

* Where:
Build file '/home/me/github/Impatient/part1/build.gradle' line: 43

* What went wrong:
A problem occurred evaluating project ':part1'.
> You can't change configuration 'providedCompile' because it is already resolved!

I have cleared my .m2 repo, but that didn't help.

My version info

Gradle

me@computer:~/github/Impatient$ ~/dev/gradle-2.0/bin/gradle --version

------------------------------------------------------------
Gradle 2.0
------------------------------------------------------------

Build time:   2014-07-01 07:45:34 UTC
Build number: none
Revision:     b6ead6fa452dfdadec484059191eb641d817226c

Groovy:       2.3.3
Ant:          Apache Ant(TM) version 1.9.3 compiled on December 23 2013
JVM:          1.7.0_55 (Oracle Corporation 24.51-b03)
OS:           Linux 3.13.0-32-generic amd64

Hadoop

me@computer:~/github/Impatient$ ~/dev/hadoop-2.4.1/bin/hadoop version
Hadoop 2.4.1
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1604318
Compiled by jenkins on 2014-06-21T05:43Z
Compiled with protoc 2.5.0
From source with checksum bb7ac0a3c73dc131f4844b873c74b630
This command was run using /home/me/dev/hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar

Cannot get Gradle to build

$ gradle clean jar

FAILURE: Build failed with an exception.

  • Where:
    Build file '/home/lina/dev/workspaces/Impatient/part1/build.gradle' line: 31
  • What went wrong:
    A problem occurred evaluating root project 'part1'.
    Cause: You must specify a urls for a Maven repo.
  • Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 1.785 secs

Some details:

$ gradle --version

Gradle 1.0-milestone-3

Gradle build time: Thursday, September 8, 2011 4:06:52 PM UTC
Groovy: 1.8.6
Ant: Apache Ant(TM) version 1.8.2 compiled on December 3 2011
Ivy: non official version
JVM: 1.6.0_24 (Sun Microsystems Inc. 20.0-b12)
OS: Linux 3.2.0-31-generic amd64

I tried this solution, but it didn't work:
http://www.baselogic.com/blog/development/gradle-repositories-syntax-changed-cause-url-maven-repository/

part1 doesn't compile

On d9d5eec

fish:part1 dirkraft$ gradle jar
:compileJava

FAILURE: Build failed with an exception.

* What went wrong:
Could not resolve all dependencies for configuration ':providedCompile'.
> Artifact 'org.codehaus.jackson:jackson-jaxrs:1.7.1@jar' not found.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 4.209 secs
fish:part1 dirkraft$

issue in "gradle clean jar"

While running the command gradle clean jar I got this. By the way, I have Hadoop 1.2.1 and Cascading 2.6, I put the Cascading jars in hadoop/lib, and the Cascading for the Impatient documentation says it works with Hadoop 1 and Hadoop 2.
co
