hackathonclt

Welcome to hackathonclt!

User Creation

http://dev.hackathonclt.org:5000/

## Machines

Machines available to work on:

> slave01.hackathonclt.org

> slave02.hackathonclt.org

> slave03.hackathonclt.org

> slave04.hackathonclt.org

Please spread yourselves out across the machines.

NOTE: If you have any DNS issues, speak to staff.

Getting Started

SSH into one of the machines listed above, where you can access the retail data stored on the hackathon HDFS cluster, and enter the password you specified during user creation.
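As a small illustration of spreading out across the machines, the sketch below picks one of the four hosts from the Machines section at random and prints a matching ssh command. The username `alice` is a placeholder, not an account from this README; substitute the name you chose during user creation.

```python
import random

# Hosts listed in the Machines section.
HOSTS = [
    "slave01.hackathonclt.org",
    "slave02.hackathonclt.org",
    "slave03.hackathonclt.org",
    "slave04.hackathonclt.org",
]


def ssh_command(username, rng=random):
    """Return an ssh command line for a randomly chosen host."""
    host = rng.choice(HOSTS)
    return f"ssh {username}@{host}"


# "alice" is a hypothetical username used only for illustration.
print(ssh_command("alice"))
```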

We made Hive, Spark, and pySpark command-line interfaces available, and included a tool to compile and run simple Scalding scripts on-the-fly.

Hive

Give Hive a whirl and run a sample query:

> hive

Try pasting the following query into the hive command-line interface:

hive> select UPC_NUMBER, ITEM_DESCRIPTION, DEPARTMENT_DESCRIPTION, EXTENDED_PRICE_AMOUNT from hackathon_sample_real limit 10;

This will launch a (map-only) MapReduce job and return the specified fields for the first ten rows of the 'hackathon_sample_real' table.
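To see why no reduce phase is needed, note that the query only picks columns from each row independently and stops after ten rows; nothing is grouped or aggregated. The plain-Python sketch below mimics that projection-plus-limit on in-memory rows. The helper name `select_limit` and the two sample rows are made up for illustration and are not taken from the hackathon data.

```python
from itertools import islice

# The four columns named in the Hive query.
COLUMNS = [
    "UPC_NUMBER",
    "ITEM_DESCRIPTION",
    "DEPARTMENT_DESCRIPTION",
    "EXTENDED_PRICE_AMOUNT",
]


def select_limit(rows, columns, limit):
    """Mimic SELECT <columns> ... LIMIT <limit> over an iterable of dict rows."""
    return [tuple(row[c] for c in columns) for row in islice(rows, limit)]


# Hypothetical rows; the real table has more columns and far more rows.
rows = [
    {"UPC_NUMBER": 1001, "ITEM_DESCRIPTION": "EXAMPLE ITEM A",
     "DEPARTMENT_DESCRIPTION": "EXAMPLE DEPT", "EXTENDED_PRICE_AMOUNT": 2.5,
     "ITEM_QUANTITY": 1},
    {"UPC_NUMBER": 1002, "ITEM_DESCRIPTION": "EXAMPLE ITEM B",
     "DEPARTMENT_DESCRIPTION": "EXAMPLE DEPT", "EXTENDED_PRICE_AMOUNT": 4.0,
     "ITEM_QUANTITY": 2},
]

for record in select_limit(rows, COLUMNS, 10):
    print(record)
```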

Spark

Now give the Spark-shell a test:

> spark-shell

Read in the data and run a simple query that calculates the number of purchases for each UPC in the sample data:

val dataRDD = sc.textFile("hdfs://master.hackathonclt.org:8020/sample/data_with_headers/hackathon_data_headers")
val upcs = dataRDD.flatMap(line => line.split("\\|").take(1))
val wordCounts = upcs.map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.take(10)

pySpark

You can also run the same query using the Python version of the Spark shell.

> pyspark

dataRDD = sc.textFile("hdfs://master.hackathonclt.org:8020/sample/data_with_headers/hackathon_data_headers")
upcs = dataRDD.map(lambda line: line.split('|')[0])
wordCounts = upcs.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.take(10)
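The file name suggests the sample data begins with a header row. If so, the pipelines above would count the header's first field as if it were a UPC. The plain-Python sketch below (no Spark required) reproduces the same split-and-count logic on a few made-up lines and shows one way to skip the header. Whether the real file actually starts with a header line is an assumption, and `count_upcs` is a name introduced here for illustration.

```python
from collections import Counter


def count_upcs(lines, skip_header=True):
    """Count occurrences of the first bar-separated field on each line."""
    lines = iter(lines)
    if skip_header:
        next(lines, None)  # drop the first line, assumed to be a header
    return Counter(line.split("|", 1)[0] for line in lines)


# Hypothetical lines; the real file has many more columns and rows.
sample = [
    "UPC_NUMBER|ITEM_DESCRIPTION",
    "1001|EXAMPLE ITEM A",
    "1002|EXAMPLE ITEM B",
    "1001|EXAMPLE ITEM A",
]

print(count_upcs(sample))                     # header skipped
print(count_upcs(sample, skip_header=False))  # header counted like a UPC
```

In the Spark shells, a similar effect could probably be achieved by filtering out lines equal to `dataRDD.first()` before the map step.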

Scalding

In addition to the Hive and Spark shells, we're also packaging Eval-tool, a tool to compile and run Scalding scripts without having to create a project. If you create a file called test.scala with the following contents:

import com.twitter.scalding._
import com.tresata.scalding.Dsl._
import com.tresata.scalding.util.ScaldingUtil

(args: Args) => {
  new Job(args) {
    ScaldingUtil.sourceFromArg(args("input"))
      .groupBy('UPC_NUMBER) { _.size }
      .write(ScaldingUtil.sourceFromArg(args("output")))
  }
}

you can run a query on the data set sample from the command-line:

> eval-tool test.scala --hdfs --input bsv%/sample/data_with_headers/hackathon_data_headers --output bsv%upc_counts

This will generate a bar-separated file called 'upc_counts' in your HDFS home directory, containing the UPC numbers along with their total counts.
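Once the counts exist, a short script can rank them. The sketch below assumes each output line has the form `upc|count` and tolerates a header row if one is present; both points are assumptions about the output layout rather than something this README states. The function name `top_upcs` and the sample lines are illustrative only.

```python
def top_upcs(lines, n=5):
    """Parse 'upc|count' lines and return the n most frequent UPCs."""
    counts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        upc, count = line.split("|")
        if not count.isdigit():
            continue  # skip a header row, if the output has one
        counts.append((upc, int(count)))
    # Sort by count, largest first; ties keep their input order.
    return sorted(counts, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical output lines for illustration.
sample_output = ["1001|3", "1002|7", "1003|1"]

for upc, count in top_upcs(sample_output, n=2):
    print(upc, count)
```

On the cluster, the lines could be obtained with `hadoop fs -cat`, pointed at the part files inside `upc_counts` if the output turns out to be a directory, as is usual for Hadoop jobs.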

To access your HDFS location, use the hadoop fs commands (for reference, see http://www.folkstalk.com/2013/09/hadoop-fs-shell-command-example-tutorial.html). For example, to take a look at your home directory on HDFS, use

> hadoop fs -ls

or

> hadoop fs -ls /user/username
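If you would rather script these listings, the following sketch wraps the same `hadoop fs -ls` command with Python's `subprocess` module. The function names are introduced here for illustration, and `hdfs_ls` only works on a machine where the `hadoop` command-line tool is installed, such as the hackathon servers.

```python
import subprocess


def hdfs_ls_command(path=None):
    """Build the argument list for `hadoop fs -ls [path]`."""
    cmd = ["hadoop", "fs", "-ls"]
    if path is not None:
        cmd.append(path)  # with no path, the HDFS home directory is listed
    return cmd


def hdfs_ls(path=None):
    """Run the listing and return its output lines (requires the hadoop CLI)."""
    result = subprocess.run(
        hdfs_ls_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


# Show the command that would be run, without contacting the cluster.
print(" ".join(hdfs_ls_command("/user/username")))
```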

## Job Tracker

http://master.hackathonclt.org:50030

## Spark Job Tracker

http://master.hackathonclt.org:8080

## Namenode information

http://master.hackathonclt.org:50070

## Data Dictionary

| Field | Type | Description |
|---|---|---|
| UPC_NUMBER | long | unique product code of the item |
| MASTER_UPC_NUMBER | long | master UPC number; individual UPC numbers are grouped under it |
| ITEM_DESCRIPTION | string | describes the item |
| DEPARTMENT_NUMBER | long | department number |
| DEPARTMENT_DESCRIPTION | string | describes the department |
| CATEGORY_NUMBER | long | category number of the item |
| CATEGORY_DESCRIPTION | string | describes the category of the item |
| SUBCATEGORY_NUMBER | long | subcategory of the item |
| SUBCATEGORY_DESCRIPTION | string | describes the subcategory of the item |
| RECEIPT_NUMBER | string | receipt number of the purchase |
| ITEM_QUANTITY | long | how many items were bought |
| EXTENDED_PRICE_AMOUNT | float | actual sale per swipe |
| DISCOUNT_QUANTITY | float | number of coupons applied |
| EXTENDED_DISCOUNT_AMOUNT | float | amount discounted |
| TENDER_AMOUNT | float | amount tendered by the customer for the transaction |
| TRANSACTION_DATETIME | string | date of the transaction |
| EXPRESS_LANE | long | flag for whether the purchase was made through Express Lane, tagged to the receipt number; 1 means yes, 0 means no |
| HHID | string | household ID |
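The data dictionary can be turned into a small typed parser. The sketch below assumes that the columns appear in the file in the order listed above and that no field is empty; neither is confirmed by this README, although the Spark examples do treat the UPC as the first column. The sample line, including its date format, is made up for illustration.

```python
# Field names and Python types, in the order given by the data dictionary.
SCHEMA = [
    ("UPC_NUMBER", int),
    ("MASTER_UPC_NUMBER", int),
    ("ITEM_DESCRIPTION", str),
    ("DEPARTMENT_NUMBER", int),
    ("DEPARTMENT_DESCRIPTION", str),
    ("CATEGORY_NUMBER", int),
    ("CATEGORY_DESCRIPTION", str),
    ("SUBCATEGORY_NUMBER", int),
    ("SUBCATEGORY_DESCRIPTION", str),
    ("RECEIPT_NUMBER", str),
    ("ITEM_QUANTITY", int),
    ("EXTENDED_PRICE_AMOUNT", float),
    ("DISCOUNT_QUANTITY", float),
    ("EXTENDED_DISCOUNT_AMOUNT", float),
    ("TENDER_AMOUNT", float),
    ("TRANSACTION_DATETIME", str),
    ("EXPRESS_LANE", int),
    ("HHID", str),
]


def parse_record(line):
    """Convert one bar-separated line into a dict of typed values."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(values)}")
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, values)}


# Hypothetical record; values and date format are not from the real data.
sample_line = (
    "1001|1000|EXAMPLE ITEM|1|EXAMPLE DEPT|10|EXAMPLE CAT|100|EXAMPLE SUBCAT|"
    "R1|2|5.0|0.0|0.0|20.0|2014-01-01 12:00:00|0|H1"
)

record = parse_record(sample_line)
print(record["UPC_NUMBER"], record["EXTENDED_PRICE_AMOUNT"], record["EXPRESS_LANE"])
```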

Contributors

jackdwyer, andy327, adamferguson
