isterin / cascalog

This project forked from nathanmarz/cascalog

Data processing on Hadoop without the hassle.

License: GNU General Public License v3.0

cascalog's Introduction

About

Cascalog is a tool for processing data on Hadoop with Clojure in a concise, expressive, and highly readable manner. Cascalog combines two cutting-edge technologies, Clojure and Hadoop, and resurrects an old one, Datalog. Cascalog is high performance, flexible, and robust.

Most query languages, such as SQL, Pig, and Hive, are custom languages, which leads to a great deal of accidental complexity. Constructing queries dynamically through string manipulation is error-prone and introduces further problems, such as SQL injection attacks. Because Cascalog is a domain-specific language embedded in Clojure, it avoids this accidental complexity and lets a programmer manipulate queries as first-class entities within the language. Cascalog's Datalog syntax is simpler and more expressive than SQL-based languages.
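As a rough illustration of queries as first-class values, the sketch below builds a query from an ordinary Clojure function argument instead of from a string. It assumes a REPL started inside the Cascalog checkout, and it assumes the `cascalog.playground` namespace with its `bootstrap` macro and sample `age` dataset, as used in the "Introducing Cascalog" tutorial. The function name `older-than` is hypothetical and is not part of Cascalog.

```clojure
;; Assumption: run from `lein repl` inside the cascalog checkout.
(use 'cascalog.api)
;; Assumption: the playground from the "Introducing Cascalog" tutorial,
;; which provides sample datasets such as `age` ([person age] tuples).
(use 'cascalog.playground)
(bootstrap)

;; Hypothetical helper: returns a query for people older than `min-age`.
;; `min-age` is a plain Clojure value, so no string splicing is involved.
(defn older-than [min-age]
  (<- [?person]
      (age ?person ?a)
      (> ?a min-age)))

;; The query is just a value; execute it and print the results to stdout.
(?- (stdout) (older-than 30))
```

Because `older-than` returns a query rather than running one, the same value can be passed around, stored, or used as a subquery before anything executes.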

Follow the getting started steps, check out the tutorial, and you'll be running Cascalog queries on your local computer within 5 minutes.

Getting started

  1. Make sure you have Java 1.6 installed
  2. export JAVA_OPTS=-Xmx768m
  3. Install Leiningen
  4. git clone git://github.com/nathanmarz/cascalog.git
  5. cd cascalog && lein deps && lein compile
  6. Optionally, run "lein test" to make sure the tests pass

The entire Cascalog API is defined within src/clj/cascalog/api.clj . Helpers for testing queries can be found in src/clj/cascalog/testing.clj .

Tutorials

  1. Introducing Cascalog
  2. New Cascalog features: outer joins, combiners, sorting, and more
  3. News Feed in 38 lines of code using Cascalog
  4. Cascalog features for consuming wide taps
  5. Predicate macros

Running Cascalog queries on a Hadoop cluster

  1. Cascalog includes Hadoop as a dependency so that you can experiment with it easily. Don't include the Hadoop jars in the jar that contains Cascalog.
  2. Cascalog requires Cascading 1.1.
  3. Any custom operations must be compiled into the jar you give to Hadoop for running jobs.

Questions?

Google group: cascalog-user

IM: Come chat in the #cascading room on freenode

Priorities for Cascalog development

  1. Replicated and bloom joins
  2. Cross query optimization: push constants and filters down into subqueries when possible
  3. Negations, e.g. "people who like dogs and don't like cats": (<- [?p] (likes ?p "dogs") (likes ?p "cats" :> false)) [implement with multigroupby of some sort]
  4. Disjunction, e.g. "all people over 30 years old and all males": (<- [?p] [(age ?p ?a) (> ?a 30)] [(gender ?p "m")])
  5. Recursion, e.g. "all ancestry relations": (<- [?a ?p] [(parent ?a ?p)] [(parent ?a ?p2) (recur ?p2 ?p)])

Acknowledgements

YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.

Cascalog is based on a very early branch of the cascading-clojure project (http://github.com/clj-sys/cascading-clojure). Special thanks to Bradford Cross and Mark McGranaghan for their work on that project. Much of that code appears within Cascalog in either its original or a modified form.

cascalog's People

Contributors

nathanmarz, gavinheavyside, ztellman
