GithubHelp home page GithubHelp logo

nativetask's Introduction

Introduction

NativeTask is a high performance C++ API & runtime for Hadoop MapReduce. Why it is called NativeTask is that it is a native computing unit only focus on data processing, which is exactly what Task do in the Hadoop MapReduce context. In other word, NativeTask is not responsible for resource management, job Scheduling and fault-tolerance. Those are all managed by original Hadoop components as before, unchanged. But the actual data processing and computation, which consumes most of cluster resources, are delegated to this highly efficient data processing unit.

NativeTask is designed to be very fast, with native C++ API. So more efficient data analysis applications can build upon it, like LLVM based query execution engine mentioned in Google's Tenzing. Actually this is the main objective of NativeTask, to provide a efficient native Hadoop framework, so much more efficient data analyze tools can be built upon it:

  • Data warehousing tool using state of the art query execution techniques existing in parallel DBMSs, such as compression, vectorization, dynamic compilation, etc. These techniques are more easy to implement in native code, as we can see that most of these techniques are implemented using C/C++: Vectorwise, Vertica.

  • High performance data mining/machine learning libraries, most of these algorithms are CPU intensive, involving lot of numerical computation, or have been implemented using native languages already, a native runtime permits better performance, or easy porting these algorithms to Hadoop.

From user's perspective, NativeTask is a lot like Hadoop Pipes: using header files and dynamic libraries provided in NativeTask library, you compile your application or class library to a dynamic library rather than executable program(because we use JNI), then using a Submitter tool to submit you job to Hadoop cluster like streaming or pipes do. For more information, please read the design document and examples in src/main/native/examples.

Features

  1. High performance, more cost effective for your Hadoop cluster;
  2. C++ API, so user can develop native applications or apply more aggressive optimizations not available or convenient for java, like SSE/AVX instruction, LLVM, GPU computing, coprocessor etc.
  3. Support no sort, by removing sort, the shuffle stage barrier can be eliminated, yielding better data processing throughput;
  4. Support foldl style API, much faster for aggregation queries;
  5. Binary based MapReduce API, no serialization/deserialization overhead;
  6. Compatible with Hadoop 0.20-0.23(need task-delegation patch)

Notice

This project is in very early stages currently, and is not well documented. If you are familiar with Hadoop MapReduce, you can hack into the source code. For more informantion, please read the design document

Also you can find some discussion in Hadoop JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-2841

nativetask's People

Contributors

decster avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.