Microsoft Azure DocumentDB Hadoop Connector

This project provides a client library in Java that allows Microsoft Azure DocumentDB to act as an input source or output sink for MapReduce, Hive and Pig jobs.

Download

Option 1: Via Github

To get the binaries of this library as distributed by Microsoft, ready for use within your project, you can use GitHub releases.

Option 2: Source Via Git

To get the source code of the connector via git just type:

git clone git://github.com/Azure/azure-documentdb-hadoop.git

Option 3: Source Zip

To download a copy of the source code, click "Download ZIP" on the right side of the page or click here.

Option 4: Via Maven

To get the binaries of this library as distributed by Microsoft, ready for use within your project, you can use Maven.

<dependency>
	<groupId>com.microsoft.azure</groupId>
	<artifactId>azure-documentdb-hadoop</artifactId>
	<version>1.2.0</version>
</dependency>

Option 5: HDInsight

Install the DocumentDB Hadoop Connector onto HDInsight clusters through custom action scripts. Full instructions can be found here.

Requirements

  • Java Development Kit 7

Supported Versions

  • Apache Hadoop & YARN 2.4.0
    • Apache Pig 0.12.1
    • Apache Hive & HCatalog 0.13.1
  • Apache Hadoop & YARN 2.6.0
    • Apache Pig 0.14.0
  • Apache Hive & HCatalog 0.14.0
  • HDI 3.1 (Getting started with HDInsight)
  • HDI 3.2

Dependencies

  • Microsoft Azure DocumentDB Java SDK 1.6.0 (com.microsoft.azure / azure-documentdb / 1.6.0)

When using Hive:

  • OpenX Technologies JsonSerde 1.3.1-SNAPSHOT (org.openx.data / json-serde-parent / 1.3.1-SNAPSHOT); the GitHub repo can be found here

Please download the jars and add them to your build path.
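
If you build with Maven, the DocumentDB Java SDK dependency can be declared directly, using the coordinates listed above; the Hive JsonSerde is a SNAPSHOT artifact, so it may need to be built from its GitHub repo and added to the build path manually:

<dependency>
	<groupId>com.microsoft.azure</groupId>
	<artifactId>azure-documentdb</artifactId>
	<version>1.6.0</version>
</dependency>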

Usage

To use this client library with Azure DocumentDB, you need to first create an account.

MapReduce

Configuring input and output from DocumentDB Example
    // Import Hadoop Connector Classes
    import com.microsoft.azure.documentdb.Document;
    import com.microsoft.azure.documentdb.hadoop.ConfigurationUtil;
    import com.microsoft.azure.documentdb.hadoop.DocumentDBInputFormat;
    import com.microsoft.azure.documentdb.hadoop.DocumentDBOutputFormat;
    import com.microsoft.azure.documentdb.hadoop.DocumentDBWritable;

    // Set Configurations
    Configuration conf = new Configuration();
    final String host = "Your DocumentDB Endpoint";
    final String key = "Your DocumentDB Primary Key";
    final String dbName = "Your DocumentDB Database Name";
    final String inputCollNames = "Your DocumentDB Input Collection Name[s]";
    final String outputCollNames = "Your DocumentDB Output Collection Name[s]";
    final String query = "[Optional] Your DocumentDB Query";
    final String outputStringPrecision = "[Optional] Number of bytes to use for String indexes";
    final String offerType = "[Optional] Your performance level for Output Collection Creations";
    final String upsert = "[Optional] Bool to disable or enable document upsert";

    conf.set(ConfigurationUtil.DB_HOST, host);
    conf.set(ConfigurationUtil.DB_KEY, key);
    conf.set(ConfigurationUtil.DB_NAME, dbName);
    conf.set(ConfigurationUtil.INPUT_COLLECTION_NAMES, inputCollNames);
    conf.set(ConfigurationUtil.OUTPUT_COLLECTION_NAMES, outputCollNames);
    conf.set(ConfigurationUtil.QUERY, query);
    conf.set(ConfigurationUtil.OUTPUT_STRING_PRECISION, outputStringPrecision);
    conf.set(ConfigurationUtil.OUTPUT_COLLECTIONS_OFFER, offerType);
    conf.set(ConfigurationUtil.UPSERT, upsert);
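
With the configuration set, the connector's formats can be wired into a MapReduce job. The following is a minimal sketch, not part of the connector itself: it assumes the standard Hadoop classes org.apache.hadoop.mapreduce.Job and org.apache.hadoop.io.Text, and MyMapper/MyReducer are hypothetical placeholders for your own job classes.

    // Additional Hadoop imports assumed by this sketch
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    // Create the job from the Configuration built above
    Job job = Job.getInstance(conf, "documentdb-example");

    // Read from and write to DocumentDB through the connector's formats
    job.setInputFormatClass(DocumentDBInputFormat.class);
    job.setOutputFormatClass(DocumentDBOutputFormat.class);

    // MyMapper and MyReducer are hypothetical placeholders; the output
    // value type is assumed here to be the connector's DocumentDBWritable
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DocumentDBWritable.class);

    // Submit the job and wait for completion
    System.exit(job.waitForCompletion(true) ? 0 : 1);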

Full MapReduce sample can be found here.

Hive

Loading data from DocumentDB Example
    CREATE EXTERNAL TABLE DocumentDB_Hive_Table( COLUMNS )
    STORED BY 'com.microsoft.azure.documentdb.hive.DocumentDBStorageHandler'
    tblproperties (
        'DocumentDB.endpoint' = 'Your DocumentDB Endpoint',
        'DocumentDB.key' = 'Your DocumentDB Primary Key',
        'DocumentDB.db' = 'Your DocumentDB Database Name',
        'DocumentDB.inputCollections' = 'Your DocumentDB Input Collection Name[s]',
        'DocumentDB.query' = '[Optional] Your DocumentDB Query' );
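
Once the external table is defined, it can be queried like any other Hive table, for example:

    SELECT COUNT(*) FROM DocumentDB_Hive_Table;
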
Storing data to DocumentDB Example
    CREATE EXTERNAL TABLE Hive_DocumentDB_Table( COLUMNS )
    STORED BY 'com.microsoft.azure.documentdb.hive.DocumentDBStorageHandler' 
    tblproperties ( 
        'DocumentDB.endpoint' = 'Your DocumentDB Endpoint', 
        'DocumentDB.key' = 'Your DocumentDB Primary Key', 
        'DocumentDB.db' = 'Your DocumentDB Database Name', 
        'DocumentDB.outputCollections' = 'Your DocumentDB Output Collection Name[s]',
        'DocumentDB.outputStringPrecision' = '[Optional] Number of bytes to use for String indexes',
        'DocumentDB.outputCollectionsOffer' = '[Optional] Your performance level for Output Collection Creations',
        'DocumentDB.upsert' = '[Optional] Bool to disable or enable document upsert');
    INSERT INTO TABLE Hive_DocumentDB_Table
    SELECT COLUMNS FROM Your_Hive_Source_Table;

Full Hive sample can be found here.

Pig

Loading data from DocumentDB Example
    documents = LOAD 'Your DocumentDB Endpoint'
    USING com.microsoft.azure.documentdb.hadoop.pig.DocumentDBLoader( 
        'Your DocumentDB Primary Key', 
        'Your DocumentDB Database Name',
        'Your DocumentDB Input Collection Name[s]',
        '[Optional] Your DocumentDB SQL Query' );
Storing data to DocumentDB Example
    STORE data INTO 'Your DocumentDB Endpoint'
    USING com.microsoft.azure.documentdb.hadoop.pig.DocumentDBStorage(
        'Your DocumentDB Primary Key',
        'Your DocumentDB Database Name',
        'Your DocumentDB Output Collection Name[s]',
        '[Optional] Your performance level for Output Collection Creations',
        '[Optional] Number of bytes to use for String indexes',
        '[Optional] Bool to disable or enable document upsert');
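
Putting the two together, a minimal sketch of a script that copies documents from an input collection to an output collection (placeholders as above; this sketch assumes the optional trailing arguments can simply be omitted):

    documents = LOAD 'Your DocumentDB Endpoint'
    USING com.microsoft.azure.documentdb.hadoop.pig.DocumentDBLoader(
        'Your DocumentDB Primary Key',
        'Your DocumentDB Database Name',
        'Your DocumentDB Input Collection Name[s]');
    STORE documents INTO 'Your DocumentDB Endpoint'
    USING com.microsoft.azure.documentdb.hadoop.pig.DocumentDBStorage(
        'Your DocumentDB Primary Key',
        'Your DocumentDB Database Name',
        'Your DocumentDB Output Collection Name[s]');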

Full Pig sample can be found here.

Remarks

  • When outputting to DocumentDB, your output collection will require capacity for an additional stored procedure. The stored procedure will remain in your collection for reuse.
  • The Hadoop Connector automatically sets your indexes to range indexes with max precision on strings and numbers. More information can be found here.
  • The connector supports a configurable upsert option. Upsert defaults to true and will overwrite documents in the target collection that have the same id.
  • Reads and writes to DocumentDB will be counted against your provisioned throughput for each collection.
  • Output to DocumentDB collections is written in batches, round-robin across the output collections.
  • The connector supports a configurable offer option, which sets the performance tier of newly created collections (this does not apply when outputting to an already existing collection).
  • The connector supports output to partitioned collections, but it will not automatically create partitioned collections for Hadoop job outputs.

Need Help?

Be sure to check out the Microsoft Azure Developer Forums on MSDN or the Developer Forums on Stack Overflow if you have trouble with the provided code. Also, check out our tutorial for more information.

Contribute Code or Provide Feedback

If you would like to become an active contributor to this project please follow the instructions provided in Azure Projects Contribution Guidelines.

If you encounter any bugs with the library please file an issue in the Issues section of the project.
