hschmidt / conductor

This project forked from bd2kgenomics/conductor

Efficient, distributed downloads of large files from S3 to HDFS using Spark.

License: Apache License 2.0

Scala 100.00%

conductor's Introduction

Conductor for Apache Spark provides efficient, distributed transfers of large files from S3 to HDFS and back.

Hadoop's distcp utility supports transfers to and from S3 but does not distribute the download of a single large file over multiple nodes. Amazon's s3distcp is intended to fill that gap but, to the best of our knowledge, has not been released as open source.

A cluster of ten r3.xlarge nodes downloaded a 288 GiB file in 377 seconds to an HDFS installation with replication factor 1, yielding an aggregate transfer rate of 782 MiB/s. For comparison, distcp typically gives you 50-80 MB/s on that instance type. A cluster of one hundred r3.xlarge nodes downloaded that same file in 80 seconds, yielding an aggregate transfer rate of 3.683 GiB/s.
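
The aggregate rate for the ten-node run follows from dividing the file size by the wall-clock time. A minimal sketch of that arithmetic, assuming a shell with awk available; the function name rate_mib_per_s is made up for illustration and is not part of Conductor:

```shell
# Hypothetical helper: aggregate transfer rate in MiB/s, rounded to the
# nearest integer, for SIZE_GIB gibibytes transferred in SECONDS seconds.
rate_mib_per_s() {
  awk -v gib="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", gib * 1024 / secs }'
}

# Ten-node run: 288 GiB in 377 seconds.
rate_mib_per_s 288 377   # prints 782
```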

Prerequisites

Run time:

  • JRE 1.7+
  • Spark cluster backed by HDFS

Build time:

  • JDK 1.7+
  • Scala SDK 2.10
  • Maven

Scala 2.11 and Java 1.8 may work, too; we simply have not tested them yet.

Usage

Downloads:

export AWS_ACCESS_KEY=...
export AWS_SECRET_KEY=...
spark-submit conductor-VERSION-distribution.jar \
             s3://BUCKET/KEY \
             hdfs://HOST[:PORT]/PATH \
             [--s3-part-size <value>] \
             [--hdfs-block-size <value>] \
             [--concat]

Uploads:

export AWS_ACCESS_KEY=...
export AWS_SECRET_KEY=...
spark-submit conductor-VERSION-distribution.jar \
             hdfs://HOST[:PORT]/PATH \
             s3://BUCKET/KEY \
             [--concat]

The --concat flag concatenates all parts of the file after the upload or download completes. The source path can point to either a file or a directory. If it points to a file, the parts are created with the specified part size; if it points to a directory, each part corresponds to a file in that directory. When downloading, concatenation only works if all parts except the last one are of equal size and that size is a multiple of the specified block size.
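
The download-side precondition for --concat can be checked ahead of time with integer arithmetic. The sketch below is an illustration only: the helper names part_size_ok and part_count are hypothetical and not part of Conductor, the example sizes are arbitrary, and all sizes are assumed to be given in bytes.

```shell
# Hypothetical helper: succeeds if the part size is a positive multiple of
# the HDFS block size (both in bytes), as --concat requires for downloads.
part_size_ok() {
  [ "$1" -gt 0 ] && [ "$2" -gt 0 ] && [ $(( $1 % $2 )) -eq 0 ]
}

# Hypothetical helper: number of parts a single file is split into,
# i.e. the file size divided by the part size, rounded up.
part_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

block=$((64 * 1024 * 1024))     # example block size: 64 MiB
part=$((128 * 1024 * 1024))     # example part size: 128 MiB
file=$((288 * 1024 * 1024 * 1024))  # example file size: 288 GiB

if part_size_ok "$part" "$block"; then
  echo "part size is a multiple of the block size"
fi
part_count "$file" "$part"      # prints 2304
```

Only the last part may be shorter than the others, which is why the part count is rounded up rather than down.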

When running Spark on YARN, you can supply the AWS access and secret keys by adding the following config flags to spark-submit:

--conf spark.yarn.appMasterEnv.AWS_ACCESS_KEY=... --conf spark.yarn.appMasterEnv.AWS_SECRET_KEY=...
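
Combined with the invocations from the usage section, these flags could be wrapped in a small shell function. This is a sketch, not something shipped with Conductor: the function name conductor_yarn_download and the CONDUCTOR_JAR variable are hypothetical, and the keys are assumed to be set in the calling environment.

```shell
# Hypothetical wrapper: run Conductor on YARN, forwarding the AWS keys from
# the calling environment to the application master via --conf flags.
# CONDUCTOR_JAR is an assumed variable holding the path to the
# conductor-VERSION-distribution.jar built for your cluster.
conductor_yarn_download() {
  spark-submit \
    --conf "spark.yarn.appMasterEnv.AWS_ACCESS_KEY=${AWS_ACCESS_KEY}" \
    --conf "spark.yarn.appMasterEnv.AWS_SECRET_KEY=${AWS_SECRET_KEY}" \
    "${CONDUCTOR_JAR}" "$@"
}

# Example call (placeholders as in the usage section):
#   conductor_yarn_download s3://BUCKET/KEY hdfs://HOST/PATH --concat
```

All arguments after the function name are passed through unchanged, so the source, destination and optional flags keep the order shown in the usage section.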

Tests

export AWS_ACCESS_KEY=...
export AWS_SECRET_KEY=...
spark-submit --conf spark.driver.memory=1G \
             --executor-memory 1G \
             conductor-integration-tests-0.4-SNAPSHOT-distribution.jar \
             -e -s edu.ucsc.cgl.conductor.ConductorIntegrationTests

Build

mvn package

You can customize the Spark and Hadoop versions to build against by setting the spark.version and hadoop.version properties, for example:

mvn package -Dspark.version=1.5.2 -Dhadoop.version=2.6.2

Caveats

  • Beta-quality
  • Uses Spark, not Yarn/MapReduce
  • Destination must be a full hdfs:// URL; the fs.default.name property is ignored
  • On failure, temporary files may be left behind
  • S3 credentials may be set via Java properties or environment variables as described in the AWS API documentation, but are not read from core-site.xml

Contributors

Hannes Schmidt created the first bare-bones implementation of distributed downloads from S3 to HDFS, originally called spark-s3-downloader.

Clayton Sanford made the HDFS block size and S3 part size configurable, added upload support and optional concatenation, and wrote integration tests. During his efforts the project was renamed Conductor.
