Herringbone

Herringbone is deprecated and is no longer being actively maintained.

Herringbone is a suite of tools for working with parquet files on hdfs, and with impala and hive.

The available commands are:

flatten: transform a directory of parquet files with a nested structure into a directory of parquet files with a flat schema that can be loaded into impala or hive (neither of which supports nested schemas). Default output directory is /path/to/input/directory-flat.

$ herringbone flatten -i /path/to/input/directory [-o /path/to/non/default/output/directory]

load: load a directory of parquet files (which must have a flat schema) into impala or hive (defaulting to impala). Use the --nocompute-stats option for faster loading into impala (but probably slower querying later on!)

$ herringbone load [--hive] [-u] [--nocompute-stats] -d db_name -t table -p /path/to/parquet/directory

tsv: transform a directory of parquet files into a directory of tsv files (which you can concat properly later with hadoop fs -getmerge /path/to/tsvs). Default output directory is /path/to/input/directory-tsv.

$ herringbone tsv -i /path/to/input/directory [-o /path/to/non/default/output/directory]

compact: transform a directory of parquet files into a directory of fewer larger parquet files. Default output directory is /path/to/input/directory-compact.

$ herringbone compact -i /path/to/input/directory [-o /path/to/non/default/output/directory]

See herringbone COMMAND --help for more information on a specific command.
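
Since load needs a flat schema, flatten and load are typically run back to back. The sketch below only prints the two commands rather than running them, and uses just the flags documented above. The helper names are hypothetical, and stripping a trailing slash before appending the suffix is an assumption about how the default output directory is derived.

```shell
#!/bin/sh
# Default output directory: the input path plus "-<suffix>", as the README
# describes. Stripping a trailing slash first is an assumption of this sketch.
default_output_dir() {
    input=${1%/}
    printf '%s-%s\n' "$input" "$2"
}

# Print (do not run) the flatten and load commands for one input directory.
# Arguments: input directory, database name, table name.
plan_flatten_and_load() {
    flat=$(default_output_dir "$1" flat)
    echo "herringbone flatten -i $1"
    echo "herringbone load -d $2 -t $3 -p $flat"
}

plan_flatten_and_load /path/to/input/directory db_name table
```

Printing the plan first makes it easy to check that the load step points at the flattened directory and not at the original nested one.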

Building

You'll need thrift 0.9.1 on your path.

$ git clone https://github.com/stripe/herringbone
$ cd herringbone
$ mvn package
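
Several of the build failures reported in the issues below come from a missing or mismatched thrift binary. A minimal pre-build check might look like the following. It assumes `thrift --version` prints a line of the form "Thrift version 0.9.1"; the function names are hypothetical and not part of herringbone.

```shell
#!/bin/sh
# Extract the version number from a line such as "Thrift version 0.9.1".
# Prints nothing if the line does not have that shape.
thrift_version_from() {
    printf '%s\n' "$1" | sed -n 's/^Thrift version \([0-9][0-9.]*\).*$/\1/p'
}

# Succeed only if the thrift on PATH reports the expected version.
check_thrift() {
    expected=$1
    if ! command -v thrift >/dev/null 2>&1; then
        echo "thrift not found on PATH; the build needs thrift $expected" >&2
        return 1
    fi
    found=$(thrift_version_from "$(thrift --version)")
    if [ "$found" != "$expected" ]; then
        echo "found thrift $found, expected $expected" >&2
        return 1
    fi
}
```

With these functions defined, `check_thrift 0.9.1 && mvn package` would start the build only when the expected compiler is present.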

Authors

herringbone's People

Contributors

alyssa-stripe, cji-stripe, colinmarc, daniellesucher, jbalogh, jbalogh-stripe, jocelyn-stripe, nati-stripe, rob-stripe, thairu, thairu-stripe

herringbone's Issues

Facebook dependency not found error

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /home/darouwan/herringbone/herringbone-impala/target/generated-sources/thrift/com/facebook/fb303/FacebookService.java:[425,7] cannot find symbol
symbol: method sendBaseOneway(java.lang.String,com.facebook.fb303.FacebookService.reinitialize_args)
location: class com.facebook.fb303.FacebookService.Client
[ERROR] /home/darouwan/herringbone/herringbone-impala/target/generated-sources/thrift/com/facebook/fb303/FacebookService.java:[436,7] cannot find symbol
symbol: method sendBaseOneway(java.lang.String,com.facebook.fb303.FacebookService.shutdown_args)
location: class com.facebook.fb303.FacebookService.Client
[INFO] 2 errors
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Herringbone Impala ................................. FAILURE [ 29.045 s]
[INFO] Herringbone Main ................................... SKIPPED
[INFO] Herringbone ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 29.218 s
[INFO] Finished at: 2018-12-24T09:48:18+00:00
[INFO] Final Memory: 20M/410M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project herringbone-impala: Compilation failure: Compilation failure:
[ERROR] /home/darouwan/herringbone/herringbone-impala/target/generated-sources/thrift/com/facebook/fb303/FacebookService.java:[425,7] cannot find symbol
[ERROR] symbol: method sendBaseOneway(java.lang.String,com.facebook.fb303.FacebookService.reinitialize_args)
[ERROR] location: class com.facebook.fb303.FacebookService.Client
[ERROR] /home/darouwan/herringbone/herringbone-impala/target/generated-sources/thrift/com/facebook/fb303/FacebookService.java:[436,7] cannot find symbol
[ERROR] symbol: method sendBaseOneway(java.lang.String,com.facebook.fb303.FacebookService.shutdown_args)
[ERROR] location: class com.facebook.fb303.FacebookService.Client
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Exit w/an understandable error message if you try to load only one partition

Partitioned tables can't be loaded one partition at a time - you have to load the parent directory containing all the partitions instead. This is an easy mistake to make, though! So if someone tries it, give them an error message that actually makes sense and tells them what they should do instead.

(It might be worth adding a way to load only one partition also/instead, but that seems a bit more confusing to me at the moment.)
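
One possible shape for such a check, sketched as a shell wrapper rather than as a change to herringbone itself. It assumes partitions follow the Hive-style `key=value` directory naming; the function names are hypothetical.

```shell
#!/bin/sh
# Return 0 if the last path component looks like a Hive-style partition
# directory ("key=value"). That layout is an assumption of this sketch.
looks_like_partition() {
    name=$(basename "$1")
    case $name in
        ?*=?*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Refuse a single-partition path with a message that says what to do instead.
check_load_path() {
    if looks_like_partition "$1"; then
        echo "error: $1 looks like a single partition;" \
             "load the parent directory $(dirname "$1") instead" >&2
        return 1
    fi
}
```

A caller could run `check_load_path "$dir"` before `herringbone load ... -p "$dir"` and stop early when it fails.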

maven compile failed with errors herringbone-impala: thrift did not exit cleanly

I was trying to compile the project on Mac OS X with Java 1.7 and got errors.
Please see complete error trace here:
https://gist.github.com/edvorkin/a3be9a914980de74a45b

[INFO] Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Herringbone Impala
[INFO] Herringbone Main
[INFO] Herringbone
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Herringbone Impala 0.0.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ herringbone-impala ---
[INFO] Deleting /Users/eugene.dvorkin/IdeaProjects/herringbone/herringbone-impala/target
[INFO]
[INFO] --- maven-thrift-plugin:0.1.11:compile (thrift-sources) @ herringbone-impala ---
[ERROR] thrift failed output:
[ERROR] thrift failed error: /bin/sh: thrift: command not found

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Herringbone Impala ................................. FAILURE [ 0.541 s]
[INFO] Herringbone Main ................................... SKIPPED
[INFO] Herringbone ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.641 s
[INFO] Finished at: 2014-12-07T10:15:47-05:00
[INFO] Final Memory: 8M/245M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.thrift.tools:maven-thrift-plugin:0.1.11:compile (thrift-sources) on project herringbone-impala: thrift did not exit cleanly. Review output for more information. -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.thrift.tools:maven-thrift-plugin:0.1.11:compile (thrift-sources) on project herringbone-impala: thrift did not exit cleanly. Review output for more information.
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:347)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:154)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:582)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:214)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoFailureException: thrift did not exit cleanly. Review output for more information.
at org.apache.thrift.maven.AbstractThriftMojo.execute(AbstractThriftMojo.java:177)
at org.apache.thrift.maven.ThriftCompileMojo.execute(ThriftCompileMojo.java:21)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
... 19 more
[ERROR]
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Readme lacks important install tips

how to install thrift, needed to build

  • On ubuntu 18.04: sudo apt install thrift-compiler should still get you 0.9.1
  • Otherwise: conda install thrift gets you 0.11 and you need the patches below

optional patches

  • Prevent `Error: unknown option java:hashcode` with recent thrift versions by adding the <generator> line shown in the diff below.
  • Prevent the following error by setting the libthrift version to match the system thrift binary, as in the diff below:
    [ERROR] .../herringbone/herringbone-impala/target/generated-sources/thrift/org/apache/hadoop/hive/metastore/api/ThriftHiveMetastore.java:[6655,7] method does not override or implement a method from a supertype
diff --git i/herringbone-impala/pom.xml w/herringbone-impala/pom.xml
--- i/herringbone-impala/pom.xml
+++ w/herringbone-impala/pom.xml
@@ -59,6 +59,7 @@
         <configuration>
           <checkStaleness>true</checkStaleness>
           <thriftExecutable>thrift</thriftExecutable>
+          <generator>java</generator>
         </configuration>
         <executions>
           <execution>
@@ -105,7 +106,7 @@
     <dependency>
       <groupId>org.apache.thrift</groupId>
       <artifactId>libthrift</artifactId>
-      <version>0.12.0</version>
+      <version>0.11.0</version>
     </dependency>
     <dependency>
       <groupId>org.slf4j</groupId>

hadoop 2 is needed to run

Hadoop 2 is needed; otherwise there are weird errors, presumably due to a guava version change or mismatch.

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
tar -xvf hadoop-2.9.2.tar.gz
echo '
export HADOOP_HOME=$HOME/herringbone/hadoop-2.9.2   # <-- adapt the following
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
' > hadoop_env.sh

Then you'll need to source ./hadoop_env.sh before running bin/herringbone --help

`flatten` silently loses data

My parquet has a list of struct as follows (extract of fastparquet schema output):

| - info: OPTIONAL
| | - cause: BYTE_ARRAY, UTF8, OPTIONAL
|   - classes: LIST, OPTIONAL
|     - list: REPEATED
|       - element: OPTIONAL
|       | - class_id: BYTE_ARRAY, UTF8, OPTIONAL
|         - posting: INT32, OPTIONAL

While some bits are correctly flattened to e.g. info__cause, the classes bit is silently lost: there is no info__classes (I presume because it's a list of struct).
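
Until flatten handles this case, one workaround is to scan the schema for LIST fields before flattening, so that columns at risk are known up front. The sketch below assumes the schema dump has the fastparquet text layout shown above; `list_fields` is a hypothetical helper, and it only flags candidates without confirming what flatten will actually drop.

```shell
#!/bin/sh
# Read a fastparquet-style schema dump on stdin and print the name of every
# field whose type is LIST, one per line.
list_fields() {
    sed -n 's/^[| ]*- \([^:]*\): LIST.*$/\1/p'
}

# Example with the schema from this report; prints: classes
list_fields <<'EOF'
| - info: OPTIONAL
| | - cause: BYTE_ARRAY, UTF8, OPTIONAL
|   - classes: LIST, OPTIONAL
|     - list: REPEATED
|       - element: OPTIONAL
|       | - class_id: BYTE_ARRAY, UTF8, OPTIONAL
|         - posting: INT32, OPTIONAL
EOF
```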
