This project forked from spotify/ratatool

A tool for data sampling, data generation, and data diffing

License: Apache License 2.0

Scala 80.99% Java 19.01%

Ratatool

A tool for random data sampling and generation

Features

  • ScalaCheck Generators - ScalaCheck generators (Gen[T]) for property-based testing with Scala case classes, Avro, Protocol Buffers and BigQuery TableRow.
  • IO - utilities for reading and writing records in Avro, Parquet (via Avro GenericRecord), BigQuery and TableRow JSON files. Local file system, HDFS and Google Cloud Storage are supported.
  • Samplers - random data samplers for Avro, BigQuery and Parquet. True random sampling is supported for Avro only, while head mode (sampling from the start) is supported for all sources.
  • Diffy - field-level record diff tool for Avro, Protobuf and BigQuery TableRow.
  • BigDiffy - Scio library for pairwise field-level statistical diff of data sets. See slides for more.
  • Command line tool - command line tool for running the local sampler or executing BigDiffy and BigSampler.
  • Shapeless - An extension for Case Class Diffing via Shapeless.

For more information or documentation, project level READMEs are provided.

Usage

If you use sbt, add the following dependency to your build file:

libraryDependencies += "com.spotify" %% "ratatool-scalacheck" % "0.3.10" % "test"

If needed, the following other libraries are published:

  • ratatool-diffy
  • ratatool-sampling

Or install via our Homebrew tap if you're on a Mac:

brew tap spotify/public
brew install ratatool
ratatool

Or download the release archive and run it:

wget https://github.com/spotify/ratatool/releases/download/v0.3.10/ratatool-cli-0.3.10.tar.gz
bin/ratatool directSampler
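
The two commands above leave out the unpacking step between download and run. A minimal sketch of the full sequence follows. The function names `release_url` and `fetch_ratatool` are hypothetical helpers, and the top-level directory name `ratatool-cli-<version>` inside the archive is an assumption that this README does not confirm; check the unpacked contents before relying on it.

```shell
# Build the release URL for a version, following the URL pattern used above.
release_url() {
  version="$1"
  echo "https://github.com/spotify/ratatool/releases/download/v${version}/ratatool-cli-${version}.tar.gz"
}

# Download and unpack one release, then print the expected launcher path.
# Assumption: the archive unpacks into ratatool-cli-<version>/
fetch_ratatool() {
  version="$1"
  archive="ratatool-cli-${version}.tar.gz"
  wget "$(release_url "$version")" &&
    tar -xzf "$archive" &&
    echo "ratatool-cli-${version}/bin/ratatool"
}
```

Calling `fetch_ratatool 0.3.10` would download and unpack the release, after which the printed path can be run with `directSampler` as its argument.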

The command line tool can be used to sample from local file system or Google Cloud Storage directly if Google Cloud SDK is installed and authenticated.

bin/ratatool bigSampler avro --head -n 1000 --in gs://path/to/dataset --out out.avro
bin/ratatool bigSampler parquet --head -n 1000 --in gs://path/to/dataset --out out.parquet

# write output to both JSON file and BigQuery table
bin/ratatool bigSampler bigquery --head -n 1000 --in project_id:dataset_id.table_id \
    --out out.json --tableOut project_id:dataset_id.table_id
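
When the same head-mode sample is needed in several formats, it can help to print the commands first and inspect them before running anything. The sketch below is a dry run only: `head_sample_cmd` is a hypothetical helper, and it uses just the flags that appear in the commands above (`--head`, `-n`, `--in`, `--out`).

```shell
# Print a head-mode bigSampler command for one format; nothing is executed.
head_sample_cmd() {
  format="$1"
  count="$2"
  input="$3"
  output="$4"
  printf 'bin/ratatool bigSampler %s --head -n %s --in %s --out %s\n' \
    "$format" "$count" "$input" "$output"
}

# Dry run for the two file-based formats shown above.
for format in avro parquet; do
  head_sample_cmd "$format" 1000 gs://path/to/dataset "out.$format"
done
```

The loop prints the Avro and Parquet commands with the same placeholder paths as above. Piping the output to `sh` would execute them.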

It can also be used to sample from HDFS if core-site.xml and hdfs-site.xml are available.

bin/ratatool bigSampler avro \
    --head -n 10 --in hdfs://namenode/path/to/dataset --out file:///path/to/out.avro

Or execute BigDiffy directly:

bin/ratatool bigDiffy \
    --input-mode=avro \
    --key=record.key \
    --lhs=gs://path/to/left \
    --rhs=gs://path/to/right \
    --output=gs://path/to/output \
    --runner=DataflowRunner ....

Development

Testing local changes to the CLI before releasing

To test local changes before release:

$ sbt
> project ratatoolCli
> packArchive

and then find the built CLI at ratatool-cli/target/ratatool-cli-{version}.tar.gz
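
Since the archive name includes the version, a small helper can pick up whatever was built most recently. This is a sketch under the assumption that the build output lands in `ratatool-cli/target` as stated above; `latest_cli_archive` is a hypothetical name, not part of the project.

```shell
# Print the most recently modified CLI archive in a build directory, if any.
# The directory defaults to the location named in the README.
latest_cli_archive() {
  target_dir="${1:-ratatool-cli/target}"
  ls -t "$target_dir"/ratatool-cli-*.tar.gz 2>/dev/null | head -n 1
}
```

From the repository root, `tar -xzf "$(latest_cli_archive)"` would then unpack the freshly built CLI for a local test run.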

License

Copyright 2016-2018 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
