spotify / ratatool

A tool for data sampling, data generation, and data diffing

License: Apache License 2.0

Scala 84.49% Java 15.51%

scala scalacheck avro parquet bigquery protobuf

ratatool's Introduction

Ratatool

A tool for random data sampling and generation

Features

ScalaCheck Generators - ScalaCheck generators (Gen[T]) for property-based testing of Scala case classes, Avro, Protocol Buffers, and BigQuery TableRow.
IO - utilities for reading and writing records in Avro, Parquet (via Avro GenericRecord), BigQuery and TableRow JSON files. Local file system, HDFS and Google Cloud Storage are supported.
Samplers - random data samplers for Avro, BigQuery and Parquet. True random sampling is supported for Avro only while head mode (sampling from the start) is supported for all sources.
Diffy - field-level record diff tool for Avro, Protobuf and BigQuery TableRow.
BigDiffy - Scio library for pairwise field-level statistical diff of data sets. See slides for more.
Command line tool - command line tool for local sampler, or executing BigDiffy and BigSampler.
Shapeless - An extension for Case Class Diffing via Shapeless.

For more information and documentation, see the project-level READMEs.

Usage

If you use sbt, add the following dependency to your build file:

libraryDependencies += "com.spotify" %% "ratatool-scalacheck" % "0.3.10" % "test"

If needed, the following other libraries are published:

ratatool-diffy
ratatool-sampling

Or install via our Homebrew tap if you're on a Mac:

brew tap spotify/public
brew install ratatool
ratatool

Or download the release jar and run it.

wget https://github.com/spotify/ratatool/releases/download/v0.3.10/ratatool-cli-0.3.10.tar.gz
bin/ratatool directSampler

The command line tool can be used to sample from local file system or Google Cloud Storage directly if Google Cloud SDK is installed and authenticated.

bin/ratatool bigSampler avro --head -n 1000 --in gs://path/to/dataset --out out.avro
bin/ratatool bigSampler parquet --head -n 1000 --in gs://path/to/dataset --out out.parquet

# write output to both JSON file and BigQuery table
bin/ratatool bigSampler bigquery --head -n 1000 --in project_id:dataset_id.table_id \
    --out out.json --tableOut project_id:dataset_id.table_id

It can also be used to sample from HDFS if core-site.xml and hdfs-site.xml are available.

bin/ratatool bigSampler avro \
    --head -n 10 --in hdfs://namenode/path/to/dataset --out file:///path/to/out.avro

Or execute BigDiffy directly:

bin/ratatool bigDiffy \
    --input-mode=avro \
    --key=record.key \
    --lhs=gs://path/to/left \
    --rhs=gs://path/to/right \
    --output=gs://path/to/output \
    --runner=DataflowRunner ....

Development

Testing local changes to the CLI before releasing

To test local changes before release:

$ sbt
> project ratatoolCli
> packArchive

and then find the built CLI at ratatool-cli/target/ratatool-cli-{version}.tar.gz

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

ratatool's Issues

BigSampler should allow sampling for a distribution (e.g. stratification)

Also potentially Uniform distribution

AvroGenerator intermittently returns duplicates

Calling AvroGenerator.avroOf[T] repeatedly will sometimes generate two identical results. Not sure if this qualifies as a pure bug, but there is a great risk of it leading to flaky test cases for those who are unaware of it.

The likely cause is that the underlying RandomData is seeded with System.currentTimeMillis by default, which may not change between calls made in quick succession.

Example:

  val t1 = System.currentTimeMillis
  val element1 = AvroGenerator.avroOf[StartContext]
  val t2 = System.currentTimeMillis
  val element2 = AvroGenerator.avroOf[StartContext]
  val t3 = System.currentTimeMillis
  val element3 = AvroGenerator.avroOf[StartContext]
  println(s"**** t1: $t1, t2: $t2, t3: $t3")
  println(s"**** e1 == e2: ${element1 == element2}, e2 == e3: ${element2 == element3}")
  println(s"**** e1 eq e2: ${element1 eq element2}, e2 eq e3: ${element2 eq element3}")

Produces the following output:

**** t1: 1503492962231, t2: 1503492962500, t3: 1503492962500
**** e1 == e2: false, e2 == e3: true
**** e1 eq e2: false, e2 eq e3: false

The issue can be worked around by inserting a Thread.sleep(1) between the calls.

Suggested solutions are to either allow the caller to pass in a seed, or to add a generalized version of AvroGenerator.avroOf[T] that returns a sequence of items, using the underlying RandomData.iterator().
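The seeding problem and the two suggested fixes can be sketched with the standard library alone. This is only an illustration: `generate` and `generateMany` are hypothetical stand-ins, not part of AvroGenerator or RandomData, and `scala.util.Random` takes the place of the real record generator.

```scala
import scala.util.Random

object SeedDemo {
  // Hypothetical stand-in for a generator that accepts a caller-supplied seed.
  def generate(seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    Seq.fill(5)(rng.nextInt(1000))
  }

  // Hypothetical "sequence of items" variant: one generator, seeded once,
  // is advanced for every item instead of being re-created per call.
  def generateMany(seed: Long, n: Int): Seq[Seq[Int]] = {
    val rng = new Random(seed)
    Seq.fill(n)(Seq.fill(5)(rng.nextInt(1000)))
  }

  def main(args: Array[String]): Unit = {
    // Two calls that observe the same clock value get the same seed,
    // and therefore identical output.
    val sameMillis = 1503492962500L
    println(generate(sameMillis) == generate(sameMillis))
    // Items drawn from a single shared generator.
    generateMany(sameMillis, 3).foreach(println)
  }
}
```

With an explicit seed, tests become reproducible by choice rather than by accident, and the sequence variant avoids re-seeding between items.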

Sampler and IO classes should support Iterator

For performance reasons, and when sampling a large number of records to file, etc.

Improve output readability of BigDiffy

Currently the output is hard to understand without knowing the code

Add ratatool to homebrew

BigSampler fails on finding files in GCS

INFO: Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: File `gs://endsong.ap.raw-events.spotify.net/com.spotify.LogRecord.ap.EndSong/2018-03-23/10/part-54337a5a9386bf99cf9a0ee5213a6996-000050.avro` does not exist!
INFO:     at scala.Predef$.require(Predef.scala:277)
INFO:     at com.spotify.ratatool.io.AvroIO$.getAvroSchemaFromFile(AvroIO.scala:101)

Deprecate existing generators in favour of ratatool-scalacheck

Upgrade protobuf compiler (protoc) to version 3 and support V3 protobuf messages

We currently compile protobuf with protoc -v261 despite depending on the latest protoc-jar (3.3.0.1).

Across compilers v2 and v3, the internal representation of a generic message was changed from GeneratedMessage to GeneratedMessageV3 which means functions that expect evidence of that type will fail when compiled with v3. For example, protobufOf expects GeneratedMessage so users of ratatool compiling with v3 will not be able to use this function.

Both GeneratedMessage and GeneratedMessageV3 extend AbstractMessage. Suggestion is to support both compiler versions by having evidence of AbstractMessage.

Keep guava in step with Scio

Currently we are on 0.21 while Scio is on 0.20. Should investigate downgrade and potential issues. Concerned this may cause unexpected errors when running.

Gen with dependency

Case:
Gen[T] is a function on values from Gen[U] and user needs to be able to specify/guarantee this behaviour

Remove assembly steps from build.sbt

Also involves updating Travis. Removing in favour of sbt pack as a way to simplify dep issues and more cleanly address #72

ParquetSampler slow for large files

Potentially better method for setting related/dependent fields

As an example, if we have a schema where the fields country and city have an implicit relationship (the city should be within the country), we currently have to either pass the Gen[Country] around and map to a City during the amend, or create a Gen[(Country, City)] and set the relevant fields using that.

This isn't terrible, but it could probably be improved.
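The country/city relationship can be sketched as generating the pair together, so the city is always drawn from the cities of the chosen country. This is a standard-library sketch, not ratatool's API: `citiesByCountry`, `pick` and `countryAndCity` are hypothetical names and the data is made up. With ScalaCheck, the same shape would be a for-comprehension over Gen.

```scala
import scala.util.Random

object DependentFields {
  // Illustrative data only; not taken from the project.
  val citiesByCountry: Map[String, Vector[String]] = Map(
    "SE" -> Vector("Stockholm", "Gothenburg"),
    "US" -> Vector("New York", "Boston")
  )

  // Pick one element uniformly at random from a non-empty vector.
  def pick[A](xs: Vector[A], rng: Random): A = xs(rng.nextInt(xs.size))

  // Choose the country first, then choose the city from that country's cities,
  // so the dependent field can never contradict the field it depends on.
  def countryAndCity(rng: Random): (String, String) = {
    val country = pick(citiesByCountry.keys.toVector.sorted, rng)
    val city = pick(citiesByCountry(country), rng)
    (country, city)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(1L)
    (1 to 3).foreach(_ => println(countryAndCity(rng)))
  }
}
```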

Support case classes in Diffy

We can derive Diffy[T] for case classes using shapeless or magnolia similar to this: https://github.com/nevillelyh/shapeless-datatype/tree/neville/diff/core/src/main/scala/shapeless/datatype/diff.

Also refactor Diffy[T] for Avro/BigQuery/Protobuf, etc. to share the same interface.

Split repository into separate packages

Generators, Diffy, Sampling should be split to keep dependencies separate

`.amend` for Protobuf Generators

Similar to existing implicits for TableRow, SpecificRecord. Potentially expensive because of how writing to Proto output stream works.

Support Avro schema compiled with 1.7.4

Those classes extend IndexedRecord but not GenericRecord.

Ignore fields in BigDiffy

Tool/CLI should provide a cleaner wrapper around other parts of the project

Currently it is mainly a wrapper for sampler, but the jar can be directly invoked to run BigDiffy or BigSampler. It would be better to have the CLI wrap all of this functionality to be easier to use.

Take in multiple functions in amend

An example of this could be

.amend(intGen)(m => i => {
  m.setField1(i)
  m.setField2(dependentIntFunc(i))
})

would become

.amend(intGen)(_.setField1, _.setField2)

FileSystems returns file not found when file exists

exception when running BigDiffy

command used:

java -cp ratatool-0.2.5.jar com.spotify.ratatool.diffy.BigDiffy --mode=avro --key=MyKey --lhs=gs://mybucket/*.avro --rhs=gs://myotherbucket/*.avro --output=gs://myoutputbucket/output --runner=DataflowRunner

Exception:
[ForkJoinPool-1-worker-5] ERROR org.apache.beam.runners.dataflow.util.MonitoringUtil$LoggingHandler - 2018-03-26T14:05:53.829Z: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkState(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.beam.sdk.io.PatchedAvroSource.readMetadataFromFile(PatchedAvroSource.java:334)
at org.apache.beam.sdk.io.PatchedAvroSource$AvroReader.startReading(PatchedAvroSource.java:664)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:471)
at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
at com.google.cloud.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:579)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:347)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:183)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:148)
at com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:68)
at com.google.cloud.dataflow.worker.DataflowWorker.executeWork(DataflowWorker.java:336)
at com.google.cloud.dataflow.worker.DataflowWorker.doWork(DataflowWorker.java:294)
at com.google.cloud.dataflow.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:244)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:135)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:115)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:102)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Look into more granular Key level stats for BigDiffy

Would allow deeper statistical analysis from users who need it. E.g. building histogram of distances per key for field foo

Wiki should link to project level directories

Now that we have READMEs in almost all project-level directories, the README should be updated to point there instead of at the code.

Generators should work for `arbitrary[T]`

Should be fairly simple for the existing Avro, TableRow, and Protobuf generators. Either verify that they already work, or implement support if not.

README links out of date

The links to different parts of the repository in the README have been broken since the project structure was refactored.

sbt-assembly not packaging all dependencies

Update to Scio 0.5.0

BigSampler fails to run with fat jar

Exception in thread "main" java.lang.IllegalAccessError: tried to access class org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl from class com.spotify.ratatool.samplers.BigSamplerBigQuery$
	at com.spotify.ratatool.samplers.BigSamplerBigQuery$.sampleBigQueryTable(BigSampler.scala:427)
	at com.spotify.ratatool.samplers.BigSampler$.singleInput(BigSampler.scala:142)
	at com.spotify.ratatool.samplers.BigSampler$.main(BigSampler.scala:162)
	at com.spotify.ratatool.samplers.BigSampler.main(BigSampler.scala)

Use farmhash instead of murmur for BigSampler

https://github.com/google/farmhash is available in Guava. This would allow us to use the built-in hashing in BigQuery (https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#farm_fingerprint).
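Why the choice of hash matters can be sketched with deterministic, hash-based sampling: a record is kept when the hash of its key lands in the lowest fraction of buckets, so any system that computes the same hash makes the same keep/drop decision. This is an assumption-laden sketch, not BigSampler's actual implementation: `keep` and `buckets` are hypothetical, and the standard library's MurmurHash3 stands in for whichever hash function is used.

```scala
import scala.util.hashing.MurmurHash3

object HashSampling {
  // Keep a record when its hashed key falls in the lowest `fraction` of buckets.
  // The decision depends only on the key, so it is repeatable across runs.
  def keep(key: String, fraction: Double, buckets: Int = 10000): Boolean = {
    // floorMod maps negative hash values into the range [0, buckets).
    val bucket = java.lang.Math.floorMod(MurmurHash3.stringHash(key), buckets)
    bucket < fraction * buckets
  }

  def main(args: Array[String]): Unit = {
    val keys = (1 to 1000).map(i => s"user-$i")
    val sampled = keys.filter(keep(_, 0.1))
    println(s"kept ${sampled.size} of ${keys.size} keys")
  }
}
```

Swapping the hash for one that BigQuery also exposes would let the same selection be reproduced in SQL, which is the motivation given above.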

Use sbt 1.x

Generator Examples should have a short writeup README

This should make it easier to onboard new users by referring to documentation

Generating data for a distribution

This requires some investigation. A first pass could simply be a clean API for composing a set of Gen.oneOf calls. It could later become more complex: given a type of distribution, randomly generate values from it.
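As a rough sketch of the simplest case, a discrete distribution can be expressed as weighted choices. The code below uses only the standard library; `pick` is a hypothetical helper and not a proposed ratatool API.

```scala
import scala.util.Random

object WeightedChoice {
  // Pick one value, with probability proportional to its integer weight.
  def pick[A](weighted: Seq[(Int, A)], rng: Random): A = {
    require(weighted.nonEmpty && weighted.forall(_._1 > 0), "weights must be positive")
    val total = weighted.map(_._1).sum
    // Draw a position in [0, total) and walk the weights until it is covered.
    var remaining = rng.nextInt(total)
    weighted.find { case (weight, _) =>
      if (remaining < weight) true
      else { remaining -= weight; false }
    }.get._2
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(1L)
    val dist = Seq(8 -> "free", 2 -> "premium")
    val counts = Seq.fill(1000)(pick(dist, rng)).groupBy(identity).map { case (k, v) => k -> v.size }
    println(counts)
  }
}
```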

Generator Example Docs should refer to other tests in the repo

Currently the examples are fairly straightforward and only show one possible way of using the generators. We could refer to other tests as examples of testing with pipelines: https://github.com/spotify/ratatool/blob/master/ratatool-sampling/src/test/scala/com/spotify/ratatool/samplers/BigSamplerTest.scala#L211

Possibly can construct similar tests for examples module but maybe overly complicated to do so

Add or update examples to show usage of amend2

The current examples were written prior to amend2. We can update them to show the distinction, or potentially both ways of setting dependent fields across multiple records.

Minimize Dependencies of ratatoolCommon, ratatoolScalacheck

ratatoolCommon currently has these two as dependencies:

      "com.google.cloud.bigdataoss" % "gcs-connector" % gcsVersion,
      "org.apache.hadoop" % "hadoop-client" % hadoopVersion exclude ("org.slf4j", "slf4j-log4j12"),

These should be pulled out, as I don't think they're actually required in common, and moved to only the libs that need them, since they pull in Hadoop and other fairly large dependencies.

Consistency across ProtoBuf and Avro generators

In the past, creating a sample of either ProtoBuf or Avro types followed the same pattern:
val r = ProtoBufGenerator.protoBufOf[Foo] and
val r = AvroGenerator.avroOf[Foo].

This has changed in v0.3.0-beta1 to:
val r = ProtoBufGeneratorOps.protoBufOf[Foo].sample.get and
val r = AvroGeneratorOps.specificRecordOf[Foo].sample.get
where the methods on generator objects now diverge.

It would be nice if access patterns remain consistent

Improve doc and packaging for BigDiffy

Right now there's a one-line description of BigDiffy in the README. The class is in the release assembly jar, but this is not obvious to users since the mainClass is Tool and one has to run java -cp ratatool.jar ...BigDiffy.... We should improve this.

Split up into sub projects

Support Protobuf in BigDiffy CLI

Currently the CLI does not support Protobuf. ProtobufDiffy needs to pull in the descriptor implicitly so that it can crawl all fields, so we need a way to provide this information from CLI if we will support it.

Clearer documentation on usage of BigDiffy, BigSampler

Readme should have clear example and usage

Remove Beam dependency from Ratatool-scalacheck

https://github.com/spotify/ratatool/pull/43/files/8c7130cf6d52d2f11345d0d7f60d5743f3261c47#diff-fdc3abdfd754eeb24090dbd90aeec2ceR205

Look into cleaning up Beam dependency and using Kryo instead of CoderUtils in tests and what the effect of this will be.

BigDiffy should allow for nested unordered repeated fields

Currently, if a field is REPEATED and specified in unordered, the list is sorted by doing .toString. This works for some cases but breaks down when there is an inner field which is repeated and should also be unordered.

Example:

first_repeated: [
  {key: firstval, second_repeated:[a, b, c]},
  {key: secondval, second_repeated:[1,2,3]}
]

diffed with

first_repeated: [
  {key: secondval, second_repeated:[2,1,3]},
  {key: firstval, second_repeated:[a, c, b]}
]

In this case, assume unordered = Set("first_repeated", "first_repeated.second_repeated"). Currently, first_repeated would be converted with .toString and then produce a Delta, even though a user would expect both sides to resolve to the same thing.
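One possible approach, sketched here under stated assumptions, is to normalize records recursively before diffing: inner repeated fields are sorted first, and only then is the outer list sorted. The `Value` model, `render` and `normalize` are hypothetical and do not reflect BigDiffy's internals; a canonical rendering is used as the sort key in place of .toString.

```scala
object UnorderedNormalize {
  // Hypothetical minimal model of a nested record.
  sealed trait Value
  final case class Leaf(v: String) extends Value
  final case class Record(fields: Map[String, Value]) extends Value
  final case class Repeated(items: List[Value]) extends Value

  // Canonical string form with field names sorted, used only as a sort key.
  def render(value: Value): String = value match {
    case Leaf(v)         => v
    case Record(fields)  =>
      fields.toSeq.sortBy(_._1).map { case (k, v) => s"$k=${render(v)}" }.mkString("{", ",", "}")
    case Repeated(items) => items.map(render).mkString("[", ",", "]")
  }

  // Normalize children before sorting the parent, so nested unordered lists
  // are already in canonical order when the outer sort key is computed.
  def normalize(value: Value, unordered: Set[String], path: String = ""): Value = value match {
    case Record(fields) =>
      Record(fields.map { case (name, v) =>
        val childPath = if (path.isEmpty) name else s"$path.$name"
        name -> normalize(v, unordered, childPath)
      })
    case Repeated(items) =>
      val normalized = items.map(normalize(_, unordered, path))
      if (unordered.contains(path)) Repeated(normalized.sortBy(render)) else Repeated(normalized)
    case leaf: Leaf => leaf
  }

  // Build one element of first_repeated from the example above.
  def row(key: String, inner: String*): Value =
    Record(Map("key" -> Leaf(key), "second_repeated" -> Repeated(inner.map(Leaf(_)).toList)))

  def main(args: Array[String]): Unit = {
    val unordered = Set("first_repeated", "first_repeated.second_repeated")
    val lhs = Record(Map("first_repeated" ->
      Repeated(List(row("firstval", "a", "b", "c"), row("secondval", "1", "2", "3")))))
    val rhs = Record(Map("first_repeated" ->
      Repeated(List(row("secondval", "2", "1", "3"), row("firstval", "a", "c", "b")))))
    println(normalize(lhs, unordered) == normalize(rhs, unordered))
  }
}
```

Normalizing bottom-up is the design choice that matters here: sorting the outer list before the inner lists are canonical is what produces the spurious Delta described above.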