spotify / ratatool

A tool for data sampling, data generation, and data diffing

License: Apache License 2.0

Scala 84.49% Java 15.51%
scala scalacheck avro parquet bigquery protobuf

ratatool's Issues

README links out of date

The links to different parts of the repository in the README are broken since refactoring the project structure

BigDiffy should allow for nested unordered repeated fields

Currently, if a field is REPEATED and listed in unordered, the list is sorted using .toString. This works in some cases but breaks down when there is an inner field which is itself repeated and should also be unordered.

Example:

first_repeated: [
  {key: firstval, second_repeated:[a, b, c]},
  {key: secondval, second_repeated:[1,2,3]}
]

diffed with

first_repeated: [
  {key: secondval, second_repeated:[2,1,3]},
  {key: firstval, second_repeated:[a, c, b]}
]

In this case, assume unordered = Set("first_repeated", "first_repeated.second_repeated"). Currently the elements of first_repeated are converted with .toString for sorting, so the differing order inside second_repeated produces a Delta, even though a user would expect both sides to resolve to the same thing.
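One possible direction, sketched here with plain Scala collections rather than BigDiffy's actual record types: normalize each record bottom-up, sorting a repeated field only when its dotted path appears in unordered, and compare the normalized results. The names NestedUnordered, canonical and normalize are hypothetical and are not part of ratatool.

```scala
object NestedUnordered {
  // Deterministic string form used only as a sort key. Map entries are sorted
  // so the key does not depend on the order in which fields were inserted.
  def canonical(value: Any): String = value match {
    case m: Map[_, _] =>
      m.toSeq.map { case (k, v) => s"$k=${canonical(v)}" }.sorted.mkString("{", ",", "}")
    case s: Seq[_] => s.map(v => canonical(v)).mkString("[", ",", "]")
    case other     => String.valueOf(other)
  }

  // Normalizes children first, then sorts the repeated field at `path`
  // if that path is listed in `unordered`. Elements of a repeated field
  // share the path of the field itself.
  def normalize(value: Any, path: String, unordered: Set[String]): Any = value match {
    case m: Map[_, _] =>
      m.map { case (k, v) =>
        val childPath = if (path.isEmpty) k.toString else s"$path.$k"
        k.toString -> normalize(v, childPath, unordered)
      }
    case s: Seq[_] =>
      val children = s.map(v => normalize(v, path, unordered))
      if (unordered.contains(path)) children.sortBy(c => canonical(c)) else children
    case other => other
  }
}
```

Because inner lists are sorted before the outer sort key is computed, the two records in the example above normalize to the same value when both paths are listed in unordered, and still differ when only first_repeated is listed.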

Support Protobuf in BigDiffy CLI

Currently the CLI does not support Protobuf. ProtobufDiffy needs to pull in the descriptor implicitly so that it can crawl all fields, so we need a way to provide this information from CLI if we will support it.

Upgrade protobuf compiler (protoc) to version 3 and support V3 protobuf messages

Currently we compile protobuf with protoc -v261 despite depending on the latest protoc-jar (3.3.0.1).

Across compilers v2 and v3, the internal representation of a generic message was changed from GeneratedMessage to GeneratedMessageV3 which means functions that expect evidence of that type will fail when compiled with v3. For example, protobufOf expects GeneratedMessage so users of ratatool compiling with v3 will not be able to use this function.

Both GeneratedMessage and GeneratedMessageV3 extend AbstractMessage. The suggestion is to support both compiler versions by requiring evidence of AbstractMessage instead.
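The effect of widening the bound can be sketched without the protobuf dependency. In the block below the three message base classes are local stand-ins that only mirror the hierarchy described above (the real ones live in com.google.protobuf), and LegacyEvent, CurrentEvent, describeV2Only and describe are hypothetical names; this is not ratatool's protobufOf.

```scala
import scala.reflect.ClassTag

object EvidenceSketch {
  // Local stand-ins mirroring the com.google.protobuf hierarchy from the issue.
  abstract class AbstractMessage
  abstract class GeneratedMessage extends AbstractMessage   // parent of v2-compiled messages
  abstract class GeneratedMessageV3 extends AbstractMessage // parent of v3-compiled messages

  final class LegacyEvent extends GeneratedMessage    // hypothetical v2-compiled message
  final class CurrentEvent extends GeneratedMessageV3 // hypothetical v3-compiled message

  // Bounded on GeneratedMessage: describeV2Only[CurrentEvent] does not compile.
  def describeV2Only[T <: GeneratedMessage: ClassTag]: String =
    implicitly[ClassTag[T]].runtimeClass.getName

  // Bounded on the shared parent: accepts messages from either compiler.
  def describe[T <: AbstractMessage: ClassTag]: String =
    implicitly[ClassTag[T]].runtimeClass.getName
}
```

With the wider bound, both EvidenceSketch.describe[EvidenceSketch.LegacyEvent] and EvidenceSketch.describe[EvidenceSketch.CurrentEvent] type-check, which is the behaviour the issue asks for.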

AvroGenerator intermittently returns duplicates

Calling AvroGenerator.avroOf[T] repeatedly will sometimes generate two identical results. Not sure if this qualifies as a pure bug, but there is a great risk of it leading to flaky test cases for those who are unaware of it.

The cause is likely to be that the underlying RandomData by default is seeded with System.currentTimeMillis, which may sometimes not change between fast subsequent calls.

Example:

  val t1 = System.currentTimeMillis
  val element1 = AvroGenerator.avroOf[StartContext]
  val t2 = System.currentTimeMillis
  val element2 = AvroGenerator.avroOf[StartContext]
  val t3 = System.currentTimeMillis
  val element3 = AvroGenerator.avroOf[StartContext]
  println(s"**** t1: $t1, t2: $t2, t3: $t3")
  println(s"**** e1 == e2: ${element1 == element2}, e2 == e3: ${element2 == element3}")
  println(s"**** e1 eq e2: ${element1 eq element2}, e2 eq e3: ${element2 eq element3}")

Produces the following output:

**** t1: 1503492962231, t2: 1503492962500, t3: 1503492962500
**** e1 == e2: false, e2 == e3: true
**** e1 eq e2: false, e2 eq e3: false

The issue can be worked around by inserting a Thread.sleep(1) between the calls.

Suggested solutions could be to either allow the caller to pass in a seed, or to add a generalized version of AvroGenerator.avroOf[T] that returns a sequence of items, using the underlying RandomData.iterator().
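The suspected mechanism and both suggested fixes can be illustrated with scala.util.Random alone. This is a minimal sketch, not the AvroGenerator or RandomData code: record and records are hypothetical helpers, and a Seq[Int] stands in for a generated Avro record.

```scala
import scala.util.Random

object SeedCollision {
  // A fresh Random per call. If two calls receive the same seed (for example
  // the same millisecond timestamp), they return identical "records".
  def record(seed: Long): Seq[Int] = {
    val rnd = new Random(seed)
    Seq.fill(5)(rnd.nextInt(1000))
  }

  // One Random shared by the whole batch: each record continues the same
  // pseudo-random stream instead of restarting it from the seed.
  def records(seed: Long, n: Int): Seq[Seq[Int]] = {
    val rnd = new Random(seed)
    Seq.fill(n)(Seq.fill(5)(rnd.nextInt(1000)))
  }
}
```

Passing the seed explicitly makes collisions reproducible and avoidable by the caller, while the batch variant corresponds to the suggestion of returning a sequence drawn from a single underlying iterator.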

`.amend` for Protobuf Generators

Similar to the existing implicits for TableRow and SpecificRecord. Potentially expensive because of how writing to the Proto output stream works.

Consistency across ProtoBuf and Avro generators

In the past, creating a sample of either ProtoBuf or Avro types followed the same pattern:
val r = ProtoBufGenerator.protoBufOf[Foo] and
val r = AvroGenerator.avroOf[Foo].

This has changed in v0.3.0-beta1 to:
val r = ProtoBufGeneratorOps.protoBufOf[Foo].sample.get and
val r = AvroGeneratorOps.specificRecordOf[Foo].sample.get
where the methods on generator objects now diverge.

It would be nice if the access patterns remained consistent.

Gen with dependency

Case:
Gen[T] is a function of values from Gen[U], and the user needs to be able to specify/guarantee this behaviour.

BigSampler fails to run with fat jar

Exception in thread "main" java.lang.IllegalAccessError: tried to access class org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl from class com.spotify.ratatool.samplers.BigSamplerBigQuery$
	at com.spotify.ratatool.samplers.BigSamplerBigQuery$.sampleBigQueryTable(BigSampler.scala:427)
	at com.spotify.ratatool.samplers.BigSampler$.singleInput(BigSampler.scala:142)
	at com.spotify.ratatool.samplers.BigSampler$.main(BigSampler.scala:162)
	at com.spotify.ratatool.samplers.BigSampler.main(BigSampler.scala)

Minimize Dependencies of ratatoolCommon, ratatoolScalacheck

ratatoolCommon currently has these two as dependencies:

      "com.google.cloud.bigdataoss" % "gcs-connector" % gcsVersion,
      "org.apache.hadoop" % "hadoop-client" % hadoopVersion exclude ("org.slf4j", "slf4j-log4j12"),

These should be pulled out of common, as I don't think they are actually required there, and moved to only the libs that need them, since they pull in Hadoop and other fairly large dependencies.

Keep guava in step with Scio

Currently we are on 0.21 while Scio is on 0.20. Should investigate downgrade and potential issues. Concerned this may cause unexpected errors when running.

BigSampler fails on finding files in GCS

INFO: Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: File `gs://endsong.ap.raw-events.spotify.net/com.spotify.LogRecord.ap.EndSong/2018-03-23/10/part-54337a5a9386bf99cf9a0ee5213a6996-000050.avro` does not exist!
INFO:     at scala.Predef$.require(Predef.scala:277)
INFO:     at com.spotify.ratatool.io.AvroIO$.getAvroSchemaFromFile(AvroIO.scala:101)

Take in multiple functions in amend

An example of this could be

.amend(intGen)(m => i => {
  m.setField1(i)
  m.setField2(dependentIntFunc(i))
})

would become

.amend(intGen)(_.setField1, _.setField2)

Generating data for a distribution

Requires some investigation. A first pass can simply be a clean API for composing a set of Gen.oneOf calls. It could later become more complex, e.g. taking a type of distribution and randomly generating values from it.
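As a starting point for that investigation, a discrete distribution can be expressed as weighted choices. ScalaCheck already provides Gen.frequency for this; the sketch below shows the underlying idea with only the standard library, and WeightedChoice.weighted is a hypothetical helper, not a proposed ratatool API.

```scala
import scala.util.Random

object WeightedChoice {
  // Picks one value with probability proportional to its integer weight.
  def weighted[T](choices: Seq[(Int, T)], rnd: Random): T = {
    require(choices.nonEmpty && choices.forall(_._1 > 0), "weights must be positive")
    // Running totals of the weights, e.g. weights 9, 1 become 9, 10.
    val cumulative = choices.map(_._1).scanLeft(0)(_ + _).tail
    val roll = rnd.nextInt(cumulative.last) // uniform in [0, total weight)
    // The first bucket whose running total exceeds the roll wins.
    val index = cumulative.indexWhere(roll < _)
    choices(index)._2
  }
}
```

For example, WeightedChoice.weighted(Seq(9 -> "common", 1 -> "rare"), new Random()) returns "common" roughly nine times out of ten. Continuous distributions would need a different building block.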

Improve doc and packaging for BigDiffy

Right now there's a one-line description of BigDiffy in the README. The class is in the release assembly jar, but this is not obvious to users since the mainClass is Tool and one has to run java -cp ratatool.jar ...BigDiffy.... We should improve this.

Potentially better method for setting related/dependent fields

As an example, if we have a schema where the fields country and city have an implicit relationship (city should be within country), currently we have to pass the Gen[Country] around and map to a City during the amend, or we have to create a Gen[(Country, City)] and set the relevant fields using that.

This isn't terrible, but maybe it can be improved.
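For reference, the tuple-based workaround described above looks roughly like the following. To keep the sketch self-contained it uses a tiny stand-in Gen rather than ScalaCheck's (which offers the same map/flatMap shape), plain strings instead of Country and City types, and a hypothetical lookup table.

```scala
import scala.util.Random

object DependentFields {
  // Minimal stand-in for a generator: a function from a random source to a value.
  final case class Gen[A](sample: Random => A) {
    def map[B](f: A => B): Gen[B] = Gen(r => f(sample(r)))
    def flatMap[B](f: A => Gen[B]): Gen[B] = Gen(r => f(sample(r)).sample(r))
  }

  def oneOf[A](xs: Seq[A]): Gen[A] = Gen(r => xs(r.nextInt(xs.size)))

  // Hypothetical relationship between the two fields.
  val citiesByCountry: Map[String, Seq[String]] = Map(
    "SE" -> Seq("Stockholm", "Gothenburg"),
    "US" -> Seq("New York", "Boston"))

  val countryGen: Gen[String] = oneOf(citiesByCountry.keys.toSeq)

  // The city generator is chosen based on the generated country, so every
  // pair respects the relationship by construction.
  val locationGen: Gen[(String, String)] = for {
    country <- countryGen
    city    <- oneOf(citiesByCountry(country))
  } yield (country, city)
}
```

The pair can then be used to set both fields in a single amend. An improved API would presumably hide this tupling from the caller.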

Look into reducing Guava deps

Mostly it is pulled in to maintain consistency across libs, but this potentially causes issues. Should investigate further.

exception when running BigDiffy

command used:

java -cp ratatool-0.2.5.jar com.spotify.ratatool.diffy.BigDiffy --mode=avro --key=MyKey --lhs=gs://mybucket/*.avro --rhs=gs://myotherbucket/*.avro --output=gs://myoutputbucket/output --runner=DataflowRunner

Exception:
[ForkJoinPool-1-worker-5] ERROR org.apache.beam.runners.dataflow.util.MonitoringUtil$LoggingHandler - 2018-03-26T14:05:53.829Z: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkState(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.beam.sdk.io.PatchedAvroSource.readMetadataFromFile(PatchedAvroSource.java:334)
at org.apache.beam.sdk.io.PatchedAvroSource$AvroReader.startReading(PatchedAvroSource.java:664)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:471)
at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
at com.google.cloud.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:579)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:347)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:183)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:148)
at com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:68)
at com.google.cloud.dataflow.worker.DataflowWorker.executeWork(DataflowWorker.java:336)
at com.google.cloud.dataflow.worker.DataflowWorker.doWork(DataflowWorker.java:294)
at com.google.cloud.dataflow.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:244)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:135)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:115)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:102)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
