spotify / ratatool
A tool for data sampling, data generation, and data diffing
License: Apache License 2.0
The links to different parts of the repository in the README have been broken since the project structure was refactored.
The README should have clear examples and usage instructions. Currently the output is hard to understand without knowing the code.
This would allow deeper statistical analysis for users who need it, e.g. building a histogram of distances per key for field foo.
Currently if a field is REPEATED and specified in unordered, the list is sorted by doing .toString. This works for some cases but breaks down when there is an inner field which is repeated and should also be unordered.
Example:
first_repeated: [
{key: firstval, second_repeated:[a, b, c]},
{key: secondval, second_repeated:[1,2,3]}
]
diffed with
first_repeated: [
{key: secondval, second_repeated:[2,1,3]},
{key: firstval, second_repeated:[a, c, b]}
]
In this case, assuming unordered = Set("first_repeated", "first_repeated.second_repeated"), the current behaviour on first_repeated is to convert with .toString and then produce a Delta, even though a user would expect both sides to resolve to the same thing.
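One possible direction is to canonicalize nested repeated fields recursively, sorting the inner lists before the outer one, so both sides resolve to the same value. A minimal sketch with a toy case class standing in for a record (Record, canonicalize, and the field names are made up for illustration, not ratatool APIs):

```scala
// Hypothetical sketch: sort inner repeated fields first, then the outer
// list, instead of a single top-level .toString sort.
case class Record(key: String, secondRepeated: Seq[String])

def canonicalize(rs: Seq[Record]): Seq[Record] =
  rs.map(r => r.copy(secondRepeated = r.secondRepeated.sorted)) // inner sort
    .sortBy(_.toString)                                         // then outer sort

val lhs = Seq(
  Record("firstval", Seq("a", "b", "c")),
  Record("secondval", Seq("1", "2", "3")))
val rhs = Seq(
  Record("secondval", Seq("2", "1", "3")),
  Record("firstval", Seq("a", "c", "b")))
// canonicalize(lhs) == canonicalize(rhs), so no Delta would be produced
```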
https://github.com/spotify/ratatool/pull/49/files already contains an example for Gen[T <: SpecificRecord]. We should add another example for TableRow.
Currently the CLI does not support Protobuf. ProtobufDiffy needs to pull in the descriptor implicitly so that it can crawl all fields, so we need a way to provide this information from the CLI if we are to support it.
Currently the examples are fairly straightforward and only show one possible method of using the generators. We could refer to other tests as examples for testing with pipelines: https://github.com/spotify/ratatool/blob/master/ratatool-sampling/src/test/scala/com/spotify/ratatool/samplers/BigSamplerTest.scala#L211
We could possibly construct similar tests for the examples module, but it may be overly complicated to do so.
Look into cleaning up Beam dependency and using Kryo instead of CoderUtils in tests and what the effect of this will be.
This is needed for performance reasons, e.g. when sampling a large number of records to file.
Currently protobuf is compiled with protoc -v261 despite depending on the latest protoc-jar (3.3.0.1). Between compiler versions v2 and v3, the internal representation of a generic message changed from GeneratedMessage to GeneratedMessageV3, which means functions that expect evidence of that type will fail when compiled with v3. For example, protobufOf expects GeneratedMessage, so users of ratatool compiling with v3 will not be able to use this function. Both GeneratedMessage and GeneratedMessageV3 extend AbstractMessage. The suggestion is to support both compiler versions by requiring evidence of AbstractMessage.
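The suggested bound can be sketched with local stand-in classes mirroring the protobuf hierarchy (these stubs and the versionOf helper are made up for illustration, not the real com.google.protobuf types): a bound on the common parent accepts messages generated by either compiler version.

```scala
// Stand-ins for the real com.google.protobuf class hierarchy.
abstract class AbstractMessage { def version: Int }
class GeneratedMessage extends AbstractMessage { def version = 2 }   // protoc v2 parent
class GeneratedMessageV3 extends AbstractMessage { def version = 3 } // protoc v3 parent

// Bounding on AbstractMessage accepts messages from both compiler versions,
// whereas a T <: GeneratedMessage bound would reject v3-generated classes.
def versionOf[T <: AbstractMessage](msg: T): Int = msg.version
```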
Calling AvroGenerator.avroOf[T] repeatedly will sometimes generate two identical results. Not sure if this qualifies as a pure bug, but there is a great risk of it leading to flaky test cases for those who are unaware of it. The cause is likely that the underlying RandomData is by default seeded with System.currentTimeMillis, which may sometimes not change between fast subsequent calls.
Example:
val t1 = System.currentTimeMillis
val element1 = AvroGenerator.avroOf[StartContext]
val t2 = System.currentTimeMillis
val element2 = AvroGenerator.avroOf[StartContext]
val t3 = System.currentTimeMillis
val element3 = AvroGenerator.avroOf[StartContext]
println(s"**** t1: $t1, t2: $t2, t3: $t3")
println(s"**** e1 == e2: ${element1 == element2}, e2 == e3: ${element2 == element3}")
println(s"**** e1 eq e2: ${element1 eq element2}, e2 eq e3: ${element2 eq element3}")
Produces the following output:
**** t1: 1503492962231, t2: 1503492962500, t3: 1503492962500
**** e1 == e2: false, e2 == e3: true
**** e1 eq e2: false, e2 eq e3: false
The issue can be worked around by inserting a Thread.sleep(1) between the calls. Suggested solutions are to either allow the caller to pass in a seed, or to add a generalized version of AvroGenerator.avroOf[T] that returns a sequence of items, using the underlying RandomData.iterator().
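The seed suggestion can be illustrated with scala.util.Random standing in for the underlying RandomData (the real avroOf internals differ; genInts is a made-up name): a caller-supplied seed makes output reproducible and removes the dependency on the wall clock entirely.

```scala
// Sketch: a caller-supplied seed instead of System.currentTimeMillis.
def genInts(seed: Long, n: Int): Seq[Int] = {
  val rnd = new scala.util.Random(seed)
  Seq.fill(n)(rnd.nextInt())
}

val a = genInts(seed = 42L, n = 3)
val b = genInts(seed = 42L, n = 3)
// a == b: the same seed reproduces the sequence regardless of call timing,
// and distinct elements come from one generator rather than two re-seeded ones.
```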
Similar to the existing implicits for TableRow and SpecificRecord. Potentially expensive because of how writing to a Proto output stream works.
Should be fairly simple for the existing Avro, TableRow, and Protobuf generators. Either verify that they already work, or implement if not.
In the past, creating a sample of either ProtoBuf or Avro types followed the same pattern:
val r = ProtoBufGenerator.protoBufOf[Foo]
and
val r = AvroGenerator.avroOf[Foo]
This has changed in v0.3.0-beta1 to:
val r = ProtoBufGeneratorOps.protoBufOf[Foo].sample.get
and
val r = AvroGeneratorOps.specificRecordOf[Foo].sample.get
where the methods on generator objects now diverge. It would be nice if access patterns remained consistent.
Case:
Gen[T] is a function on values from Gen[U], and the user needs to be able to specify/guarantee this behaviour.
Exception in thread "main" java.lang.IllegalAccessError: tried to access class org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl from class com.spotify.ratatool.samplers.BigSamplerBigQuery$
at com.spotify.ratatool.samplers.BigSamplerBigQuery$.sampleBigQueryTable(BigSampler.scala:427)
at com.spotify.ratatool.samplers.BigSampler$.singleInput(BigSampler.scala:142)
at com.spotify.ratatool.samplers.BigSampler$.main(BigSampler.scala:162)
at com.spotify.ratatool.samplers.BigSampler.main(BigSampler.scala)
ratatoolCommon currently has these two as dependencies:
"com.google.cloud.bigdataoss" % "gcs-connector" % gcsVersion,
"org.apache.hadoop" % "hadoop-client" % hadoopVersion exclude ("org.slf4j", "slf4j-log4j12"),
These should be pulled out and moved to only the libs that need them, as I don't think they are actually required in common, and they pull in Hadoop and other fairly large dependencies.
Currently we are on 0.21 while Scio is on 0.20. We should investigate downgrading and any potential issues; the concern is that the mismatch may cause unexpected errors when running.
INFO: Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: File `gs://endsong.ap.raw-events.spotify.net/com.spotify.LogRecord.ap.EndSong/2018-03-23/10/part-54337a5a9386bf99cf9a0ee5213a6996-000050.avro` does not exist!
INFO: at scala.Predef$.require(Predef.scala:277)
INFO: at com.spotify.ratatool.io.AvroIO$.getAvroSchemaFromFile(AvroIO.scala:101)
Those classes extend IndexedRecord but not GenericRecord.
Includes initial seed setting, which can potentially enable #81. Also potentially a Uniform distribution.
An example of this could be:
.amend(intGen)(m => i => {
  m.setField1(i)
  m.setField2(dependentIntFunc(i))
})
would become
.amend(intGen)(_.setField1, _.setField2)
https://github.com/google/farmhash is available in Guava; this would allow us to use built-in hashing in BigQuery (https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#farm_fingerprint).
Requires some investigation. A first pass can simply be a clean API to compose a set of Gen.oneOf. It can potentially become more complex: given types of distributions, randomly generating from them.
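As a rough illustration of composing weighted choices (frequency here is a made-up helper over scala.util.Random, not ScalaCheck's or ratatool's API), a first-pass combinator might look like:

```scala
// Sketch: pick one of several values with integer weights.
def frequency[T](rnd: scala.util.Random)(weighted: (Int, T)*): T = {
  val total = weighted.map(_._1).sum
  var pick = rnd.nextInt(total)
  // Walk the weights until the running total passes the random pick.
  weighted.find { case (w, _) => pick -= w; pick < 0 }.get._2
}

val rnd = new scala.util.Random(0)
// Roughly 3:1 odds between "a" and "b" over many draws.
val draw = frequency(rnd)((3, "a"), (1, "b"))
```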
This should make it easier to onboard new users by referring them to the documentation.
Similar to Avro, make all other Gen[T] deterministic.
Right now there's a one-liner description of BigDiffy in the README. The class is in the release assembly jar, but this is not obvious to users since the mainClass is Tool and one has to run java -cp ratatool.jar ...BigDiffy... We should improve this.
As an example, if we have a schema where the fields country and city have an implicit relationship (a city should be within its country), currently we have to pass the Gen[Country] around and map to a City during the amend, or we have to create a Gen[(Country, City)] and set the relevant fields using that. This isn't terrible but maybe can be improved.
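The Gen[(Country, City)] pattern might look like this (toy types, a plain function in place of a real Gen, and a fixed pair list, all purely illustrative):

```scala
// Sketch: generate a consistent (Country, City) pair so the city always
// belongs to its country, instead of amending the two fields independently.
case class Country(name: String)
case class City(name: String)

val validPairs = Seq(
  Country("Sweden") -> City("Stockholm"),
  Country("USA")    -> City("New York"))

def countryCityGen(rnd: scala.util.Random): (Country, City) =
  validPairs(rnd.nextInt(validPairs.size))
```

A single generator for the pair keeps the relationship intact by construction; the amend step then sets both fields from one draw.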
Generators, Diffy, Sampling should be split to keep dependencies separate
We can derive Diffy[T] for case classes using shapeless or magnolia, similar to this: https://github.com/nevillelyh/shapeless-datatype/tree/neville/diff/core/src/main/scala/shapeless/datatype/diff. Also refactor Diffy[T] for Avro/BigQuery/Protobuf, etc. to share the same interface.
Mostly it is pulled in to maintain consistency across libs, but this potentially causes issues. Should investigate further.
Now that we have READMEs in almost all project-level directories, the top-level README should be updated to direct readers there instead of at the code.
Currently it is mainly a wrapper for the sampler, but the jar can be directly invoked to run BigDiffy or BigSampler. It would be better to have the CLI wrap all of this functionality so it is easier to use.
command used:
java -cp ratatool-0.2.5.jar com.spotify.ratatool.diffy.BigDiffy --mode=avro --key=MyKey --lhs=gs://mybucket/*.avro --rhs=gs://myotherbucket/*.avro --output=gs://myoutputbucket/output --runner=DataflowRunner
Exception:
[ForkJoinPool-1-worker-5] ERROR org.apache.beam.runners.dataflow.util.MonitoringUtil$LoggingHandler - 2018-03-26T14:05:53.829Z: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkState(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.beam.sdk.io.PatchedAvroSource.readMetadataFromFile(PatchedAvroSource.java:334)
at org.apache.beam.sdk.io.PatchedAvroSource$AvroReader.startReading(PatchedAvroSource.java:664)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:471)
at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
at com.google.cloud.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:579)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:347)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:183)
at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:148)
at com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:68)
at com.google.cloud.dataflow.worker.DataflowWorker.executeWork(DataflowWorker.java:336)
at com.google.cloud.dataflow.worker.DataflowWorker.doWork(DataflowWorker.java:294)
at com.google.cloud.dataflow.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:244)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:135)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:115)
at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:102)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Also involves updating Travis. Removing in favour of sbt pack as a way to simplify dependency issues and more cleanly address #72.
The current examples were written prior to amend2; we can update them to show the distinction, or potentially both ways of setting dependent fields across multiple records.