tresata / spark-sorted
Secondary sort and streaming reduce for Apache Spark
License: Apache License 2.0
Hello,
I would like to use this project in an enterprise setting. Is it still up to date? Maybe I can add some features; I'm using Spark 3.3 at the moment.
Br
Nicolas
Hi, and great job.
Is there an artifact in Maven so I can use it in my project?
Thanks in advance,
X.
Hi, this looks like a great package - THANK YOU very much for your work here.
I had a question and wasn't sure where to ask it. I've been trying to solve the memory usage of a pattern like the one below.
Basically, while what I actually want is groupBy + sort + map, there isn't a sort-within-a-group that is fast enough for really big data sets, so I'm using repartition + sortWithinPartitions + mapPartitions to do the work instead. Which is pretty fast, somehow.
The downside of that approach is that although I need to compute something independently per group/"id", iterating the whole partition with mapPartitions forces me to hold the result of that computation in memory for EVERY group/"id" until I reach the end of the partition. We'll have a relatively large result per group/"id". That is my concern, and why I find this library promising.
My question is: when using GroupSorted(rdd, comparator).mapStreamByKey, does spark-sorted actually flush each group's results out of memory as it proceeds down the partition? I stared at the code for a couple of hours and couldn't see how mapStreamByKey manages each group's memory. Ideally, whatever a group returns from mapStreamByKey would be freed as it is put onto the stream between groups.
This would be a huge win for us.
Pseudo-code for the old way:
data.repartition($"id").sortWithinPartitions("id", "time").mapPartitions { partitionIterator =>
  var previousId: String = null
  val output = scala.collection.mutable.ListBuffer.empty[Row]
  partitionIterator.foreach { event =>
    val nextId = event.getAs[String]("id")
    if (nextId != previousId) {
      // put some stuff we learned while processing that ID stream on the partition into the output - ugly!
      previousId = nextId
    }
    // remember/buffer other state information while processing the ID so we have something to output when the ID changes
  }
  output.iterator
}(myRowEncoder)
Pseudo-code for the new way:
GroupSorted<String, Row> groupSortedRDD = new GroupSorted<>(pairRDD, new RowTimeComparator());
JavaRDD<Row> output = groupSortedRDD.mapStreamByKey(rowInGroupIterator -> {
    MyStateMachine myStateMachine = new MyStateMachine();
    rowInGroupIterator.forEachRemaining(row -> myStateMachine.receiveEvent(row));
    return myStateMachine.getOutput().iterator();
}).map(kvp -> kvp._2);
So ideally everything I return from myStateMachine.getOutput().iterator() would not stay in memory until the full partition walk is completed.
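For what it's worth, the memory behaviour being asked about can be illustrated with plain Scala iterators. This is only a sketch of the general streaming-by-key technique, not spark-sorted's actual implementation; `StreamingGroups` and its `mapStreamByKey` here are hypothetical names for illustration:

```scala
object StreamingGroups {
  // Sketch only - NOT spark-sorted's code. Assumes the input iterator is
  // sorted by key. Only one group's values are buffered at a time; once a
  // group's output has been emitted, nothing from that group is retained,
  // so earlier groups become eligible for garbage collection while later
  // groups stream past.
  def mapStreamByKey[K, V, W](it: Iterator[(K, V)])(f: Iterator[V] => Iterator[W]): Iterator[(K, W)] = {
    val buf = it.buffered
    new Iterator[(K, W)] {
      private var current: Iterator[(K, W)] = Iterator.empty
      private def advance(): Unit = {
        while (!current.hasNext && buf.hasNext) {
          val key = buf.head._1
          // materialize just this key's values, never the whole partition
          val values = scala.collection.mutable.ListBuffer.empty[V]
          while (buf.hasNext && buf.head._1 == key) values += buf.next()._2
          current = f(values.iterator).map(w => (key, w))
        }
      }
      def hasNext: Boolean = { advance(); current.hasNext }
      def next(): (K, W) = { advance(); current.next() }
    }
  }
}

// one group is in memory at a time; its result is emitted, then dropped
val sums = StreamingGroups.mapStreamByKey(
  Iterator(("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)))(vs => Iterator(vs.sum))
println(sums.toList)  // List((a,3), (b,3), (c,9))
```

If spark-sorted's mapStreamByKey follows this shape, each group's buffered values and emitted output are unreachable as soon as the consumer moves on to the next group, which is the behaviour the question is hoping for.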
Hi, I used a large data set with GroupSorted: 25 million raw records, 2 million rows after filtering, and the groupSort runs forever to no end. A jstack dump of the process seems to indicate it is spilling. The allocated memory should in theory be enough, or close to it. There are a lot of keys, but most keys have fewer than a hundred rows.
java.io.ObjectInputStream.defaultReadFields(java.lang.Object, java.io.ObjectStreamClass) @bci=105, line=1986 (Compiled frame)
Hi, in the following example mapStreamByKey produces duplicate keys:
val binRDD = sc.binaryFiles("file://...")
val pairs: RDD[(String, SomeCaseClass)] = binRDD.flatMap(parseBinaryStream)
val sorted: GroupSorted[String, SomeCaseClass] = pairs.groupSort(Some(implicitly[Ordering[SomeCaseClass]]))
val mapped: RDD[(String, AggValues)] = sorted.mapStreamByKey(aggregateSomeCaseClass)
mapped.collect.foreach(o => { /* write o._1 to a text file */ })
When I check the text file I get the following:
$ gzcat /tmp/all.txt.gz | wc -l
729109
$ gzcat /tmp/all.txt.gz | sort | uniq | wc -l
690618
But as far as I understand, mapStreamByKey should process each key exactly once.
Am I missing something?
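I can't say whether this is what is happening here, but one generic way duplicate keys can arise in any streaming-by-key scheme is when the user function does not fully drain the per-group iterator and the leftover values are treated as a new group. A deliberately naive sketch in plain Scala (hypothetical names, not spark-sorted's code) reproduces the symptom:

```scala
object NaiveStream {
  // Deliberately naive sketch - NOT spark-sorted's implementation. The
  // per-group iterator is a view over the shared partition iterator; if
  // the user function does not drain it, the outer loop sees the same key
  // again and emits it a second time.
  def mapStreamByKey[K, V, W](it: Iterator[(K, V)])(f: Iterator[V] => Iterator[W]): List[(K, W)] = {
    val buf = it.buffered
    val out = scala.collection.mutable.ListBuffer.empty[(K, W)]
    while (buf.hasNext) {
      val key = buf.head._1
      val values = new Iterator[V] {
        def hasNext: Boolean = buf.hasNext && buf.head._1 == key
        def next(): V = buf.next()._2
      }
      out ++= f(values).map(w => (key, w))
    }
    out.toList
  }
}

// a function that only reads the first value per group leaves values
// behind, so the key "a" is emitted twice:
val dup = NaiveStream.mapStreamByKey(
  Iterator(("a", 1), ("a", 2), ("b", 3)))(vs => Iterator(vs.next()))
println(dup)  // List((a,1), (a,2), (b,3))
```

So one thing worth checking is whether aggregateSomeCaseClass always consumes its input iterator to the end before returning.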
Code ...
val result = rdd
.groupSort(Ordering.by{ ...some logic... })
.scanLeftByKey(... initial value ...){ ... some logic ...}
.flatMap{ ... }.filter{ ... }
Exception ...
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.rdd.ShuffledRDD.&lt;init&gt;(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/Partitioner;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;)V
at com.tresata.spark.sorted.PairRDDFunctions.groupSort(PairRDDFunctions.scala:29)
at com.tresata.spark.sorted.PairRDDFunctions.groupSort(PairRDDFunctions.scala:48)
at spikes.MyJob$.main(MyJob.scala:15)
at spikes.MyJob.main(MyJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Spark version 1.5.2. Any ideas?
Hello,
Is it possible to provide a Java API for the library, or some example on how to use it from Java?
Using the Java API in 0.3.0, I inadvertently implemented a mapStreamByKey function that returned an iterator over a different (lower, sometimes zero) number of elements than the input iterator, i.e. I wrote a flatMap where I needed a regular map. This did not cause an error but produced an incorrect result, apparently terminating early so that many keys were never presented to the iterator. So this issue is in two parts: a) mapStreamByKey should fail with an appropriate exception if the input and output sizes do not match, and b) please can we have an equivalent flatMapStreamByKey function in the API. Thanks.
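As a plain-Scala illustration of the contract mismatch described above (no Spark involved): a map-style per-group function must emit exactly one output element per input element, whereas the function actually written had flatMap semantics, where the counts may legitimately differ:

```scala
val inGroup = List(1, 2, 3, 4)

// map-style: exactly one output element per input element
val mapStyle = inGroup.map(_ * 10)

// flatMap-style: the output count may differ from the input count -
// this is the contract a flatMapStreamByKey variant would make explicit
val flatStyle = inGroup.flatMap(x => if (x % 2 == 0) List(x) else Nil)

println(mapStyle)   // List(10, 20, 30, 40)
println(flatStyle)  // List(2, 4)
```

Distinguishing the two in the API, as requested, would let the library validate the one-to-one case and error out instead of silently misbehaving.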
Hi Tresata Team,
This is a wonderful and very useful Spark Package. Would you consider making an official release for this on the Spark Packages Repository so that Spark users can use this with ease? It would be very easy to make a release if you use the sbt-spark-package plugin.
Then users can use this library by simply adding the flag:
--packages tresata/spark-sorted:0.1
to spark-shell, spark-submit or even pyspark. In addition, your package will be ranked higher in the Spark Packages website for having a release!
Users that also use the sbt-spark-package plugin can simply add your dependency as
spDependencies += "tresata/spark-sorted:0.1"
in their build file.
I can happily submit a pull request for the plugin if you like!
Best,
Burak