tresata / spark-sorted
Secondary sort and streaming reduce for Apache Spark
License: Apache License 2.0
Hello,
I would like to use this project in an enterprise setting. Is it still up to date? Maybe I can add some features; I'm using Spark 3.3 at the moment.
Br
Nicolas
Hi, and great job.
Is there an artifact in Maven so I can use it in my project?
Thanks in advance,
X.
Hi, this looks like a great package - THANK YOU very much for your work here.
I had a question and wasn't sure where to ask it. I've been trying to solve the memory usage of a pattern like the one below.
Basically, while what I actually want is groupBy + sort + map, there isn't a sort-within-a-group that is fast enough for really big data sets, so I'm using repartition + sortWithinPartitions + mapPartitions to do the work instead. Which is pretty fast, somehow.
The downside of that approach is that although I need to compute something independently per group/"id", iterating the whole partition with mapPartitions forces me to hold the result of that computation in memory for EVERY group/"id" until I reach the end of the partition. We'll have a relatively large result per group/"id". That is my concern, and why I find this library promising.
My question is: when using GroupSorted(rdd, comparator).mapStreamByKey, does spark-sorted actually flush each group's results out of memory as it proceeds down the partition? I stared at the code for a couple of hours and couldn't see how mapStreamByKey manages each group's memory. Ideally, whatever a group returns from mapStreamByKey would be freed as it is put onto the stream between groups.
This would be a huge win for us.
Pseudo-code for the old way:
data.repartition($"id").sortWithinPartitions("id", "time").mapPartitions { partitionIterator =>
  var previousId: String = null
  val output = scala.collection.mutable.ListBuffer.empty[Row]
  partitionIterator.foreach { event =>
    val nextId = event.getAs[String]("id")
    if (nextId != previousId) {
      // put some stuff we learned while processing that ID stream on the partition into the output - ugly!
      previousId = nextId
    }
    // remember/buffer other state information while processing the ID so we have something to output when the ID changes
  }
  output.iterator
}(myRowEncoder)
Pseudo-code for the new way:
GroupSorted<String, Row> groupSortedRDD = new GroupSorted<>(pairRDD, new RowTimeComparator());
JavaRDD<Row> output = groupSortedRDD.mapStreamByKey(rowInGroupIterator -> {
    MyStateMachine myStateMachine = new MyStateMachine();
    rowInGroupIterator.forEachRemaining(row -> myStateMachine.receiveEvent(row));
    return myStateMachine.getOutput().iterator();
}).map(kvp -> kvp._2);
So ideally everything I return from myStateMachine.getOutput().iterator() would not stay in memory until the full partition walk is completed.
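For what it's worth, the memory behaviour being asked about can be illustrated with plain Scala iterators. This is only a sketch of the general streaming-by-key technique, not spark-sorted's actual implementation; `StreamingGroups` and its `mapStreamByKey` here are hypothetical names for illustration:

```scala
object StreamingGroups {
  // Sketch only - NOT spark-sorted's code. Assumes the input iterator is
  // sorted by key. Only one group's values are buffered at a time; once a
  // group's output has been emitted, nothing from that group is retained,
  // so earlier groups become eligible for garbage collection while later
  // groups stream past.
  def mapStreamByKey[K, V, W](it: Iterator[(K, V)])(f: Iterator[V] => Iterator[W]): Iterator[(K, W)] = {
    val buf = it.buffered
    new Iterator[(K, W)] {
      private var current: Iterator[(K, W)] = Iterator.empty
      private def advance(): Unit = {
        while (!current.hasNext && buf.hasNext) {
          val key = buf.head._1
          // materialize just this key's values, never the whole partition
          val values = scala.collection.mutable.ListBuffer.empty[V]
          while (buf.hasNext && buf.head._1 == key) values += buf.next()._2
          current = f(values.iterator).map(w => (key, w))
        }
      }
      def hasNext: Boolean = { advance(); current.hasNext }
      def next(): (K, W) = { advance(); current.next() }
    }
  }
}

// one group is in memory at a time; its result is emitted, then dropped
val sums = StreamingGroups.mapStreamByKey(
  Iterator(("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)))(vs => Iterator(vs.sum))
println(sums.toList)  // List((a,3), (b,3), (c,9))
```

If spark-sorted's mapStreamByKey follows this shape, each group's buffered values and emitted output are unreachable as soon as the consumer moves on to the next group, which is the behaviour the question is hoping for.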
Hi, I used a large data set with GroupSorted: 25 million raw records, 2 million rows after filtering, and the groupSort runs forever to no end. A jstack dump of the process seems to indicate it is spilling. The allocated memory should in theory be enough, or close to it. There are a lot of keys, but most keys have fewer than a hundred rows.
java.io.ObjectInputStream.defaultReadFields(java.lang.Object, java.io.ObjectStreamClass) @bci=105, line=1986 (Compiled frame)
Hi, in the following example mapStreamByKey produces duplicate keys:
val binRDD = sc.binaryFiles("file://...")
val pairs: RDD[(String, SomeCaseClass)] = binRDD.flatMap(parseBinaryStream)
val sorted: GroupSorted[String, SomeCaseClass] = pairs.groupSort(Some(implicitly[Ordering[SomeCaseClass]]))
val mapped: RDD[(String, AggValues)] = sorted.mapStreamByKey(aggregateSomeCaseClass)
mapped.collect.foreach(o => { /* write o._1 to a text file */ })
When I check the text file I get the following:
$ gzcat /tmp/all.txt.gz | wc -l
729109
$ gzcat /tmp/all.txt.gz | sort | uniq | wc -l
690618
But as far as I understand, mapStreamByKey should process each key exactly once.
Am I missing something?
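I can't say whether this is what is happening here, but one generic way duplicate keys can arise in any streaming-by-key scheme is when the user function does not fully drain the per-group iterator and the leftover values are treated as a new group. A deliberately naive sketch in plain Scala (hypothetical names, not spark-sorted's code) reproduces the symptom:

```scala
object NaiveStream {
  // Deliberately naive sketch - NOT spark-sorted's implementation. The
  // per-group iterator is a view over the shared partition iterator; if
  // the user function does not drain it, the outer loop sees the same key
  // again and emits it a second time.
  def mapStreamByKey[K, V, W](it: Iterator[(K, V)])(f: Iterator[V] => Iterator[W]): List[(K, W)] = {
    val buf = it.buffered
    val out = scala.collection.mutable.ListBuffer.empty[(K, W)]
    while (buf.hasNext) {
      val key = buf.head._1
      val values = new Iterator[V] {
        def hasNext: Boolean = buf.hasNext && buf.head._1 == key
        def next(): V = buf.next()._2
      }
      out ++= f(values).map(w => (key, w))
    }
    out.toList
  }
}

// a function that only reads the first value per group leaves values
// behind, so the key "a" is emitted twice:
val dup = NaiveStream.mapStreamByKey(
  Iterator(("a", 1), ("a", 2), ("b", 3)))(vs => Iterator(vs.next()))
println(dup)  // List((a,1), (a,2), (b,3))
```

So one thing worth checking is whether aggregateSomeCaseClass always consumes its input iterator to the end before returning.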
Code ...
val result = rdd
.groupSort(Ordering.by{ ...some logic... })
.scanLeftByKey(... initial value ...){ ... some logic ...}
.flatMap{ ... }.filter{ ... }
Exception ...
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.rdd.ShuffledRDD.&lt;init&gt;(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/Partitioner;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;)V
at com.tresata.spark.sorted.PairRDDFunctions.groupSort(PairRDDFunctions.scala:29)
at com.tresata.spark.sorted.PairRDDFunctions.groupSort(PairRDDFunctions.scala:48)
at spikes.MyJob$.main(MyJob.scala:15)
at spikes.MyJob.main(MyJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Spark version 1.5.2. Any ideas?
Hello,
Is it possible to provide a Java API for the library, or some example on how to use it from Java?
Using the Java API in 0.3.0, I inadvertently implemented a mapStreamByKey function that returned an iterator over a different (lower, sometimes zero) number of elements than the input iterator, i.e. I wrote a flatMap where I needed a regular map. This did not cause an error but produced an incorrect result, apparently terminating early so that many keys were never presented to the iterator. So this issue is in two parts: a) mapStreamByKey should fail with an appropriate exception if the input and output sizes do not match, and b) please can we have an equivalent flatMapStreamByKey function in the API. Thanks.
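As a plain-Scala illustration of the contract mismatch described above (no Spark involved): a map-style per-group function must emit exactly one output element per input element, whereas the function actually written had flatMap semantics, where the counts may legitimately differ:

```scala
val inGroup = List(1, 2, 3, 4)

// map-style: exactly one output element per input element
val mapStyle = inGroup.map(_ * 10)

// flatMap-style: the output count may differ from the input count -
// this is the contract a flatMapStreamByKey variant would make explicit
val flatStyle = inGroup.flatMap(x => if (x % 2 == 0) List(x) else Nil)

println(mapStyle)   // List(10, 20, 30, 40)
println(flatStyle)  // List(2, 4)
```

Distinguishing the two in the API, as requested, would let the library validate the one-to-one case and error out instead of silently misbehaving.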
Hi Tresata Team,
This is a wonderful and very useful Spark Package. Would you consider making an official release for this on the Spark Packages Repository so that Spark users can use this with ease? It would be very easy to make a release if you use the sbt-spark-package plugin.
Then users can use this library by simply adding the flag:
--packages tresata/spark-sorted:0.1
to spark-shell, spark-submit or even pyspark. In addition, your package will be ranked higher in the Spark Packages website for having a release!
Users that also use the sbt-spark-package plugin can simply add your dependency as
spDependencies += "tresata/spark-sorted:0.1"
in their build file.
I can happily submit a pull request for the plugin if you like!
Best,
Burak