
spark-sorted's People

Contributors

alno, brkyvz, koertkuipers


spark-sorted's Issues

Still up?

Hello,
I would like to use this project in an enterprise setting. Is it still maintained? Maybe I can add some features; I'm using Spark 3.3 at the moment.

Br
Nicolas

Question: Memory Performance

Hi, this looks like a great package - THANK YOU very much for your work here.

I had a question and wasn't sure where to ask: I've been trying to reduce the memory usage of a pattern like the one below.

Basically, what I really want is groupBy+sort+map, but since there is no sort-within-a-group that is fast enough for really big data sets, I've been using repartition+sortWithinPartitions+mapPartitions instead, which is surprisingly fast.

The downside of that approach is that although the computation is independent per group/"id", iterating the whole partition with mapPartitions forces me to hold the result of that computation in memory for EVERY group/"id" until I reach the end of the partition. Our per-group results are relatively large, so this is my concern, and it is why I find this library promising.

My question is: when using GroupSorted(rdd, comparator).mapStreamByKey, does spark-sorted actually flush the results of each group out of memory as it proceeds down the partition? I stared at the code for a couple of hours and couldn't see how mapStreamByKey manages the memory of each group. Ideally, whatever a group returns from mapStreamByKey would be freed as soon as it has been emitted onto the stream, before the next group starts.

This would be a huge win for us.

Pseudo-code for the old way:

data.repartition(col("id"))
  .sortWithinPartitions("id", "time")
  .mapPartitions { partitionIterator =>
    var previousId: String = null
    val output = scala.collection.mutable.ListBuffer.empty[Row]
    partitionIterator.foreach { event =>
      val nextId = event.getAs[String]("id")
      if (nextId != previousId) {
        // put some stuff we learned while processing that id's stream into the output - ugly!
        previousId = nextId
      }
      // remember/buffer other state information while processing the id so we
      // have something to output when the id changes
    }
    output.iterator
  }(myRowEncoder)

Pseudo-code for the new way:

GroupSorted<String, Row> groupSortedRDD = new GroupSorted<>(pairRDD, new RowTimeComparator());
JavaRDD<Row> output = groupSortedRDD
    .mapStreamByKey(rowInGroupIterator -> {
        MyStateMachine myStateMachine = new MyStateMachine();
        rowInGroupIterator.forEachRemaining(row -> myStateMachine.receiveEvent(row));
        return myStateMachine.getOutput().iterator();
    })
    .map(kvp -> kvp._2);

So ideally everything I return from myStateMachine.getOutput().iterator() would not stay in memory until the full partition walk is completed.
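
For reference, a rough Scala-API equivalent of the new way (a sketch only: the import path is inferred from the package name visible in a stack trace further down this page, the groupSort call follows usage shown in other issues, and MyStateMachine is the hypothetical per-group aggregator from the Java snippet above):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import com.tresata.spark.sorted.PairRDDFunctions._

// pairRDD: RDD[(String, Row)] keyed by "id"; values secondary-sorted by "time"
val output: RDD[Row] = pairRDD
  .groupSort(Some(Ordering.by((r: Row) => r.getAs[Long]("time"))))
  .mapStreamByKey { rowsForOneId =>
    val machine = new MyStateMachine()   // hypothetical per-group aggregator
    rowsForOneId.foreach(machine.receiveEvent)
    machine.getOutput.iterator           // emitted per group
  }
  .map(_._2)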

Need help finding out why spilling is so slow

Hi, I used GroupSorted on a large data set: 25 million raw records, 2 million rows after filtering, and the groupSort runs forever without finishing. A jstack dump of the process seems to indicate it is spilling. The allocated memory should, in theory, be enough or close to enough. There are a lot of keys, but most keys have fewer than a hundred rows.

  • java.io.ObjectInputStream.defaultReadFields(java.lang.Object, java.io.ObjectStreamClass) @bci=105, line=1986 (Compiled frame)

  • java.io.ObjectInputStream.readSerialData(java.lang.Object, java.io.ObjectStreamClass) @bci=173, line=1915 (Compiled frame)
  • java.io.ObjectInputStream.readOrdinaryObject(boolean) @bci=184, line=1798 (Compiled frame)
  • java.io.ObjectInputStream.readObject0(boolean) @bci=389, line=1350 (Compiled frame)
  • java.io.ObjectInputStream.defaultReadFields(java.lang.Object, java.io.ObjectStreamClass) @bci=150, line=1990 (Compiled frame)
  • java.io.ObjectInputStream.readSerialData(java.lang.Object, java.io.ObjectStreamClass) @bci=173, line=1915 (Compiled frame)
  • java.io.ObjectInputStream.readOrdinaryObject(boolean) @bci=184, line=1798 (Compiled frame)
  • java.io.ObjectInputStream.readObject0(boolean) @bci=389, line=1350 (Compiled frame)
  • java.io.ObjectInputStream.readObject() @bci=19, line=370 (Compiled frame)
  • org.apache.spark.serializer.JavaDeserializationStream.readObject(scala.reflect.ClassTag) @bci=4, line=68 (Compiled frame)
  • org.apache.spark.util.collection.ExternalSorter$SpillReader.org$apache$spark$util$collection$ExternalSorter$SpillReader$$readNextItem() @bci=28, line=598 (Compiled frame)
  • org.apache.spark.util.collection.ExternalSorter$SpillReader$$anon$5.hasNext() @bci=18, line=628 (Compiled frame)
  • scala.collection.Iterator$$anon$1.hasNext() @bci=11, line=847 (Compiled frame)
  • org.apache.spark.util.collection.ExternalSorter$$anon$2.next() @bci=29, line=427 (Compiled frame)
  • org.apache.spark.util.collection.ExternalSorter$$anon$2.next() @bci=1, line=418 (Compiled frame)
  • scala.collection.Iterator$$anon$13.next() @bci=20, line=372 (Compiled frame)
  • scala.collection.Iterator$$anon$11.next() @bci=8, line=328 (Compiled frame)
  • scala.collection.Iterator$$anon$1.next() @bci=23, line=853 (Compiled frame)
  • scala.collection.Iterator$$anon$1.head() @bci=9, line=840 (Compiled frame)
  • GroupSorted$$anonfun$mapStreamByKey$1$$anon$2.hasNext() @bci=16, line=33 (Compiled frame)
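
One thing the trace does show: every spilled record is being read back through Java serialization (JavaDeserializationStream), which is slow per record. Switching the job to Kryo is standard Spark tuning and might help here, though that is an assumption; a minimal sketch, with MyRecord standing in for the actual record class:

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: String, time: Long)  // stand-in for the real record type

val conf = new SparkConf()
  .setAppName("group-sort-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registering classes lets Kryo write numeric ids instead of full class names
  .registerKryoClasses(Array(classOf[MyRecord]))
val sc = new SparkContext(conf)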

Duplicate keys in mapStreamByKey (spark-sorted: 0.3; scala: 2.10.4; spark: 1.3.1)

Hi, in the following example mapStreamByKey produces duplicate keys:

val binRDD = sc.binaryFiles("file://...")
val pairs: RDD[(String, SomeCaseClass)] = binRDD.flatMap(parseBinaryStream)
val sorted: GroupSorted[String, SomeCaseClass] = pairs.groupSort(Some(implicitly[Ordering[SomeCaseClass]]))
val mapped: RDD[(String, AggValues)] = sorted.mapStreamByKey(aggregateSomeCaseClass)
mapped.collect.foreach(o => { /* write o._1 to a text file */ })

When I check the text file, I get the following:
$ gzcat /tmp/all.txt.gz | wc -l
729109
$ gzcat /tmp/all.txt.gz | sort | uniq | wc -l
690618

But as far as I understand, mapStreamByKey should process each key exactly once.
Am I missing something?
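
One way to narrow this down without going through the text file (a sketch that reuses pairs and mapped from the snippet above, and assumes aggregateSomeCaseClass emits one value per key):

// if mapStreamByKey processed each key exactly once, the number of output
// rows should equal the number of distinct output keys...
val outputRows   = mapped.count()
val distinctKeys = mapped.keys.distinct().count()

// ...and both should equal the number of distinct input keys
val inputKeys = pairs.keys.distinct().count()

println(s"output rows: $outputRows, distinct output keys: $distinctKeys, distinct input keys: $inputKeys")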

Getting a strange exception...

Code ...
val result = rdd
  .groupSort(Ordering.by{ ...some logic... })
  .scanLeftByKey(... initial value ...){ ... some logic ... }
  .flatMap{ ... }
  .filter{ ... }

Exception ...
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.rdd.ShuffledRDD.<init>(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/Partitioner;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;)V
at com.tresata.spark.sorted.PairRDDFunctions.groupSort(PairRDDFunctions.scala:29)
at com.tresata.spark.sorted.PairRDDFunctions.groupSort(PairRDDFunctions.scala:48)
at spikes.MyJob$.main(MyJob.scala:15)
at spikes.MyJob.main(MyJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Spark version 1.5.2. Any ideas?

Java API for GroupSorted

Hello,

Is it possible to provide a Java API for the library, or an example of how to use it from Java?

mapStreamByKey fails silently if treated as flatMap

Using the Java API in 0.3.0, I inadvertently implemented a mapStreamByKey function that returned an Iterator over a different (lower, sometimes zero) number of elements than the input iterator, i.e. I wrote a flatMap where I needed a regular map. This did not cause an error, but it produced an incorrect result, seemingly terminating early so that many keys were never presented to the iterator. So this issue has two parts: a) mapStreamByKey should fail with an appropriate exception if the input and output sizes do not match, and b) please can we have an equivalent flatMapStreamByKey function in the API. Thanks.
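
For concreteness, a sketch of the two shapes involved, written against the Scala API for brevity (the groupSort usage follows other issues on this page; pairs: RDD[(String, SomeCaseClass)] and transform are placeholders):

// map-shaped: one output element per input element
val mappedOk = pairs
  .groupSort(Some(implicitly[Ordering[SomeCaseClass]]))
  .mapStreamByKey(values => values.map(transform))

// flatMap-shaped (the misuse described above): fewer output elements than input,
// sometimes zero per key, which reportedly produced silently incorrect results
// rather than an error
val truncated = pairs
  .groupSort(Some(implicitly[Ordering[SomeCaseClass]]))
  .mapStreamByKey(values => values.take(1))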

sbt-spark-package integration and Spark Packages release

Hi Tresata Team,

This is a wonderful and very useful Spark Package. Would you consider making an official release for this on the Spark Packages Repository so that Spark users can use this with ease? It would be very easy to make a release if you use the sbt-spark-package plugin.
Then users can use this library by simply adding the flag:
--packages tresata/spark-sorted:0.1
to spark-shell, spark-submit, or even pyspark. In addition, your package will be ranked higher on the Spark Packages website for having a release!

Users that also use the sbt-spark-package plugin can simply add your dependency as
spDependencies += "tresata/spark-sorted:0.1"
in their build file.

I can happily submit a pull request for the plugin if you like!

Best,
Burak
