
Comments (16)

dongx-psu commented on September 1, 2024

micvbang commented on September 1, 2024

Thanks for the fast response!

Alright, I will try without building the index.

What do you mean that something is wrong with the final predicate? It is supposed to be an inequality condition on integers :-)

I am using the 1.6 branch since I started out using that before the current master branch became available.

dongx-psu commented on September 1, 2024

dongx-psu commented on September 1, 2024

micvbang commented on September 1, 2024

Here is what I get from calling .toDebugString() on my results RDD.

(1184) MapPartitionsRDD[34] at javaToPython at NativeMethodAccessorImpl.java:-2 []
  |    MapPartitionsRDD[33] at javaToPython at NativeMethodAccessorImpl.java:-2 []
  |    MapPartitionsRDD[32] at javaToPython at NativeMethodAccessorImpl.java:-2 []
  |    ZippedPartitionsRDD2[31] at javaToPython at NativeMethodAccessorImpl.java:-2 []
  |    MapPartitionsRDD[27] at javaToPython at NativeMethodAccessorImpl.java:-2 []
  |    ShuffledRDD[26] at javaToPython at NativeMethodAccessorImpl.java:-2 []
  +-(210) MapPartitionsRDD[25] at javaToPython at NativeMethodAccessorImpl.java:-2 []
      |   MapPartitionsRDD[23] at javaToPython at NativeMethodAccessorImpl.java:-2 []
      |   ShuffledRDD[18] at javaToPython at NativeMethodAccessorImpl.java:-2 []
      +-(6) MapPartitionsRDD[15] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[12] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[11] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[8] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[7] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
         |  file:///tmp/spark/data/gdelt_1m.json HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
  |    MapPartitionsRDD[30] at javaToPython at NativeMethodAccessorImpl.java:-2 []
  |    ShuffledRDD[29] at javaToPython at NativeMethodAccessorImpl.java:-2 []
  +-(210) MapPartitionsRDD[28] at javaToPython at NativeMethodAccessorImpl.java:-2 []
      |   MapPartitionsRDD[24] at javaToPython at NativeMethodAccessorImpl.java:-2 []
      |   ShuffledRDD[22] at javaToPython at NativeMethodAccessorImpl.java:-2 []
      +-(6) MapPartitionsRDD[19] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[14] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[13] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[10] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[9] at javaToPython at NativeMethodAccessorImpl.java:-2 []
         |  MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
         |  file:///tmp/spark/data/gdelt_1m.json HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
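
For reference, the Scala equivalent of the call that produced this would be something like the following (df being the result DataFrame; the original run was from PySpark, hence the javaToPython stages):

// Print the lineage of the query result's underlying RDD.
println(df.rdd.toDebugString)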

dongx-psu commented on September 1, 2024

micvbang commented on September 1, 2024

Ah right, I'm sorry!

== Parsed Logical Plan ==                                                       
'Project [unresolvedalias('ARRAY('l.id,'r.id))]
+- 'Filter ('l.id < 'r.id)
   +- 'Join DistanceJoin, Some( (pointwrapperexpression('r.euc_lon,'r.euc_lat)) IN CIRCLERANGE (pointwrapperexpression('l.euc_lon,'l.euc_lat)) within (500) )
      :- 'UnresolvedRelation `input_table`, Some(l)
      +- 'UnresolvedRelation `input_table`, Some(r)

== Analyzed Logical Plan ==
_c0: array<string>
Project [array(id#3,id#12) AS _c0#18]
+- Filter (id#3 < id#12)
   +- Join DistanceJoin, Some( (pointwrapperexpression(euc_lon#11,euc_lat#10)) IN CIRCLERANGE (pointwrapperexpression(euc_lon#2,euc_lat#1)) within (500) )
      :- Subquery l
      :  +- Subquery input_table
      :     +- Relation[date#0,euc_lat#1,euc_lon#2,id#3,merc_lat#4,merc_lon#5,num_articles#6,num_mentions#7,num_sources#8] JSONRelation
      +- Subquery r
         +- Subquery input_table
            +- Relation[date#9,euc_lat#10,euc_lon#11,id#12,merc_lat#13,merc_lon#14,num_articles#15,num_mentions#16,num_sources#17] JSONRelation

== Optimized Logical Plan ==
Project [array(id#3,id#12) AS _c0#18]
+- Filter (id#3 < id#12)
   +- Join DistanceJoin, Some( (pointwrapperexpression(euc_lon#11,euc_lat#10)) IN CIRCLERANGE (pointwrapperexpression(euc_lon#2,euc_lat#1)) within (500) )
      :- Relation[date#0,euc_lat#1,euc_lon#2,id#3,merc_lat#4,merc_lon#5,num_articles#6,num_mentions#7,num_sources#8] JSONRelation
      +- Relation[date#9,euc_lat#10,euc_lon#11,id#12,merc_lat#13,merc_lon#14,num_articles#15,num_mentions#16,num_sources#17] JSONRelation

== Physical Plan ==
Project [array(id#3,id#12) AS _c0#18]
+- Filter (id#3 < id#12)
   +- DJSpark pointwrapperexpression(euc_lon#2,euc_lat#1), pointwrapperexpression(euc_lon#11,euc_lat#10), 500
      :- ConvertToSafe
      :  +- Scan JSONRelation[date#0,euc_lat#1,euc_lon#2,id#3,merc_lat#4,merc_lon#5,num_articles#6,num_mentions#7,num_sources#8] InputPaths: 
      +- ConvertToSafe
         +- Scan JSONRelation[date#9,euc_lat#10,euc_lon#11,id#12,merc_lat#13,merc_lon#14,num_articles#15,num_mentions#16,num_sources#17] InputPaths:

dongx-psu commented on September 1, 2024

micvbang commented on September 1, 2024

Yep, I'm still experiencing the same problem.

I tried porting the code to Scala and the same thing happens. Hmm.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object GdeltTest {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("gdelt_test").setMaster("spark://127.0.0.1:7077")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // Load the GDELT sample and register it for SQL queries.
    val df1 = sqlContext.read.json("/tmp/spark/data/gdelt_1m.json")
    df1.registerTempTable("input_table")

    // Self distance join: all pairs of points within distance 1000;
    // l.id < r.id keeps each unordered pair only once.
    val sqlQuery = "SELECT ARRAY(l.id, r.id) ids FROM input_table l DISTANCE JOIN input_table r ON POINT(r.euc_lon, r.euc_lat) IN CIRCLERANGE(POINT(l.euc_lon, l.euc_lat), 1000) WHERE l.id < r.id"
    val df = sqlContext.sql(sqlQuery)
    println(df.queryExecution)

    val r = df.collect()
    println(r.length)
    sc.stop()
  }
}

If I go down to 256k rows of my original 1m-row dataset, Spark starts throwing OutOfMemory exceptions after approximately 1 hour.

I started the Scala jobs on the machine described in the first post, with the following configuration:

spark.eventLog.enabled           true                 
spark.eventLog.dir               /tmp/spark/eventlog  
spark.driver.memory              30g                  
spark.executor.memory            80g                  
spark.driver.maxResultSize       20g                  
spark.python.worker.memory       1g                   
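
For reference, the equivalent submit-time invocation would look something like this (the jar name is a placeholder):

spark-submit \
  --class GdeltTest \
  --master spark://127.0.0.1:7077 \
  --driver-memory 30g \
  --executor-memory 80g \
  --conf spark.driver.maxResultSize=20g \
  gdelt-test.jar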

micvbang commented on September 1, 2024

Huh. I just tried the OSM dataset, using 700k points located in the UK. With this dataset, I can do the join from Python (I have not tested Scala yet) in 6 minutes.

I believe that the dataset is making the difference. I'm not yet sure what is causing the slow queries, though.

micvbang commented on September 1, 2024

Yep, I'm almost positive the dataset is making the difference. I just successfully performed a join on a different dataset of 2.5 million points in 16 minutes.

micvbang commented on September 1, 2024

After investigating my dataset, I found that it contained points that were repeated tens of thousands of times. Since a point repeated k times matches itself in k(k-1)/2 pairs at any query distance, this generated billions of rows in the spatial join, causing the slow queries.
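
I spotted the duplicates with something like the following (a minimal sketch against the df1 DataFrame from the Scala program above; the 1000-row threshold is arbitrary):

import org.apache.spark.sql.functions.{col, desc}

// Count how many rows share each exact coordinate pair and list the worst offenders.
val dupes = df1.groupBy("euc_lon", "euc_lat")
  .count()
  .filter(col("count") > 1000)
  .orderBy(desc("count"))
dupes.show()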

Thank you so much for your help in investigating my problem!

dongx-psu commented on September 1, 2024

Yeah, I think that is exactly the problem. GDELT's spatial tags are not exact coordinates: it looks like they just search the place name in Google Maps and assign a general centroid point for the area, which is why you get so many duplicates.

So the reason it is slow is simply that your output size is too big; with that many result rows, there is no way it can be fast.

micvbang commented on September 1, 2024

That is exactly the same conclusion that I arrived at!

micvbang commented on September 1, 2024

When you said that I don't have to build indexes for joins, why is that?

Is it because indexes are built on the fly for joins? Or does it have to do with the open issue that the left side of a join is always repartitioned, regardless of whether it has any indexes?

dongx-psu commented on September 1, 2024

It is a legacy issue: we needed to include partitioning and indexing time in the paper's results, so the join always repartitions and builds local indexes, whether or not a pre-built index exists.
