Comments (15)

zouzias avatar zouzias commented on June 10, 2024

The storage type stores the Lucene index on disk, so there is not much memory overhead.

I think your problem might be the distribution skew of your data. Can you share the output of the following group-bys?

B.groupBy(blockingFields).count.show()

and

A.groupBy(blockingFields).count.show()

Order the counts descending.

It could be that the most popular blockingFields values form quite large partitions.
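For convenience, here is the same skew check with the descending ordering folded in; it assumes A and B are the two DataFrames from this thread and that blockingFields is a single column name (a hypothetical countryCode is used below):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Show the largest blocks first; heavy skew shows up as a few huge counts.
def showBlockSizes(df: DataFrame, blockingField: String): Unit =
  df.groupBy(blockingField)
    .count()
    .orderBy(desc("count"))
    .show(20, truncate = false)

showBlockSizes(B, "countryCode") // hypothetical blocking column name
showBlockSizes(A, "countryCode")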

yeikel avatar yeikel commented on June 10, 2024

I am running:

val linkedResults = LuceneRDD.blockEntityLinkage(B, A, linkerQuery, blockingFields, blockingFields, 1000, indexAnalyzer = "Standard")

I have two datasets:

a. A contains 117280449 records partitioned over 532 partitions.
b. B contains 353142 records partitioned over 2 partitions.
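For readability, here is a lightly annotated version of that call; the comments reflect how the arguments appear to be used in this thread, not the library's official documentation:

// Roles inferred from this thread; consult spark-lucenerdd's docs for exact semantics.
val linkedResults = LuceneRDD.blockEntityLinkage(
  B,               // smaller dataset (353142 records)
  A,               // larger dataset (117280449 records)
  linkerQuery,     // presumably maps a row to a Lucene query string
  blockingFields,  // blocking columns on B
  blockingFields,  // blocking columns on A
  1000,            // top-K candidates returned per query
  indexAnalyzer = "Standard"
)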

zouzias avatar zouzias commented on June 10, 2024

Can you use more partitions than 2 on the second dataset? You may need to increase your executors' memory and test again.
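A minimal sketch of that suggestion; B is the smaller DataFrame from above, and the partition count of 200 is an arbitrary placeholder to tune for your cluster:

// Spread B's 353142 records over more partitions than 2, so indexing
// and linking are parallelized across more tasks.
val bRepartitioned = B.repartition(200) // placeholder value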

yeikel avatar yeikel commented on June 10, 2024

I will definitely try, but could you please explain why I should? The exception seems to occur at index time.

Could you please clarify the storage type?

zouzias avatar zouzias commented on June 10, 2024

In more detail: disk mode memory-maps the index; see

https://github.com/zouzias/spark-lucenerdd/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/store/IndexStorable.scala#L75
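For illustration, a generic Lucene sketch of what memory-mapping the index means; this is not the library's exact code, see the linked IndexStorable.scala for the real logic:

import java.nio.file.Files
import org.apache.lucene.store.MMapDirectory

// The index files live on disk; MMapDirectory maps them into virtual memory,
// so reads are served by the OS page cache rather than the JVM heap.
val indexPath = Files.createTempDirectory("lucene-index") // placeholder location
val directory = new MMapDirectory(indexPath)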

yeikel avatar yeikel commented on June 10, 2024

> The storage type stores the Lucene index on disk, so there is not much memory overhead.
>
> I think your problem might be the distribution skew of your data. Can you share the output of the following group-bys?
>
> B.groupBy(blockingFields).count.show()
>
> and
>
> A.groupBy(blockingFields).count.show()
>
> Order the counts descending.
>
> It could be that the most popular blockingFields values form quite large partitions.

You are correct. My data is very skewed around the blocker for both datasets.

The dataset contains data points tied to specific countries, and I used the country code as the blocker since I don't need to search data points for countries I am not interested in. For some countries I have more than 50 million data points, and for others fewer than 50.

Is there any way to use the blocker and then repartition afterwards to avoid the skewed data? Or what else would you recommend?

zouzias avatar zouzias commented on June 10, 2024

You need to extract a set of columns where the blocker makes the data as uniform as possible (over the blocks). That is a data problem; the library cannot help you here, only domain expertise can.

yeikel avatar yeikel commented on June 10, 2024

I don't quite understand this sentence: "You need to extract a set of columns where the blocker makes the data as uniform as possible". Besides the partition size, why is this needed in the context of linkage?

For example, the blocker I selected is the same for both datasets, so based on my understanding of the blocking technique, this field is a great candidate for blocking.

As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?

zouzias avatar zouzias commented on June 10, 2024

> I don't quite understand this sentence: "You need to extract a set of columns where the blocker makes the data as uniform as possible".

For example, if you have extra information like country and city in your datasets, using only the country field as a blocker is not a good idea. It is better to block on both the city and country fields and split the block sizes even further, as sketched below.

> Besides the partition size, why is this needed in the context of linkage?

It is needed for efficiency. Linking 1 billion records against 1 billion records is not feasible. Again, see the slides that I mentioned in another issue.

> As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?

The skewed partitions will only hurt the performance of the linker, i.e., the execution time will be high if there is one large block.
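A minimal sketch of such a composite blocking key, assuming hypothetical country and city columns on an input DataFrame df; concatenating the two fields yields many smaller blocks instead of one huge block per country:

import org.apache.spark.sql.functions.{col, concat_ws}

// Hypothetical column names; replace with your own schema.
val withBlockKey = df.withColumn("blockKey", concat_ws("|", col("country"), col("city")))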

yeikel avatar yeikel commented on June 10, 2024

> I don't quite understand this sentence: "You need to extract a set of columns where the blocker makes the data as uniform as possible".
>
> For example, if you have extra information like country and city in your datasets, using only the country field as a blocker is not a good idea. It is better to block on both the city and country fields and split the block sizes even further.

While I understand this, I really cannot guarantee the quality of my dataset for any field other than country code.

> Besides the partition size, why is this needed in the context of linkage?
>
> It is needed for efficiency. Linking 1 billion records against 1 billion records is not feasible. Again, see the slides that I mentioned in another issue.

I am a little bit confused about the concept of the linker vs the blocker. Will the linker compare 1 billion vs 1 billion?

> As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?
>
> The skewed partitions will only hurt the performance of the linker, i.e., the execution time will be high if there is one large block.

If this is the case, what strategy can I apply to keep the blocker I know works and repartition on it?

zouzias avatar zouzias commented on June 10, 2024

> While I understand this, I really cannot guarantee the quality of my dataset for any field other than country code.

Fair enough; you cannot apply blocking on fields that are not "clean" or categorical.

> I am a little bit confused about the concept of the linker vs the blocker. Will the linker compare 1 billion vs 1 billion?

No, it won't. I need to make the documentation clearer. See slide 15 here: http://helios.mi.parisdescartes.fr/~themisp/publications/PapadakisPalpanas-TutorialWWW2018.pdf

Linkage happens only within blocks.
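A toy illustration of that point, with entirely made-up per-block sizes; only records sharing a block key are ever compared, so the candidate space is the sum over blocks rather than the full cross product:

// Hypothetical block sizes keyed by country code (made-up numbers).
val blockSizesA = Map("US" -> 50000000L, "DE" -> 2000000L, "GR" -> 300000L)
val blockSizesB = Map("US" -> 100000L, "DE" -> 30000L, "GR" -> 5000L)

// Candidate pairs with blocking: sum over blocks of |A_b| * |B_b|.
val blockedPairs = blockSizesA.map { case (key, sizeA) =>
  sizeA * blockSizesB.getOrElse(key, 0L)
}.sum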

> As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?
>
> The skewed partitions will only hurt the performance of the linker, i.e., the execution time will be high if there is one large block.
>
> If this is the case, what strategy can I apply to keep the blocker I know works and repartition on it?

You need to extract columns that split your data into blocks as evenly as possible, while linkage can still happen within blocks without hurting the accuracy of the linkage.

For example, if you have a column city with values "Athens", "Paris", "London", etc., even if it is not clean, you can extract an artificial new column with only the first character: "A", "P", "L". Assuming that first character is OK in terms of quality, you can block on it!
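A minimal sketch of that first-character trick, assuming a hypothetical city column on an input DataFrame df:

import org.apache.spark.sql.functions.{col, substring, upper}

// Derive an artificial blocking column from the first character of city;
// even a dirty city column usually has a usable first letter.
val withPrefix = df.withColumn("cityPrefix", upper(substring(col("city"), 1, 1)))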

yeikel avatar yeikel commented on June 10, 2024

It is still not clear to me how the linker is executed if there are no blockers. Could you please clarify?

zouzias avatar zouzias commented on June 10, 2024

If you use the linkage without blocking fields, i.e., LuceneRDD.linkDataFrame(), then all possible pairs of records are considered in the linkage process.
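To put that in perspective with the dataset sizes from earlier in this thread, the unblocked candidate space is the full cross product:

// Without blocking: every record of A is paired with every record of B.
val candidatePairs = 117280449L * 353142L // ≈ 4.14e13 candidate pairs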

yeikel avatar yeikel commented on June 10, 2024

@zouzias Makes sense, thank you for clarifying.

yeikel avatar yeikel commented on June 10, 2024

For a skewed dataset, where the blocker makes uneven partitions, is there any way to repartition even further? The main goal is to increase parallelism.
