Comments (15)

zouzias avatar zouzias commented on June 10, 2024

The storage type stores the Lucene index on disk, so there is not much memory overhead.

I think your problem might be the distribution skew of your data. Can you share the output of the following group-bys?

B.groupBy(blockingFields).count.show()

and

A.groupBy(blockingFields).count.show()

Order the counts descending.

It could be that the most popular blockingFields values form quite large partitions.
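For convenience, here is the same skew check with the descending ordering folded in; it assumes A and B are the two DataFrames from this thread and that blockingFields is a single column name (a hypothetical countryCode is used below):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Show the largest blocks first; heavy skew shows up as a few huge counts.
def showBlockSizes(df: DataFrame, blockingField: String): Unit =
  df.groupBy(blockingField)
    .count()
    .orderBy(desc("count"))
    .show(20, truncate = false)

showBlockSizes(B, "countryCode") // hypothetical blocking column name
showBlockSizes(A, "countryCode")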

yeikel avatar yeikel commented on June 10, 2024

I am running:

val linkedResults = LuceneRDD.blockEntityLinkage(B, A, linkerQuery, blockingFields, blockingFields, 1000, indexAnalyzer = "Standard")

I have two datasets:

a. A contains 117280449 records partitioned over 532 partitions.
b. B contains 353142 records partitioned over 2 partitions.
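For readability, here is a lightly annotated version of that call; the comments reflect how the arguments appear to be used in this thread, not the library's official documentation:

// Roles inferred from this thread; consult spark-lucenerdd's docs for exact semantics.
val linkedResults = LuceneRDD.blockEntityLinkage(
  B,               // smaller dataset (353142 records)
  A,               // larger dataset (117280449 records)
  linkerQuery,     // presumably maps a row to a Lucene query string
  blockingFields,  // blocking columns on B
  blockingFields,  // blocking columns on A
  1000,            // top-K candidates returned per query
  indexAnalyzer = "Standard"
)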

zouzias avatar zouzias commented on June 10, 2024

Can you use more partitions than 2 on the second dataset? You may need to increase your executors' memory and test again.
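A minimal sketch of that suggestion; B is the smaller DataFrame from above, and the partition count of 200 is an arbitrary placeholder to tune for your cluster:

// Spread B's 353142 records over more partitions than 2, so indexing
// and linking are parallelized across more tasks.
val bRepartitioned = B.repartition(200) // placeholder value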

yeikel avatar yeikel commented on June 10, 2024

I will definitely try, but could you please explain why I should? The exception seems to occur at index time.

Could you please clarify the storage type?

zouzias avatar zouzias commented on June 10, 2024

In more detail: disk mode memory-maps the index; see

https://github.com/zouzias/spark-lucenerdd/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/store/IndexStorable.scala#L75
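For illustration, a generic Lucene sketch of what memory-mapping the index means; this is not the library's exact code, see the linked IndexStorable.scala for the real logic:

import java.nio.file.Files
import org.apache.lucene.store.MMapDirectory

// The index files live on disk; MMapDirectory maps them into virtual memory,
// so reads are served by the OS page cache rather than the JVM heap.
val indexPath = Files.createTempDirectory("lucene-index") // placeholder location
val directory = new MMapDirectory(indexPath)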

yeikel avatar yeikel commented on June 10, 2024

> The storage type stores the Lucene index on disk, so there is not much memory overhead.
>
> I think your problem might be the distribution skew of your data. Can you share the output of the following group-bys?
>
> B.groupBy(blockingFields).count.show()
>
> and
>
> A.groupBy(blockingFields).count.show()
>
> Order the counts descending.
>
> It could be that the most popular blockingFields values form quite large partitions.

You are correct. My data is very skewed around the blocker for both datasets.

The dataset contains data points tied to specific countries, and I used the country code as the blocker since I don't need to search data points for countries I am not interested in. For some countries I have more than 50 million data points, and for others fewer than 50.

Is there any way to use the blocker and then repartition afterwards to avoid the skewed data? Or what else would you recommend?

zouzias avatar zouzias commented on June 10, 2024

You need to extract a set of columns where the blocker makes the data as uniform as possible (over the blocks). That is a data problem; the library cannot help you here, only domain expertise can.

yeikel avatar yeikel commented on June 10, 2024

I don't quite understand this sentence: "You need to extract a set of columns where the blocker makes the data as uniform as possible". Besides the partition size, why is this needed in the context of linkage?

For example, the blocker I selected is the same for both datasets, so based on my understanding of the blocking technique, this field is a great candidate for blocking.

As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?

zouzias avatar zouzias commented on June 10, 2024

> I don't quite understand this sentence: "You need to extract a set of columns where the blocker makes the data as uniform as possible".

For example, if you have extra information like country and city in your datasets, using only the country field as a blocker is not a good idea. It is better to block on both the city and country fields and split the block sizes even further, as sketched below.

> Besides the partition size, why is this needed in the context of linkage?

It is needed for efficiency. Linking 1 billion records against 1 billion records is not feasible. Again, see the slides that I mentioned in another issue.

> As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?

The skewed partitions will only hurt the performance of the linker, i.e., the execution time will be high if there is one large block.
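A minimal sketch of such a composite blocking key, assuming hypothetical country and city columns on an input DataFrame df; concatenating the two fields yields many smaller blocks instead of one huge block per country:

import org.apache.spark.sql.functions.{col, concat_ws}

// Hypothetical column names; replace with your own schema.
val withBlockKey = df.withColumn("blockKey", concat_ws("|", col("country"), col("city")))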

yeikel avatar yeikel commented on June 10, 2024

> I don't quite understand this sentence: "You need to extract a set of columns where the blocker makes the data as uniform as possible".
>
> For example, if you have extra information like country and city in your datasets, using only the country field as a blocker is not a good idea. It is better to block on both the city and country fields and split the block sizes even further.

While I understand this, I really cannot guarantee the quality of my dataset for any field other than country code.

> Besides the partition size, why is this needed in the context of linkage?
>
> It is needed for efficiency. Linking 1 billion records against 1 billion records is not feasible. Again, see the slides that I mentioned in another issue.

I am a little bit confused about the concept of the linker vs the blocker. Will the linker compare 1 billion vs 1 billion?

> As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?
>
> The skewed partitions will only hurt the performance of the linker, i.e., the execution time will be high if there is one large block.

If this is the case, what strategy can I apply to keep the blocker I know works and repartition on it?

zouzias avatar zouzias commented on June 10, 2024

> While I understand this, I really cannot guarantee the quality of my dataset for any field other than country code.

Fair enough; you cannot apply blocking on fields that are not "clean" or categorical.

> I am a little bit confused about the concept of the linker vs the blocker. Will the linker compare 1 billion vs 1 billion?

No, it won't. I need to make the documentation clearer. See slide 15 here: http://helios.mi.parisdescartes.fr/~themisp/publications/PapadakisPalpanas-TutorialWWW2018.pdf

Linkage happens only within blocks.
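A toy illustration of that point, with entirely made-up per-block sizes; only records sharing a block key are ever compared, so the candidate space is the sum over blocks rather than the full cross product:

// Hypothetical block sizes keyed by country code (made-up numbers).
val blockSizesA = Map("US" -> 50000000L, "DE" -> 2000000L, "GR" -> 300000L)
val blockSizesB = Map("US" -> 100000L, "DE" -> 30000L, "GR" -> 5000L)

// Candidate pairs with blocking: sum over blocks of |A_b| * |B_b|.
val blockedPairs = blockSizesA.map { case (key, sizeA) =>
  sizeA * blockSizesB.getOrElse(key, 0L)
}.sum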

> As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?
>
> The skewed partitions will only hurt the performance of the linker, i.e., the execution time will be high if there is one large block.
>
> If this is the case, what strategy can I apply to keep the blocker I know works and repartition on it?

You need to extract columns that split your data into blocks as evenly as possible, while linkage can still happen within blocks without hurting the accuracy of the linkage.

For example, if you have a column city with values "Athens", "Paris", "London", etc., even if it is not clean, you can extract an artificial new column with only the first character: "A", "P", "L". Assuming that first character is OK in terms of quality, you can block on it!
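A minimal sketch of that first-character trick, assuming a hypothetical city column on an input DataFrame df:

import org.apache.spark.sql.functions.{col, substring, upper}

// Derive an artificial blocking column from the first character of city;
// even a dirty city column usually has a usable first letter.
val withPrefix = df.withColumn("cityPrefix", upper(substring(col("city"), 1, 1)))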

yeikel avatar yeikel commented on June 10, 2024

It is still not clear to me how the linker is executed if there are no blockers. Could you please clarify?

zouzias avatar zouzias commented on June 10, 2024

If you use the linkage without blocking fields, i.e., LuceneRDD.linkDataFrame(), then all possible pairs of records are considered in the linkage process.
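To put that in perspective with the dataset sizes from earlier in this thread, the unblocked candidate space is the full cross product:

// Without blocking: every record of A is paired with every record of B.
val candidatePairs = 117280449L * 353142L // ≈ 4.14e13 candidate pairs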

yeikel avatar yeikel commented on June 10, 2024

@zouzias Makes sense, thank you for clarifying.

yeikel avatar yeikel commented on June 10, 2024

For a skewed dataset, where the blocker makes uneven partitions, is there any way to repartition even further? The main goal is to increase parallelism.
