Comments (15)
The disk storage type stores the Lucene index on disk, so there is not much memory overhead.
I think your problem might be distribution skew in your data. Can you share the output of:
B.groupBy(blockingFields).count.show()
and
A.groupBy(blockingFields).count.show()
Order the counts in descending order.
It could be that the most popular blockingFields correspond to quite large partitions.
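For instance, a minimal sketch of that check with descending ordering, assuming blockingFields is a Seq[String] of column names:

```scala
import org.apache.spark.sql.functions.{col, desc}

// Minimal sketch: inspect the block-size distribution for skew.
// `blockingFields` is assumed to be a Seq[String] of column names.
B.groupBy(blockingFields.map(col): _*)
  .count()
  .orderBy(desc("count"))
  .show(20, truncate = false)
```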
I am running:
val linkedResults = LuceneRDD.blockEntityLinkage(B, A, linkerQuery, blockingFields, blockingFields, 1000, indexAnalyzer = "Standard")
I have two sets:
a. A contains 117280449 records partitioned over 532 partitions.
b. B contains 353142 records partitioned over 2 partitions.
Can you use more than 2 partitions on the second dataset? You may need to increase your executors' memory and test again.
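For example (the target of 200 partitions is an arbitrary illustrative value):

```scala
// Repartition the smaller dataset to increase parallelism before linkage.
val Brepartitioned = B.repartition(200)
```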
I will definitely try, but could you please explain why I should try this? The exception seems to happen at index time.
Could you please clarify the storage type?
In more detail: disk mode memory-maps the index, see:
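For context, a plain-Lucene sketch of what disk mode typically means; this is not necessarily spark-lucenerdd's exact code:

```scala
import java.nio.file.Files
import org.apache.lucene.store.MMapDirectory

// Plain-Lucene sketch (not necessarily spark-lucenerdd's exact code):
// MMapDirectory memory-maps the on-disk index files instead of holding
// them on the JVM heap, which keeps heap overhead low.
val indexDir = new MMapDirectory(Files.createTempDirectory("lucene-index"))
```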
> The disk storage type stores the Lucene index on disk, so there is not much memory overhead.
> I think your problem might be distribution skew in your data. Can you share the output of:
> B.groupBy(blockingFields).count.show()
> and
> A.groupBy(blockingFields).count.show()
> Order the counts in descending order.
> It could be that the most popular blockingFields correspond to quite large partitions.
You are correct with your guess. My data is very skewed around the blocker for both datasets.
This dataset contains data points for specific countries, and I used the country code as the blocker because I don't need to search data points for countries I am not interested in. For some countries I have more than 50 million data points, and for others fewer than 50.
Is there any way to use the blocker and repartition afterwards to avoid the skewed data? Or what else would you recommend?
You need to extract a set of columns where the blocker makes the data as uniform as possible (over the blocks). That is a data problem; the library cannot help you here, only domain expertise can.
I don't quite understand this sentence:
> You need to extract a set of columns where the blocker makes the data as uniform as possible
Besides the partition size, why is this needed in the context of linkage?
For example, the blocker I selected is the same for both datasets, so based on my understanding of the blocking technique, this field is a great candidate for blocking.
As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?
> I don't quite understand this sentence: "You need to extract a set of columns where the blocker makes the data as uniform as possible"

For example, if you have extra information like country and city in your datasets, using only the country field as a blocker is not a good idea. It is better to split the blocks using both the city and country fields, splitting the block sizes even further (see the sketch below).
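For illustration, a minimal sketch that reuses the blockEntityLinkage call from earlier in the thread with two blocking columns; the column names country and city are assumptions:

```scala
// Hypothetical sketch: block on (country, city) instead of country alone,
// so blocks are smaller and more uniform. Column names are assumptions.
val blockingFields = Array("country", "city")
val linkedResults = LuceneRDD.blockEntityLinkage(
  B, A, linkerQuery, blockingFields, blockingFields, 1000, indexAnalyzer = "Standard")
```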
> Besides the partition size, why is this needed in the context of linkage?

It is needed for efficiency. Linkage of 1 billion records against 1 billion records is not possible. Again, see the slides that I mentioned in the other issue.

> As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?

The skewed partitions will only hurt the performance of the linker, i.e., the execution time will be high if there is one large block.
> I don't quite understand this sentence: "You need to extract a set of columns where the blocker makes the data as uniform as possible"
>
> For example, if you have extra information like country and city in your datasets, using only the country field as a blocker is not a good idea. It is better to split the blocks using both the city and country fields, splitting the block sizes even further.

While I understand this, I really cannot guarantee the quality of my dataset for fields other than the country code.

> Besides the partition size, why is this needed in the context of linkage?
>
> It is needed for efficiency. Linkage of 1 billion records against 1 billion records is not possible. Again, see the slides that I mentioned in the other issue.

I am a little bit confused about the concept of the linker vs the blocker. Will the linker compare 1 billion against 1 billion records?

> As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?
>
> The skewed partitions will only hurt the performance of the linker, i.e., the execution time will be high if there is one large block.

If this is the case, what strategy can I apply to use the blocker I know works and repartition on it?
> While I understand this, I really cannot guarantee the quality of my dataset for fields other than the country code.

Fair enough; you cannot apply blocking on fields that are not "clean" or categorical.

> I am a little bit confused about the concept of the linker vs the blocker. Will the linker compare 1 billion against 1 billion records?

No, it won't. I need to make the documentation clearer. See slide 15 here: http://helios.mi.parisdescartes.fr/~themisp/publications/PapadakisPalpanas-TutorialWWW2018.pdf
Linkage happens only within blocks.

> As described, the main issue is that this blocker creates skewed partitions. Should I add this blocker to the query itself?
> The skewed partitions will only hurt the performance of the linker, i.e., the execution time will be high if there is one large block.
> If this is the case, what strategy can I apply to use the blocker I know works and repartition on it?

You need to extract columns that block your data as much as possible, while linkage can still happen within blocks without hurting the accuracy of the linkage.
For example, if you have a column city with values "Athens", "Paris", "London", etc., even if it is not clean, you can extract an artificial new column with only the first character: "A", "P", "L". Assuming the first character is OK in terms of quality, you can block on it! A sketch follows below.
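As a sketch of that idea in Spark (column names are assumptions):

```scala
import org.apache.spark.sql.functions.{col, substring, upper}

// Hedged sketch: derive an artificial blocking column from the first
// character of a (possibly noisy) city column. Column names are assumptions.
val AwithBlock = A.withColumn("cityFirstChar", upper(substring(col("city"), 1, 1)))
val BwithBlock = B.withColumn("cityFirstChar", upper(substring(col("city"), 1, 1)))
// The derived column can then be included in the blocking fields,
// e.g. Array("country", "cityFirstChar").
```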
It is still not clear to me how the linker is executed if there are no blockers. Could you please clarify?
If you use the linkage without blocking fields, i.e., LuceneRDD.linkDataFrame(), then all possible pairs of records are considered in the linkage process.
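A rough sketch of how such an unblocked call might look; the linker-as-Row-to-query-string convention and the exact signature of linkDataFrame are assumptions here:

```scala
import org.apache.spark.sql.Row
import org.zouzias.spark.lucenerdd.LuceneRDD
import org.zouzias.spark.lucenerdd._ // implicit conversions for indexing

// Hedged sketch, not the library's verbatim API: index all of B, then link
// every record of A against the full index (no blocking fields).
val luceneRDD = LuceneRDD(B)
val linker = (row: Row) => s"name:${row.getAs[String]("name")}" // hypothetical query generator
val linked = luceneRDD.linkDataFrame(A, linker, topK = 10)
```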
@zouzias Makes sense, thank you for clarifying.
For a skewed dataset, where the blocker creates uneven partitions, is there any way to repartition even further? The main goal is to increase parallelism.
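One generic Spark approach to this (a standard skew-mitigation sketch, not a spark-lucenerdd feature confirmed in this thread) is to salt the blocking key so that one oversized block splits into several smaller ones; column names and the salt count are assumptions:

```scala
import org.apache.spark.sql.functions.{col, concat, lit, rand}

// Generic skew-mitigation sketch (standard Spark, not a spark-lucenerdd API):
// append a random salt to the blocking key so a huge country block is split
// into `numSalts` smaller blocks. `spark` is the active SparkSession.
val numSalts = 16
val Asalted = A.withColumn("saltedBlock",
  concat(col("country"), lit("_"), (rand() * numSalts).cast("int")))

// Replicate the smaller dataset across all salt values so every salted
// block of A still sees the matching records of B.
val salts = spark.range(numSalts).select(col("id").cast("int").as("salt"))
val Bsalted = B.crossJoin(salts)
  .withColumn("saltedBlock", concat(col("country"), lit("_"), col("salt")))
```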