Comments (8)
If nobody else is working on this, I think I'd like to take it!
from lucene.
@msfroh - I was looking into this as well and had some thoughts about how to do it.
We could replace ParallelTaxonomyArrays
with a new interface that offers three operations for each of the arrays:
interface ChunkedParallelTaxonomyArrays {
/* Record new entry. */
public void appendParent(int parent);
/* Retrieve this ordinal's parent. From the user's perspective, this is like an array look-up. */
public int getParent(int ord);
/* There are some places where we need to know how many parents exist in total. */
public int sizeParents();
// Same for children and siblings
...
}
To implement this inteface, we could use an IntBlockPool
. We would allocate new int buffers in the block pool as needed and preserve the block pool across DirectoryTaxonomyReader
refreshes.
There are definitely some disadvantages with the block pool idea:
- We're preserving a mutable data-structure across taxonomy refreshes. There is precedent though, with the caches in
DirectoryTaxonomyReader
. - We would be slightly overallocating by having the last buffer in the pool not be completely used, but I think this is a good trade-off to take for the increased efficiency and simplicity.
What do you think, did you have something else in mind?
from lucene.
What do you think, did you have something else in mind?
Oh -- I didn't have anything in mind. I just saw the issue and thought, "Hey, I could figure out how to do that!" Sounds like you've got it in hand, though!
from lucene.
I'd be happy to work together on it! If we go the route I was proposing, there's a non-trivial amount of work to do:
- Create the new interface for taxonomy arrays and use it with taxonomy facets and in the taxo reader.
- Augment the
IntBlockPool
with conveience methods that support this new use-case and implement the new taxo array interface using theIntBlockPool
.
1 and 2 can be done independently, so we could each take one of those work streams. I'll start on it in the next few days, but feel free to jump in if you get the chance.
from lucene.
I took a look and I think we might be able to do it a little easier:
public abstract class ParallelTaxonomyArrays {
public class ChunkedArray {
private final int chunkSize;
private final int[][] chunks;
public ChunkedArray(int chunkSize, int[][] chunks) {
this.chunkSize = chunkSize;
this.chunks = chunks;
}
public int get(int i) {
int chunkNum = i / chunkSize;
return chunks[chunkNum][i - (chunkNum * chunkSize)];
}
public int length() {
return chunkSize * chunks.length;
}
}
/** Sole constructor. */
public ParallelTaxonomyArrays() {}
public abstract ChunkedArray parents();
public abstract ChunkedArray children();
public abstract ChunkedArray siblings();
}
Then within TaxonomyIndexArrays
, we can focus on building int[][]
instances that get wrapped in ChunkedArray
. I feel like IntBlockPool
might be overkill?
from lucene.
I ended up running with that idea (sort of) and implemented this: #12995
The unit tests pass, but I don't think any of them allocate more than 8192 ordinals (the size of chunk that I set).
from lucene.
Thanks @msfroh! The PR looks neat and you might be right that, while IntBlockPool
basically maintains an int[][]
like our ChunkedArray
, it is a bit inconveient to work with.
I left more detailed comments on the PR, but the high-level question is if we've actually reduced the memory footprint during taxonomy refreshes. It's very possible I'm missing something, but right now it looks to me like we haven't improved on that front. Doing shallow copies of the old array without allocating new memory would solve it though.
from lucene.
It's very possible I'm missing something, but right now it looks to me like we haven't improved on that front. Doing shallow copies of the old array without allocating new memory would solve it though.
What you've missed is that I'm a big dum-dum 😁
Thanks for catching that! I refactored some code into a shared method (between the "reuse old arrays" case and the "start fresh with a TaxonomyReader" case) and foolishly applied the "start fresh" logic every time. I've fixed it in a subsequent commit (allocating chunks only starting from the index of the last chunk of the old array).
I also incorporated several of the other changes that you suggested. Thanks a lot!
from lucene.
Related Issues (20)
- FSDirectory stuck at open(Path path) method when ran from .jar file HOT 6
- Use Reverse RetentionQuerySupplier to speed up SoftDeletesRetentionMergePolicy#numDeletesToMerge HOT 1
- Introduce Bloom Filter as non-experimental/core postings format HOT 6
- Update package info for HNSW
- NullPointerException in IndexSearcher.search() when searching with SpanfirstQuery and a customized collector HOT 1
- Test an FST bytes store that re-reverses (reads bytes forward) on-the-fly
- migrate OpenNLP 'ant train-test-models' to Gradle
- Exploring GPU based kNN vector search
- Add post-filter capability to `SynonymGraphFilter`
- `SynonymGraphFilter` should read FSTs off-heap? HOT 2
- org.apache.lucene.search.TestByteVectorSimilarityQuery.testApproximate failing intermittently HOT 1
- Use Kahan summation for float aggregations to reduce errors HOT 3
- DV update files referenced by merge will be deleted by concurrent flush HOT 3
- join: repeat BytesRefHash.sort() in TermsQuery after TermsIncludingScoreQuery HOT 3
- SegmentDocValuesProducer checkIntegrity might open a dropped segment HOT 1
- What if we pick up segments in segment size's ascending order in TieredMergePolicy.doFindMerges? HOT 4
- Advance to first position of 1 in BitSet before iterating the lead in BitSetConjunctionDISI? HOT 2
- Reproducible failure in TestGrouping.testRandom HOT 4
- Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from lucene.