Description: At Amazon product search we use taxonomy facets for th

Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? (CLOSED)

mikemccand commented on May 21, 2024

Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure?

from lucene.

Comments (8)

msfroh commented on May 21, 2024

If nobody else is working on this, I think I'd like to take it!

stefanvodita commented on May 21, 2024

@msfroh - I was looking into this as well and had some thoughts about how to do it.

We could replace ParallelTaxonomyArrays with a new interface that offers three operations for each of the arrays:

interface ChunkedParallelTaxonomyArrays {

  /** Record new entry. */
  void appendParent(int parent);

  /** Retrieve this ordinal's parent. From the user's perspective, this is like an array look-up. */
  int getParent(int ord);

  /** There are some places where we need to know how many parents exist in total. */
  int sizeParents();

  // The same three operations would be repeated for children and siblings.
}

To implement this interface, we could use an IntBlockPool. We would allocate new int buffers in the block pool as needed and preserve the block pool across DirectoryTaxonomyReader refreshes.

There are definitely some disadvantages with the block pool idea:

- We're preserving a mutable data structure across taxonomy refreshes. There is precedent though, with the caches in DirectoryTaxonomyReader.
- We would be slightly overallocating by having the last buffer in the pool not be completely used, but I think this is a good trade-off to take for the increased efficiency and simplicity.

What do you think, did you have something else in mind?
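
To make the append / look-up / size contract from this comment concrete, here is a minimal sketch. It does not use Lucene's IntBlockPool; `AppendOnlyChunkedInts`, `unusedSlots`, and the chunk size of 4 are hypothetical names and values chosen only for illustration. It also shows the over-allocation trade-off: the last buffer is only partly used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only int store backed by fixed-size buffers (not Lucene's IntBlockPool). */
final class AppendOnlyChunkedInts {
  private final int chunkSize;
  private final List<int[]> chunks = new ArrayList<>();
  private int size; // number of values appended so far

  AppendOnlyChunkedInts(int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    this.chunkSize = chunkSize;
  }

  /** Records a new entry, allocating a fresh buffer only when all existing ones are full. */
  void append(int value) {
    if (size == chunks.size() * chunkSize) {
      chunks.add(new int[chunkSize]);
    }
    chunks.get(size / chunkSize)[size % chunkSize] = value;
    size++;
  }

  /** Looks up the value stored for the given ordinal, like an array read. */
  int get(int ord) {
    if (ord < 0 || ord >= size) {
      throw new IndexOutOfBoundsException("ord=" + ord + ", size=" + size);
    }
    return chunks.get(ord / chunkSize)[ord % chunkSize];
  }

  /** Number of values appended so far. */
  int size() {
    return size;
  }

  /** Slots that are allocated but not yet used (the over-allocation in the last buffer). */
  int unusedSlots() {
    return chunks.size() * chunkSize - size;
  }
}

final class AppendOnlyChunkedIntsDemo {
  public static void main(String[] args) {
    AppendOnlyChunkedInts parents = new AppendOnlyChunkedInts(4);
    for (int ord = 0; ord < 6; ord++) {
      // Toy chain: ordinal 0 has no parent (-1), every other ordinal's parent is ord - 1.
      parents.append(ord == 0 ? -1 : ord - 1);
    }
    System.out.println("parent of 5: " + parents.get(5));
    System.out.println("size: " + parents.size());
    System.out.println("unused slots: " + parents.unusedSlots());
  }
}
```

With six entries and a chunk size of 4, two buffers are allocated and two slots of the second one stay empty, which is the "slightly overallocating" cost mentioned above.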

msfroh commented on May 21, 2024

> What do you think, did you have something else in mind?

Oh -- I didn't have anything in mind. I just saw the issue and thought, "Hey, I could figure out how to do that!" Sounds like you've got it in hand, though!

stefanvodita commented on May 21, 2024

I'd be happy to work together on it! If we go the route I was proposing, there's a non-trivial amount of work to do:

1. Create the new interface for taxonomy arrays and use it with taxonomy facets and in the taxo reader.
2. Augment the IntBlockPool with convenience methods that support this new use-case and implement the new taxo array interface using the IntBlockPool.

1 and 2 can be done independently, so we could each take one of those work streams. I'll start on it in the next few days, but feel free to jump in if you get the chance.

msfroh commented on May 21, 2024

I took a look and I think we might be able to do it a little easier:

public abstract class ParallelTaxonomyArrays {
  /** Read-only view over an int[][] that can be indexed like one long int[]. */
  public static final class ChunkedArray {
    private final int chunkSize;
    private final int[][] chunks;

    public ChunkedArray(int chunkSize, int[][] chunks) {
      this.chunkSize = chunkSize;
      this.chunks = chunks;
    }

    /** Returns the value at logical index i. */
    public int get(int i) {
      return chunks[i / chunkSize][i % chunkSize];
    }

    /** Returns the total capacity; assumes every chunk holds exactly chunkSize entries. */
    public int length() {
      return chunkSize * chunks.length;
    }
  }

  /** Sole constructor. */
  public ParallelTaxonomyArrays() {}

  public abstract ChunkedArray parents();
  public abstract ChunkedArray children();
  public abstract ChunkedArray siblings();
}

Then within TaxonomyIndexArrays, we can focus on building int[][] instances that get wrapped in ChunkedArray. I feel like IntBlockPool might be overkill?
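
As a rough illustration of "building `int[][]` instances", the sketch below splits a flat `int[]` into chunks and reads values back through the same divide/modulo indexing. `ChunkBuilderSketch`, `toChunks`, and `get` are hypothetical helpers, not code from TaxonomyIndexArrays, and trimming the last chunk to the remaining length is an assumption of this sketch.

```java
import java.util.Arrays;

/** Hypothetical helper: turns a flat int[] into int[][] chunks and reads values back. */
final class ChunkBuilderSketch {

  /** Splits flat into chunks of at most chunkSize entries; the last chunk may be shorter. */
  static int[][] toChunks(int[] flat, int chunkSize) {
    int chunkCount = (flat.length + chunkSize - 1) / chunkSize; // ceiling division
    int[][] chunks = new int[chunkCount][];
    for (int c = 0; c < chunkCount; c++) {
      int from = c * chunkSize;
      int to = Math.min(from + chunkSize, flat.length);
      chunks[c] = Arrays.copyOfRange(flat, from, to);
    }
    return chunks;
  }

  /** Reads logical index i from the chunked layout. */
  static int get(int[][] chunks, int chunkSize, int i) {
    return chunks[i / chunkSize][i % chunkSize];
  }

  public static void main(String[] args) {
    int[] parents = {-1, 0, 0, 1, 1, 2, 2}; // toy parent array for a small tree
    int[][] chunks = toChunks(parents, 3);
    System.out.println("chunks: " + chunks.length);
    System.out.println("parent of 6: " + get(chunks, 3, 6));
  }
}
```

Note that with a trimmed last chunk, a length computed as `chunkSize * chunks.length` would overstate the number of valid entries, so a real implementation has to pick one convention and stick to it.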

msfroh commented on May 21, 2024

I ended up running with that idea (sort of) and implemented this: #12995

The unit tests pass, but I don't think any of them allocate more than 8192 ordinals (the chunk size that I set).
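
For reference, the ordinals right around the first chunk boundary are the ones such tests would need to reach. The sketch below computes chunk index and offset for a chunk size of 8192; since 8192 is 2^13, a shift and a mask give the same result as division and modulo. `ChunkIndexMath` and its constants are hypothetical names, and whether the PR uses shifts or division is not stated in this thread.

```java
/** Hypothetical index arithmetic for a chunk size of 8192 (2^13). */
final class ChunkIndexMath {
  static final int CHUNK_BITS = 13;
  static final int CHUNK_SIZE = 1 << CHUNK_BITS; // 8192
  static final int CHUNK_MASK = CHUNK_SIZE - 1;

  /** Which chunk holds this ordinal; equivalent to ord / CHUNK_SIZE for ord >= 0. */
  static int chunkIndex(int ord) {
    return ord >> CHUNK_BITS;
  }

  /** Position inside that chunk; equivalent to ord % CHUNK_SIZE for ord >= 0. */
  static int offsetInChunk(int ord) {
    return ord & CHUNK_MASK;
  }

  public static void main(String[] args) {
    for (int ord : new int[] {8191, 8192, 8193}) {
      System.out.println(ord + " -> chunk " + chunkIndex(ord) + ", offset " + offsetInChunk(ord));
    }
  }
}
```

Ordinal 8191 is the last slot of chunk 0, and 8192 is the first slot of chunk 1, so a test with at least 8193 ordinals would cover both sides of the boundary.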

stefanvodita commented on May 21, 2024

Thanks @msfroh! The PR looks neat and you might be right that, while IntBlockPool basically maintains an int[][] like our ChunkedArray, it is a bit inconvenient to work with.
I left more detailed comments on the PR, but the high-level question is whether we've actually reduced the memory footprint during taxonomy refreshes. It's very possible I'm missing something, but right now it looks to me like we haven't improved on that front. Doing shallow copies of the old array without allocating new memory would solve it though.

msfroh commented on May 21, 2024

> It's very possible I'm missing something, but right now it looks to me like we haven't improved on that front. Doing shallow copies of the old array without allocating new memory would solve it though.

What you've missed is that I'm a big dum-dum 😁

Thanks for catching that! I refactored some code into a shared method (between the "reuse old arrays" case and the "start fresh with a TaxonomyReader" case) and foolishly applied the "start fresh" logic every time. I've fixed it in a subsequent commit (allocating chunks only starting from the index of the last chunk of the old array).

I also incorporated several of the other changes that you suggested. Thanks a lot!
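
The idea discussed above (reuse the old chunks on refresh and allocate only from the last chunk onward) might look roughly like the sketch below. `ChunkReuseSketch` and `grow` are hypothetical names, not the code in the PR, and the sketch assumes every chunk is allocated at full `chunkSize` and that `newSize >= oldSize`.

```java
/** Hypothetical refresh step: share full old chunks by reference, allocate only the tail. */
final class ChunkReuseSketch {

  /**
   * Returns chunks covering newSize values. Chunks of oldChunks that are completely
   * filled are shared by reference; a partially filled last chunk is copied into a
   * fresh chunk so the old array is never written to.
   */
  static int[][] grow(int[][] oldChunks, int oldSize, int newSize, int chunkSize) {
    int newChunkCount = (newSize + chunkSize - 1) / chunkSize; // ceiling division
    int[][] newChunks = new int[newChunkCount][];
    int fullOldChunks = oldSize / chunkSize;

    // Shallow copy: only the references to the full chunks are copied, not their contents.
    System.arraycopy(oldChunks, 0, newChunks, 0, fullOldChunks);

    for (int c = fullOldChunks; c < newChunkCount; c++) {
      newChunks[c] = new int[chunkSize];
      if (c < oldChunks.length) {
        // Carry over the values already stored in the old, partially filled chunk.
        System.arraycopy(oldChunks[c], 0, newChunks[c], 0, oldSize - c * chunkSize);
      }
    }
    return newChunks;
  }

  public static void main(String[] args) {
    int[][] oldChunks = {{0, 1, 2, 3}, {4, 5, 0, 0}}; // 6 values stored, chunk size 4
    int[][] grown = grow(oldChunks, 6, 10, 4);
    System.out.println("chunks after refresh: " + grown.length);
    System.out.println("first chunk shared: " + (grown[0] == oldChunks[0]));
    System.out.println("last old chunk shared: " + (grown[1] == oldChunks[1]));
  }
}
```

In this toy run only two new chunks are allocated regardless of how many full chunks the old array had, which is where the memory saving during a refresh would come from.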
