
Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure? (lucene, closed)

mikemccand commented on May 21, 2024
Taxonomy facets: can we change massive `int[]` for parent/child/sibling tree to paged/block `int[]` to reduce RAM pressure?

from lucene.

Comments (8)

msfroh commented on May 21, 2024

If nobody else is working on this, I think I'd like to take it!

stefanvodita commented on May 21, 2024

@msfroh - I was looking into this as well and had some thoughts about how to do it.

We could replace ParallelTaxonomyArrays with a new interface that offers three operations for each of the arrays:

interface ChunkedParallelTaxonomyArrays {

  /* Record new entry. */
  public void appendParent(int parent);

  /* Retrieve this ordinal's parent. From the user's perspective, this is like an array look-up. */
  public int getParent(int ord);

  /* There are some places where we need to know how many parents exist in total. */
  public int sizeParents();

  // Same for children and siblings
  ...
}

To implement this interface, we could use an IntBlockPool. We would allocate new int buffers in the block pool as needed and preserve the block pool across DirectoryTaxonomyReader refreshes.

There are definitely some disadvantages with the block pool idea:

  1. We're preserving a mutable data structure across taxonomy refreshes. There is precedent though, with the caches in DirectoryTaxonomyReader.
  2. We would be slightly overallocating by having the last buffer in the pool not be completely used, but I think this is a good trade-off to take for the increased efficiency and simplicity.
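To make the append / get / size contract and the over-allocation trade-off concrete, here is a minimal sketch in plain Java. It does not use Lucene's IntBlockPool; the class name `ChunkedIntList` and the method `unusedSlots` are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

/** Hypothetical append-only int list backed by fixed-size chunks (not Lucene's IntBlockPool). */
final class ChunkedIntList {
  private final int chunkSize;
  private int[][] chunks = new int[0][];
  private int size; // number of values appended so far

  ChunkedIntList(int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    this.chunkSize = chunkSize;
  }

  /** Records a new entry, allocating one more chunk only when the last one is full. */
  void append(int value) {
    if (size == chunks.length * chunkSize) {
      // Copies the chunk references only; existing chunks are not reallocated.
      chunks = Arrays.copyOf(chunks, chunks.length + 1);
      chunks[chunks.length - 1] = new int[chunkSize];
    }
    chunks[size / chunkSize][size % chunkSize] = value;
    size++;
  }

  /** Looks up an entry by ordinal, like an array read. */
  int get(int ord) {
    if (ord < 0 || ord >= size) {
      throw new IndexOutOfBoundsException("ord=" + ord);
    }
    return chunks[ord / chunkSize][ord % chunkSize];
  }

  /** Number of entries recorded so far. */
  int size() {
    return size;
  }

  /** Allocated but unused slots in the last chunk; always less than chunkSize. */
  int unusedSlots() {
    return chunks.length * chunkSize - size;
  }
}
```

With a chunk size of 4 and 10 appended values, this sketch allocates three chunks (12 slots) and leaves 2 unused, so the waste is bounded by one chunk regardless of how large the list grows.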

What do you think, did you have something else in mind?

msfroh commented on May 21, 2024

> What do you think, did you have something else in mind?

Oh -- I didn't have anything in mind. I just saw the issue and thought, "Hey, I could figure out how to do that!" Sounds like you've got it in hand, though!

stefanvodita commented on May 21, 2024

I'd be happy to work together on it! If we go the route I was proposing, there's a non-trivial amount of work to do:

  1. Create the new interface for taxonomy arrays and use it with taxonomy facets and in the taxo reader.
  2. Augment the IntBlockPool with convenience methods that support this new use-case and implement the new taxo array interface using the IntBlockPool.

1 and 2 can be done independently, so we could each take one of those work streams. I'll start on it in the next few days, but feel free to jump in if you get the chance.

msfroh commented on May 21, 2024

I took a look and I think we might be able to do it a little easier:

public abstract class ParallelTaxonomyArrays {
  /** Read-only view over fixed-size chunks that behaves like one large int[]. */
  public static class ChunkedArray {
    private final int chunkSize;
    private final int[][] chunks;

    public ChunkedArray(int chunkSize, int[][] chunks) {
      this.chunkSize = chunkSize;
      this.chunks = chunks;
    }

    /** Returns the value at logical index i. */
    public int get(int i) {
      int chunkNum = i / chunkSize;
      return chunks[chunkNum][i - (chunkNum * chunkSize)];
    }

    /** Total capacity, assuming every chunk holds exactly chunkSize ints. */
    public int length() {
      return chunkSize * chunks.length;
    }
  }

  /** Sole constructor. */
  public ParallelTaxonomyArrays() {}

  public abstract ChunkedArray parents();
  public abstract ChunkedArray children();
  public abstract ChunkedArray siblings();
}

Then within TaxonomyIndexArrays, we can focus on building int[][] instances that get wrapped in ChunkedArray. I feel like IntBlockPool might be overkill?
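One possible refinement, sketched here under the assumption that the chunk size is a power of two (for example 8192 = 2^13): the division in `get` can become a shift and a mask, and `length()` can report the exact number of ints when the last chunk is allowed to be shorter than the others. The class name `PowerOfTwoChunkedArray`, the constants, and the `allocate` helper are hypothetical and are not taken from Lucene or from any PR.

```java
/** Hypothetical read-only view over int[][] chunks; all names here are assumptions. */
final class PowerOfTwoChunkedArray {
  static final int CHUNK_BITS = 13;              // 2^13 = 8192 ints per chunk
  static final int CHUNK_SIZE = 1 << CHUNK_BITS;
  static final int CHUNK_MASK = CHUNK_SIZE - 1;

  private final int[][] chunks;

  /** Every chunk except the last must hold exactly CHUNK_SIZE ints. */
  PowerOfTwoChunkedArray(int[][] chunks) {
    this.chunks = chunks;
  }

  /** Allocates chunks for length ints; the last chunk is sized to fit exactly. */
  static int[][] allocate(int length) {
    int chunkCount = (length + CHUNK_SIZE - 1) >>> CHUNK_BITS;
    int[][] chunks = new int[chunkCount][];
    for (int c = 0; c < chunkCount; c++) {
      int remaining = length - (c << CHUNK_BITS);
      chunks[c] = new int[Math.min(CHUNK_SIZE, remaining)];
    }
    return chunks;
  }

  /** High bits select the chunk, low bits select the slot inside it. */
  int get(int i) {
    return chunks[i >>> CHUNK_BITS][i & CHUNK_MASK];
  }

  /** Exact number of ints, counting a possibly shorter last chunk. */
  int length() {
    if (chunks.length == 0) {
      return 0;
    }
    return ((chunks.length - 1) << CHUNK_BITS) + chunks[chunks.length - 1].length;
  }
}
```

Sizing the last chunk to fit trades a little bookkeeping in `length()` for zero over-allocation; keeping every chunk full-size is simpler but wastes up to one chunk.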

msfroh commented on May 21, 2024

I ended up running with that idea (sort of) and implemented this: #12995

The unit tests pass, but I don't think any of them allocate more than 8192 ordinals (the size of chunk that I set).

stefanvodita commented on May 21, 2024

Thanks @msfroh! The PR looks neat and you might be right that, while IntBlockPool basically maintains an int[][] like our ChunkedArray, it is a bit inconvenient to work with.
I left more detailed comments on the PR, but the high-level question is if we've actually reduced the memory footprint during taxonomy refreshes. It's very possible I'm missing something, but right now it looks to me like we haven't improved on that front. Doing shallow copies of the old array without allocating new memory would solve it though.

msfroh commented on May 21, 2024

> It's very possible I'm missing something, but right now it looks to me like we haven't improved on that front. Doing shallow copies of the old array without allocating new memory would solve it though.

What you've missed is that I'm a big dum-dum 😁

Thanks for catching that! I refactored some code into a shared method (between the "reuse old arrays" case and the "start fresh with a TaxonomyReader" case) and foolishly applied the "start fresh" logic every time. I've fixed it in a subsequent commit (allocating chunks only starting from the index of the last chunk of the old array).
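For readers who want to see the shape of that fix, here is a rough sketch of growing a chunked array on refresh while reusing the old chunks. It is not the code from the PR: the class `ChunkReuse`, the method `grow`, and the constant are hypothetical, and the sketch assumes that `newLength` is at least the old length and that existing entries are never rewritten after a refresh.

```java
/** Hypothetical helper; names and the CHUNK_SIZE constant are assumptions, not the PR's code. */
final class ChunkReuse {
  static final int CHUNK_SIZE = 8192;

  /**
   * Returns chunks able to hold newLength ints. Every old chunk before the last one is shared
   * by reference (a shallow copy). Only the old last chunk, which may be partial, is copied
   * into a newly allocated chunk, and any further chunks are newly allocated.
   * Assumes newLength is at least the total length of the old chunks.
   */
  static int[][] grow(int[][] old, int newLength) {
    int chunkCount = (newLength + CHUNK_SIZE - 1) / CHUNK_SIZE;
    int[][] grown = new int[chunkCount][];
    int lastOld = old.length - 1;

    // Share all old chunks before the last one: no new int[] is allocated for them.
    for (int c = 0; c < lastOld; c++) {
      grown[c] = old[c];
    }

    // Allocate starting from the index of the old last chunk.
    for (int c = Math.max(lastOld, 0); c < chunkCount; c++) {
      int len = Math.min(CHUNK_SIZE, newLength - c * CHUNK_SIZE);
      grown[c] = new int[len];
      if (c == lastOld) {
        // Carry over the values of the old last chunk; the old array stays untouched.
        System.arraycopy(old[c], 0, grown[c], 0, old[c].length);
      }
    }
    return grown;
  }
}
```

In this sketch a refresh allocates memory proportional to the number of new entries plus at most one chunk, rather than to the total size of the array.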

I also incorporated several of the other changes that you suggested. Thanks a lot!
