
Comments (18)

okhat commented on June 29, 2024

Hmm, weird --- what's the context? Are you running out of GPU memory by any chance?

Also, stride happens to indeed be defined.

    buffers[device] = [torch.zeros(max_bsize, stride, self.dim, dtype=dtype,
                                   device=device, pin_memory=(device == 'cpu'))
                       for stride in self.strides]

It's bound by the for stride in self.strides at the end of the list comprehension.
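To illustrate the point with a minimal standalone snippet (not ColBERT code): the name bound by the for clause of a list comprehension is in scope for the expression part, so stride is indeed defined there.

```python
# Minimal illustration: the comprehension's `for stride in strides`
# binds `stride` for the expression on its left, one element per stride.
strides = [1, 2, 4]
dim = 3

buffers = [[0.0] * (stride * dim) for stride in strides]

print([len(b) for b in buffers])  # [3, 6, 12]
```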

from colbert.

JamesDeAntonis commented on June 29, 2024

Sorry about that, my mistake :)

But regarding the error, I'm definitely not running out of gpu memory, because these tests are being run on a very small index.


okhat commented on June 29, 2024

No worries! Could you share more info, like commands or data details?

Also I just pushed a minor change. See if that makes a difference!


okhat commented on June 29, 2024

Also, your error says line 57, but the relevant code is around line 52. Has the code been changed?


JamesDeAntonis commented on June 29, 2024

I added some print statements for debugging purposes, which is why the line number is different.

I made a small index for the purpose of testing the full reranker against my own reranker, since the two are giving different results and I am trying to debug. Technically, my official code does not use the IndexRanker at all, but this issue is still relevant for my debugging and the work of others. Weirdly, this issue only arises like half the time. You can see below that a first instantiation of IndexPart works, while the second identical instantiation throws the error.
[Screenshot: two identical IndexPart instantiations; the first succeeds, the second throws the error]


okhat commented on June 29, 2024

Is this the first time you use this code? Or did it work before?


JamesDeAntonis commented on June 29, 2024

No, I have used score_by_range several times, which uses IndexPart


JamesDeAntonis commented on June 29, 2024

To your point, I never got this error until I started testing by instantiating an IndexRanker directly. Do you think there's some environment variable or something that goes awry when I do this?


okhat commented on June 29, 2024

Oh! You're extracting the API directly, not using the command line?

Isn't that something that deserves to be mentioned upfront in the post, haha? This is certainly connected to the error, yes.


okhat commented on June 29, 2024

Could you show me how you create the object, etc.? Also, can you check whether you get any errors when you use the standard way?


JamesDeAntonis commented on June 29, 2024

Sorry, I should've been more explicit. See the above error for an example of how I create an IndexPart object, which throws the error about 50% of the time.

For a bit of feedback, I don't find the command line scripts particularly useful for use of the repo as part of a larger task. The scripts are useful for building the index itself, but I would prefer to call retrieval and reranking from python code. In general, this repo seems to be written in such a way that the end user is merely observing that ColBERT works, not for extending ColBERT into a stack that does retrieval and then uses the retrieved results to do other things. I wish it was written more like an API, where the retrieval and reranking scripts were transformed into, say, python classes, where you can instantiate a retriever-reranker, then call .retrieve(query_text: str, n_docs: int) to get n_docs relevant pids given the query. That would be perfect for my use case.
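To make the request concrete, here is a toy sketch of the kind of interface described above. The class name, the stub scorer, and the tiny in-memory index are all hypothetical stand-ins, not ColBERT's actual encoder or index; the point is only the shape of the .retrieve(query_text, n_docs) call.

```python
# Hypothetical sketch of a retriever-reranker interface. The "index"
# maps pid -> a set of terms, and the scorer is plain term overlap --
# stand-ins for real embeddings and MaxSim scoring.
from dataclasses import dataclass

@dataclass
class Retriever:
    index: dict  # pid -> set of terms (stub for a real ColBERT index)

    def _score(self, query_text: str, doc_terms: set) -> float:
        # Stub scorer: count of query terms appearing in the document.
        return len(set(query_text.lower().split()) & doc_terms)

    def retrieve(self, query_text: str, n_docs: int) -> list:
        # Score every document, return the n_docs best pids.
        scored = sorted(self.index.items(),
                        key=lambda kv: self._score(query_text, kv[1]),
                        reverse=True)
        return [pid for pid, _ in scored[:n_docs]]

index = {
    0: {"neural", "retrieval"},
    1: {"cooking", "pasta"},
    2: {"dense", "retrieval", "colbert"},
}
retriever = Retriever(index=index)
print(retriever.retrieve("colbert retrieval", n_docs=2))  # [2, 0]
```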


okhat commented on June 29, 2024

I wish it was written more like an API, where the retrieval and reranking scripts were transformed into, say, python classes, where you can instantiate a retriever-reranker, then call .retrieve(query_text: str, n_docs: int) to get n_docs relevant pids given the query.

Indeed, that's a goal of mine too! I think you can do this now but it's challenging. I know of a few cases where people are doing this currently.

That said, if you want to help contribute to changing up the API, I'd love to collaborate on that.

In the meanwhile, happy to help as you try to use the current setup. You should start from colbert/ranking/retrieve.py and just tweak the loop linked from the README so it uses your own queries. That works for people!


JamesDeAntonis commented on June 29, 2024

Regarding whether I get no errors the standard way, I have been running the standard way all day with no error that I can recall.


JamesDeAntonis commented on June 29, 2024

That said, if you want to help contribute to changing up the API, I'd love to collaborate on that.

I think I would be up for contributing, assuming we decide to stick with ColBERT (I think we will)

In the meanwhile, happy to help as you try to use the current setup. You should start from colbert/ranking/retrieve.py and just tweak the loop linked from the README so it uses your own queries. That works for people!

Yes, I'm doing something fairly similar currently, which works end-to-end. That said, I often want to query once at a time, which inspired a separate issue I submitted last week. This isn't very feasible given I can't load the data into memory. Because of that, I ended up having to implement a second reranking task, for reranking not-en-masse, that largely doesn't rely on this repo (which I am currently testing against the repo's reranking, leading to this current thread).


JamesDeAntonis commented on June 29, 2024

I have narrowed down the problem a lot:

I get the error in the following situation:

[Screenshot: the error occurs with pin_memory=True but not with pin_memory=False]

You can see that the error doesn't arise when pin_memory=False, so it seems pin_memory can't be true on my machine for a tensor of that size. (note: I have verified that pin_memory=True runs successfully for a smaller first dimension size)

I'm not sure it's necessary to set that first dimension to 1<<14. It seems like overkill in many cases. Do you agree?
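For context on why pinning can fail at that size: pin_memory=True requests page-locked host memory, which is a much scarcer resource than ordinary pageable memory, so a very large pinned allocation can fail where a smaller one succeeds. A back-of-envelope sketch of the footprint, with assumed dim, dtype, and stride values (not read from the repo):

```python
# Rough footprint of the pinned buffers: one buffer of shape
# (max_bsize, stride, dim) per stride. All concrete values below are
# illustrative assumptions.
max_bsize = 1 << 14          # 16384, the first dimension in question
dim = 128
bytes_per_elem = 2           # float16
strides = [31, 63, 127]      # hypothetical stride values

total = sum(max_bsize * s * dim * bytes_per_elem for s in strides)
print(total / 2**30)         # GiB of pinned memory requested at once
```

Under these assumptions the request is close to a gibibyte of page-locked memory, which plausibly exceeds what some machines will pin.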


okhat commented on June 29, 2024

This isn't very feasible given I can't load the data into memory.

That seems to be the root of your problems. Have you checked out our quantization work?

One recent thing (inspired by BPR paper, to appear at ACL'21) is that we found the ColBERT vectors to retain almost all of their quality if you just keep the sign of each dimension without any other changes. Basically, for each vector, keep 128 bits with +1 if positive and -1 [not zero] if negative. Do this just for the document side, and keep the query side as usual. Do this after you have already constructed the FAISS index from uncompressed embeddings. Interestingly, you will preserve almost all of the quality of ColBERT but the index is 16x smaller.

We want to merge this soon (though "soon" might be a month or two out), but you can do it yourself. You just need to figure out how to store each bit as 1/0 and then cast to float16 with +1 and -1 so you can multiply with the query embeddings.
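A plain-Python sketch of that sign-only scheme (a real implementation would use torch or numpy, and the zero case here is an assumption -- zeros are mapped to +1):

```python
# Pack each dimension's sign into one bit, then unpack back to +1/-1
# floats for multiplying with query embeddings. Plain-Python sketch only.

def pack_signs(vec):
    """One bit per dimension: 1 if the value is >= 0, else 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits

def unpack_signs(bits, dim):
    """Recover a +1/-1 vector from the packed bits."""
    return [1.0 if (bits >> i) & 1 else -1.0 for i in range(dim)]

vec = [0.3, -1.2, 0.0, -0.5]
packed = pack_signs(vec)
restored = unpack_signs(packed, len(vec))
print(restored)  # [1.0, -1.0, 1.0, -1.0]
```

With 128 dimensions this stores each document vector in 128 bits instead of 128 float16 values, hence the 16x smaller index.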


okhat commented on June 29, 2024

I'm not sure it's necessary to set that first dimension to 1<<14. It seems like overkill in many cases. Do you agree?

Yes, it's overkill for most! Smaller would be fine!


JamesDeAntonis commented on June 29, 2024

I have things working properly now, so thank you for your help! The only remaining issue is that my fast single-query retrieval-reranking method uses .reconstruct(id), which gives an approximate vector. This means my scores are slightly off from what the repo returns. This is only the case because of the product quantization, which I guess is something I have to live with (or maybe we can add handling for it in the api).

A couple more things that would be good in the interfacing class for the api (i am probably happy to help):
(1) ability to turn off those print statements, or perhaps connect them to upstream logging level
(2) fast single-query retrieval, like the one i implemented
(3) like i said above, ability to use regular ivf instead of ivfpq so that the fast single method can be exact
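A toy illustration of why reconstruct() is only approximate under product quantization (hypothetical one-dimensional codebook, not faiss): each stored value is replaced by its nearest centroid, so reconstructed vectors, and hence dot-product scores, differ slightly from the originals.

```python
# Each value is snapped to its nearest centroid; the reconstructed
# vector therefore scores slightly differently than the original.
centroids = [-1.0, 0.0, 1.0]

def quantize(x):
    return min(centroids, key=lambda c: abs(c - x))

vec = [0.9, -0.2, 0.4]
recon = [quantize(x) for x in vec]    # [1.0, 0.0, 0.0]
query = [1.0, 1.0, 1.0]

exact = sum(q * v for q, v in zip(query, vec))
approx = sum(q * v for q, v in zip(query, recon))
print(exact, approx)
```

A flat IVF index (no PQ) stores the original vectors, which is why the exact variant asked for in (3) would make the fast single-query path match the repo's scores.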

