Comments (18)
Hmm, weird --- what's the context? Are you running out of GPU memory by any chance?
Also, `stride` does indeed happen to be defined:

```python
buffers[device] = [torch.zeros(max_bsize, stride, self.dim, dtype=dtype,
                               device=device, pin_memory=(device == 'cpu'))
                   for stride in self.strides]
```

It's the `for stride in self.strides` part.
from colbert.
Sorry about that, my mistake :)
But regarding the error, I'm definitely not running out of GPU memory: these tests are being run on a very small index.
No worries! Could you share more info, like commands or data details?
Also I just pushed a minor change. See if that makes a difference!
Also, your error says line 57, but the relevant code is around line 52. Has the code been changed?
I added some print statements for debugging purposes, which is why the line number is different.
I made a small index for the purpose of testing the full reranker against my own reranker, since the two are giving different results and I am trying to debug. Technically, my official code does not use the `IndexRanker` at all, but this issue is still relevant to my debugging and to the work of others. Weirdly, this issue only arises about half the time. You can see below that a first instantiation of `IndexPart` works, while a second identical instantiation throws the error.
Is this the first time you've used this code, or did it work before?
No, I have used `score_by_range` several times, which uses `IndexPart`.
To your point, I never got this error until I started testing by instantiating an `IndexRanker` directly. Do you think there's some environment variable or something that goes awry when I do this?
Oh! You're calling the API directly, not using the command line? Isn't that something that deserves to be mentioned upfront in the post, haha? This is certainly connected to the error, yes.
Could you show me how you create the object? Also, can you check whether you get any errors when you use the standard way?
Sorry, I should've been more explicit. See the error above for an example of how I create an `IndexPart` object, which throws the error about 50% of the time.
For a bit of feedback: I don't find the command-line scripts particularly useful when the repo is part of a larger task. The scripts are useful for building the index itself, but I would prefer to call retrieval and reranking from Python code. In general, this repo seems to be written so that the end user merely observes that ColBERT works, not so that they can extend ColBERT into a stack that does retrieval and then uses the retrieved results to do other things.

I wish it were written more like an API, where the retrieval and reranking scripts were transformed into, say, Python classes: you could instantiate a retriever-reranker, then call `.retrieve(query_text: str, n_docs: int)` to get `n_docs` relevant pids for the query. That would be perfect for my use case.
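For concreteness, here is a hypothetical sketch of the kind of interface I mean. The class name, constructor arguments, and placeholder return value are all illustrative, not part of the actual repo:

```python
class RetrieverReranker:
    """Hypothetical wrapper class; names and signatures are illustrative only."""

    def __init__(self, index_path: str, checkpoint_path: str):
        self.index_path = index_path
        self.checkpoint_path = checkpoint_path
        # A real implementation would load the index and the ColBERT
        # checkpoint here, once, at construction time.

    def retrieve(self, query_text: str, n_docs: int) -> list:
        # A real implementation would encode query_text, search the
        # index, rerank the candidates, and return their pids.
        # Placeholder result so this sketch is runnable:
        return list(range(n_docs))

ranker = RetrieverReranker("/path/to/index", "/path/to/checkpoint")
print(ranker.retrieve("what is ColBERT?", 3))  # [0, 1, 2]
```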
> I wish it was written more like an API, where the retrieval and reranking scripts were transformed into, say, python classes, where you can instantiate a retriever-reranker, then call `.retrieve(query_text: str, n_docs: int)` to get `n_docs` relevant pids given the query.
Indeed, that's a goal of mine too! I think you can do this now, but it's challenging. I know of a few cases where people are doing this currently.

That said, if you want to help contribute to changing up the API, I'd love to collaborate on that.

In the meantime, happy to help as you try to use the current setup. You should start from `colbert/ranking/retrieve.py` and just tweak the loop linked from the README so it uses your own queries. That works for people!
Regarding whether I get errors the standard way: I've been running the standard way all day with no error that I can recall.
> That said, if you want to help contribute to changing up the API, I'd love to collaborate on that.
I think I would be up for contributing, assuming we decide to stick with ColBERT (I think we will).
> In the meantime, happy to help as you try to use the current setup. You should start from `colbert/ranking/retrieve.py` and just tweak the loop linked from the README so it uses your own queries. That works for people!
Yes, I'm currently doing something fairly similar, which works end-to-end. That said, I often want to issue one query at a time, which inspired a separate issue I submitted last week. That isn't very feasible given that I can't load the data into memory. Because of that, I ended up implementing a second reranking path, for reranking not en masse, that largely doesn't rely on this repo (and which I am currently testing against the repo's reranking, which led to this thread).
I have narrowed down the problem a lot. I get the error in the following situation:

You can see that the error doesn't arise when `pin_memory=False`, so it seems `pin_memory` can't be true on my machine for a tensor of that size. (Note: I have verified that `pin_memory=True` runs successfully with a smaller first dimension.)

I'm not sure it's necessary to set that first dimension to `1<<14`. It seems like overkill in many cases. Do you agree?
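For reference, a minimal sketch of the allocation in question; `1<<14` is the first dimension discussed above, while the `stride` and `dim` values here are small made-up stand-ins:

```python
import torch

max_bsize = 1 << 14   # 16384: the large first dimension in question
stride, dim = 4, 128  # small stand-in values for illustration

# With pin_memory=False this is an ordinary pageable allocation.
# With pin_memory=True, the same call requests page-locked host memory,
# which the driver can refuse for large buffers on some machines even
# when plenty of regular RAM is free.
buf = torch.zeros(max_bsize, stride, dim, dtype=torch.float16, pin_memory=False)
print(tuple(buf.shape))  # (16384, 4, 128)
```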
> This isn't very feasible given I can't load the data into memory.
That seems to be the root of your problems. Have you checked out our quantization stuff?
One recent thing (inspired by the BPR paper, to appear at ACL'21) is that we found the ColBERT vectors retain almost all of their quality if you just keep the sign of each dimension, with no other changes. Basically, for each vector, keep 128 bits: +1 if positive and -1 [not zero] if negative. Do this just for the document side, and keep the query side as usual. Do it after you have already constructed the FAISS index from uncompressed embeddings. Interestingly, you will preserve almost all of ColBERT's quality, but the index is 16x smaller.

We want to merge this soon (though "soon" might be a month or two out), but you can do it yourself. You just need to figure out how to store each bit as 1/0 and then cast to float16 with +1 and -1, so you can multiply with the query embeddings.
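A rough sketch of that sign-only compression, using random stand-in embeddings (the shapes are illustrative, and float32 is used here for CPU portability rather than the float16 mentioned above):

```python
import numpy as np
import torch

torch.manual_seed(0)
doc_embs = torch.randn(1000, 128)  # stand-in document token embeddings

# Compress: keep only the sign of each dimension -> 128 bits (16 bytes)
# per vector, a 16x reduction relative to 128 float16 values (256 bytes).
bits = (doc_embs > 0).numpy()       # boolean, one bit of signal per dim
packed = np.packbits(bits, axis=1)  # shape (1000, 16), dtype uint8

# Decompress at scoring time: bit 1 -> +1.0, bit 0 -> -1.0.
unpacked = np.unpackbits(packed, axis=1).astype(np.float32)
signs = torch.from_numpy(unpacked * 2.0 - 1.0)  # values in {-1.0, +1.0}

# Query embeddings stay uncompressed; score with inner products as usual.
q_embs = torch.randn(32, 128)
scores = q_embs @ signs.T           # shape (32, 1000)
```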
> I'm not sure it's necessary to set that first dimension to `1<<14`. It seems like overkill in many cases. Do you agree?
Yes, it's overkill for most! Smaller would be fine!
I have things working properly now, so thank you for your help! The only remaining issue is that my fast single-query retrieval-reranking method uses `.reconstruct(id)`, which gives an approximate vector. This means my scores are slightly off from what the repo returns. This is only the case because of the product quantization, which I guess is something I have to live with (or maybe we can add handling for it in the API).

A couple more things that would be good in the interfacing class for the API (I am probably happy to help):

(1) The ability to turn off those print statements, or perhaps connect them to an upstream logging level.
(2) Fast single-query retrieval, like the one I implemented.
(3) As I said above, the ability to use regular IVF instead of IVFPQ, so that the fast single-query method can be exact.