criteo / autofaiss
Automatically create Faiss knn indices with the most optimal similarity search parameters.
Home Page: https://criteo.github.io/autofaiss/
License: Apache License 2.0
autofaiss/autofaiss/external/optimize.py
Lines 139 to 159 in d5c773f
While adding parts to the index, compute the ground truth on each part and merge them at the end.
Make it possible to compute the recall without loading the embeddings twice
first step that is worse: add an option to compute the medium metrics with the current state of the code
if enabled, avoid doing IVFFlat large trainings
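The merge step mentioned above can be sketched as follows. This is only an illustration in pure Python: `merge_ground_truth` is a hypothetical helper, not an autofaiss function, and it assumes each part has already produced its exact top-k as `(distance, id)` pairs where a smaller distance means a closer neighbor.

```python
import heapq


def merge_ground_truth(partial_results, k):
    """Merge per-part top-k lists of (distance, id) pairs into a global top-k.

    Each element of partial_results is the exact top-k computed on one part
    of the embeddings (hypothetical layout; smaller distance = closer).
    """
    all_pairs = (pair for part in partial_results for pair in part)
    return heapq.nsmallest(k, all_pairs)


# Top-3 neighbors of one query, computed independently on two parts.
part_a = [(0.10, 3), (0.40, 7), (0.90, 1)]
part_b = [(0.05, 12), (0.50, 15), (0.70, 11)]

merged = merge_ground_truth([part_a, part_b], k=3)
print(merged)
```

Because the global top-k is always contained in the union of the per-part top-k lists, only k candidates per part need to be kept in memory while the parts are being added.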
I am using autofaiss 2.14.0 and it works for some parts of the data I am working on, but not for others. I keep getting this error and I do not know where to look:
2022-04-21 17:46:40,649 [INFO]: There are 16325691 embeddings of dim 768
2022-04-21 17:46:40,653 [INFO]: >>> Finished "Reading total number of vectors and dimension" in 37.7308 secs
2022-04-21 17:46:40,653 [INFO]: Compute estimated construction time of the index 04/21/2022, 17:46:40
2022-04-21 17:46:40,659 [INFO]: -> Train: 16.7 minutes
2022-04-21 17:46:40,659 [INFO]: -> Add: 2.3 minutes
2022-04-21 17:46:40,659 [INFO]: Total: 19.0 minutes
2022-04-21 17:46:40,659 [INFO]: >>> Finished "Compute estimated construction time of the index" in 0.0057 secs
2022-04-21 17:46:40,659 [INFO]: Checking that your have enough memory available to create the index 04/21/2022, 17:46:40
2022-04-21 17:46:40,802 [INFO]: >>> Finished "Checking that your have enough memory available to create the index" in 0.1431 secs
2022-04-21 17:46:40,803 [INFO]: >>> Finished "Launching the whole pipeline" in 37.8808 secs
Traceback (most recent call last):
File "process.py", line 26, in <module>
chunks_to_precalculated_knn_(
File "/home/x_ehsdo/.local/lib/python3.8/site-packages/retro_pytorch/retrieval.py", line 373, in chunks_to_precalculated_knn_
index, embeddings = chunks_to_index_and_embed(
File "/home/x_ehsdo/.local/lib/python3.8/site-packages/retro_pytorch/retrieval.py", line 334, in chunks_to_index_and_embed
index = index_embeddings(
File "/home/x_ehsdo/.local/lib/python3.8/site-packages/retro_pytorch/retrieval.py", line 288, in index_embeddings
build_index(
File "/home/x_ehsdo/.local/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 224, in build_index
necessary_mem, index_key_used = estimate_memory_required_for_index_creation(
File "/home/x_ehsdo/.local/lib/python3.8/site-packages/autofaiss/external/build.py", line 46, in estimate_memory_required_for_index_creation
index_key = get_optimal_index_keys_v2(
IndexError: list index out of range
simply call faiss.extract_index_ivf(index).set_direct_map_type(faiss.DirectMap.Array)
under an option before starting the .add there https://github.com/criteo/autofaiss/blob/master/autofaiss/external/build.py#L171
s3 and hdfs are high latency, high bandwidth file systems
On these fs, fetching files sequentially is slow
Today our embedding iterator reads files sequentially
This could be made faster by reading files in parallel, or even parts of files in parallel, using pyarrow readers that include threads internally
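A minimal sketch of the parallel-read idea, using only the standard library. Local temporary files stand in for s3/hdfs objects, and `read_file` / `read_files_in_parallel` are hypothetical names, not part of autofaiss; a real implementation would go through the remote file system client or the pyarrow readers mentioned above.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read one file fully; on a remote fs this call is dominated by latency."""
    with open(path, "rb") as f:
        return f.read()


def read_files_in_parallel(paths, max_workers=8):
    """Fetch several files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(read_file, paths))


# Demo with local files standing in for remote embedding files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(4):
        path = os.path.join(tmp, f"part_{i}.bin")
        with open(path, "wb") as f:
            f.write(bytes([i]) * 3)
        paths.append(path)
    contents = read_files_in_parallel(paths)

print([len(c) for c in contents])
```

Threads are enough here because the work is I/O bound: while one request waits on the network, the others make progress.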
Hi, thanks for this library, it really helps when working with faiss! One minor problem I have is that I would like to control the verbosity of the messages, since I use autofaiss in my own library. The simplest way to do that would probably be through the use of Python's logging module.
Is there anything planned in that regard?
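For illustration, this is how a caller could tune verbosity if the library emits its messages through a named logger. The logger name "autofaiss" is an assumption here (it matches the `INFO:autofaiss:` prefix seen in some logs on this page, but may not hold for every version).

```python
import logging

# Application-wide default: show INFO and above.
logging.basicConfig(level=logging.INFO)

# Assumption: the library logs through a logger named "autofaiss".
autofaiss_logger = logging.getLogger("autofaiss")
autofaiss_logger.setLevel(logging.WARNING)

autofaiss_logger.info("progress message")    # filtered out
autofaiss_logger.warning("something wrong")  # still displayed
```

With this pattern the host application decides what is printed, and the library itself never needs a verbosity flag.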
Currently merging takes a while because retrieving thousands of indices (with overhead) in a single node can be slow
this can be made faster by doing a 2 stage merge
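The 2-stage merge can be sketched like this. Plain Python lists stand in for faiss indices, and `merge_group` is a hypothetical placeholder for whatever actually merges a handful of small indices; only the grouping logic is the point.

```python
def merge_group(indices):
    """Stand-in for merging several small indices into one (here: list concat)."""
    merged = []
    for index in indices:
        merged.extend(index)
    return merged


def two_stage_merge(small_indices, group_size):
    """Stage 1 merges groups (one group per worker), stage 2 merges the results."""
    groups = [
        small_indices[start:start + group_size]
        for start in range(0, len(small_indices), group_size)
    ]
    # Stage 1: each group could be handled by a different executor in parallel.
    intermediate = [merge_group(group) for group in groups]
    # Stage 2: a single node only has to fetch len(groups) intermediate indices.
    return merge_group(intermediate)


small_indices = [[i] for i in range(10)]
final_index = two_stage_merge(small_indices, group_size=4)
print(final_index)
```

With thousands of small indices and a group size of around the square root of their count, the final node retrieves only a few dozen files instead of thousands.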
import numpy as np
from autofaiss import build_index
embeddings = np.float32(np.random.rand(700, 700))
build_index(
embeddings=embeddings, # type: ignore
index_path="knn.index",
index_infos_path="infos.json",
should_be_memory_mappable=True,
use_gpu=True,
)
On my A100, the use_gpu=True breaks the flow.
Currently the flow is:
It works well but requires a large amount of disk space
It's possible to instead do download -> convert -> add for each part of the embedding collection (and remove temporary files when doing the next part)
One way to do this could be to opensource the pyspark job doing this
It could also be possible to implement this directly in python here.
A simple way to do this could also be to have better support of remote file systems directly in quantize.
Currently merging in distributed mode requires storing the whole index in memory
Possible strategies:
TemporaryDirectory is a local folder which may not have any room
the user should specify what is the temporary folder (in fact we already have an option for this)
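As a small illustration of the point above, Python's `tempfile.TemporaryDirectory` accepts a `dir` argument, so the temporary folder can be created under a location chosen by the user. `make_working_dir` is a hypothetical helper, not the existing autofaiss option.

```python
import os
import tempfile


def make_working_dir(temporary_folder=None):
    """Create a self-cleaning working directory under `temporary_folder`.

    With temporary_folder=None, Python falls back to the system default
    location, which may not have enough room.
    """
    return tempfile.TemporaryDirectory(dir=temporary_folder)


# A freshly created folder stands in for a user-chosen location with enough space.
user_folder = tempfile.mkdtemp()
with make_working_dir(user_folder) as work_dir:
    created_inside = os.path.samefile(os.path.dirname(work_dir), user_folder)
still_exists = os.path.exists(work_dir)  # removed when the with-block exits
os.rmdir(user_folder)
print(created_inside, still_exists)
```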
Right now there does not seem to be an easy way to take an already-built index and add more embeddings to it (from the same distribution). This is obviously already indirectly possible with autofaiss, because distributed training already does it, and it is also something easily supported by the FAISS backbone. But I wonder if we can expose an easy interface to take a built index and add more features from a new set of embeddings (using all the bells and whistles provided by autofaiss/embedding-reader for reading embeddings from a numpy/parquet format). Perhaps an update_index interface?
Thanks!
to avoid reading the embeddings parquet a second time, we could consider extracting, yielding and saving the keys from the parquet files in the read embeddings function.
These keys could be saved either as parquet or in some format convenient for fast random access (e.g. arrow or hdf5 for one-way, leveldb for two-way).
That would probably be convenient but let's keep this for another PR
(Another option is to do this in another utility that would read only the key column, to be seen what is best)
When we produce N indices (with nb_indices_to_keep larger than 1), within the function optimize_and_measure_indices, we download N indices from remote in one shot (see here). If the machine running autofaiss has limited disk space, it fails with a No space left error.
https://github.com/horovod/horovod/blob/386be429b1417a1f6cb5e715bbe36efd2e74f402/horovod/spark/runner.py#L244 is a good trick to let the user build his own spark context
python 3.8.12
autofaiss 2.13.2 pypi_0 pypi
faiss-cpu 1.7.2 pypi_0 pypi
libfaiss 1.7.2 h2bc3f7f_0_cpu pytorch
First of all, thank you for the great project! I get the error: module 'faiss' has no attribute 'swigfaiss' when running the following command:
import autofaiss
autofaiss.build_index(
"embeddings.npy",
"autofaiss.index",
"autofaiss.json",
metric_type="ip",
should_be_memory_mappable=True,
make_direct_map=True)
The error appears when running it with make_direct_map=True.
Tested using conda 4.11.0 or mamba 0.15.3, using the pytorch or conda-forge channel.
for example if we want only ivf
this would require some c++ code, but it seems possible to have one stage filtering directly with faiss by overriding that scanner to add a masking feature (for example mask to keep only samples of a given language, or a given partner)
just tried it and the new estimation at https://github.com/criteo/autofaiss/pull/81/files doesn't fully capture the memory needed for training
when training an index such as OPQ32_224,IVF131072_HNSW32,PQ32x8
faiss trains the index in 2 steps
The first step seems to be indeed using the memory assumed by the current estimation (for example 21.5GB for 11M vectors of dimension 512) but then the second step uses some more ram.
I am not sure yet what these 2 steps are, but I'd guess something like a primary then a secondary index
Let's figure it out then add some more tests for this (could be scheduled tests instead of tests that run for every commit)
Currently efSearch is done in 2 stages:
same for the evaluation set
currently we use the first N vectors for both training and evaluation which is not ideal, especially if the embedding set is not randomly shuffled
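One way to avoid taking the first N vectors is to draw the training and evaluation ids at random, and keep the two sets disjoint. This is a standard-library sketch with a hypothetical helper name; it only produces row ids, and reading the corresponding vectors is left out.

```python
import random


def sample_train_and_eval_ids(nb_vectors, nb_train, nb_eval, seed=0):
    """Draw disjoint random row ids for the training and evaluation sets."""
    rng = random.Random(seed)
    # One draw without replacement guarantees the two sets do not overlap.
    picked = rng.sample(range(nb_vectors), nb_train + nb_eval)
    return sorted(picked[:nb_train]), sorted(picked[nb_train:])


train_ids, eval_ids = sample_train_and_eval_ids(nb_vectors=1000, nb_train=100, nb_eval=20)
print(len(train_ids), len(eval_ids), set(train_ids).isdisjoint(eval_ids))
```

Sorting the ids keeps reads mostly sequential, which matters when the embeddings live in large files.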
I have trained 3 different indices and every time, my 1-recall@20 values are exactly the same:
INFO:autofaiss: 1-recall@20: 0.802
INFO:autofaiss: 1-recall@40: 0.824
But there is some variation in the 20-recall and 40-recall scores.
3 digits of exactitude is too much.
What do you think about it?
Hi! I have numpy matrices that are saved as npz files. Unfortunately autofaiss supports only npy. Can you add that functionality?
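Until npz is supported, a possible workaround is to convert each archive to plain npy files first. The sketch below uses only NumPy; `npz_to_npy` is a hypothetical helper, and casting to float32 is an assumption based on faiss working with float32 vectors.

```python
import os
import tempfile

import numpy as np


def npz_to_npy(npz_path, output_dir):
    """Write every array stored in an .npz archive as its own .npy file."""
    written = []
    with np.load(npz_path) as archive:
        for name in archive.files:
            out_path = os.path.join(output_dir, f"{name}.npy")
            # Assumption: embeddings are wanted as float32.
            np.save(out_path, archive[name].astype(np.float32))
            written.append(out_path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    npz_path = os.path.join(tmp, "embeddings.npz")
    np.savez(npz_path, part0=np.ones((3, 4)), part1=np.zeros((2, 4)))
    out_dir = os.path.join(tmp, "npy")
    os.makedirs(out_dir)
    paths = npz_to_npy(npz_path, out_dir)
    names = [os.path.basename(p) for p in paths]
    shapes = [np.load(p).shape for p in paths]

print(names, shapes)
```

The resulting folder of npy files can then be passed as the embeddings input.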
Hello, I'm encountering an issue using autofaiss with flat indexes.
build_index raises an error (in my case, when embeddings are ndarray, I did not test with parquet embeddings) in distributed mode, for flat indexes. This error could be related to facebookresearch/faiss#1212 (method index.add_with_ids is not implemented for flat indexes).
import numpy as np
from autofaiss import build_index
build_index(
embeddings=np.ones((100, 512)),
distributed="pyspark",
should_be_memory_mappable=True,
index_path="hdfs://root/user/foo/knn.index",
index_key="Flat",
nb_cores=20,
max_index_memory_usage="32G",
current_memory_available="48G",
ids_path="hdfs://root/user/foo/test_indexing_out/ids",
temporary_indices_folder="hdfs://root/user/foo/indices/tmp/",
nb_indices_to_keep=5,
index_infos_path="hdfs://root/user/r.laby/test_indexing_out/index_infos.json",
)
raises
RuntimeError: Error in virtual void faiss::Index::add_with_ids(faiss::Index::idx_t, const float*, const idx_t*) at /project/faiss/faiss/Index.cpp:39: add_with_ids not implemented for this type of index
Is it expected? Or could this be fixed?
Thanks!
Merging could be made more efficient by not storing the training parameters (centroids) in each small index
This could reduce intermediary small indices size from 400GB to 40GB for 3B items
and make #55 not as necessary
this would make it possible to have stronger guarantees about how much memory autofaiss would use
that does not require the user to save their embedding to use autofaiss
make an example.py out of it
I wanted to build IVF262144_HNSW32,SQ8, but the estimation of the required memory was way too big.
might also be done for the index itself
would unlock training with more points and building larger indices with a lower memory
I want to ask whether doing KNN search with torch tensors is supported? Many thanks!
embeddings_path : str
Local path containing all preprocessed vectors and cached files.
Files will be added if empty.
output_path: str
Destination path of the quantized model on local machine.
index_key: Optional(str)
Optional string to give to the index factory in order to create the index.
If None, an index is chosen based on a heuristic.
index_param: Optional(str)
Optional string with hyperparameters to set to the index.
If None, the hyper-parameters are chosen based on a heuristic.
max_index_query_time_ms: float
Bound on the query time for KNN search, this bound is approximate
max_index_memory_usage: str
Maximum size allowed for the index, this bound is strict
current_memory_available: str
Memory available on the machine creating the index, having more memory is a boost
because it reduces the swapping between RAM and disk.
use_gpu: bool
Experimental, gpu training is faster, not tested so far
metric_type: str
Similarity function used for query:
- "ip" for inner product
- "l2" for Euclidean distance
it takes many minutes to run it
Hi,
thanks to all maintainers of this project, that's a great tool to streamline the building and tuning of a Faiss index.
I have a quick dumb question about the training of an index in distributed mode. Am I correct that the training is done on the host, i.e. non-distributed, and that only the adding/optimizing part is distributed? After a quick look at the code and doc, I feel like that's the case, right? If that's the case, would there be a possibility of training the index in a distributed fashion?
Hi!
According to the docs, faiss doesn't natively support cosine similarity as a distance metric. The closest one is inner product, which additionally requires the embedding vectors to be normalized beforehand. In the FAQ, the authors propose a way to do it manually with their function faiss.normalize_L2.
I have exactly the same case and would be glad if autofaiss had an optional flag to normalize vectors before building the index.
It seems to me that it's not so difficult: one would need to add faiss.normalize_L2 at each place where we iterate over embedding_reader. If so, I can make a PR.
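To make the request concrete, here is what the normalization does, written with NumPy as a stand-in for faiss.normalize_L2 (which works in place on float32 arrays). `l2_normalize` is a hypothetical helper; after it is applied, the inner product of two rows equals their cosine similarity.

```python
import numpy as np


def l2_normalize(embeddings):
    """Return a float32 copy with each row scaled to unit L2 norm."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged instead of dividing by 0
    return embeddings / norms


vectors = l2_normalize([[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]])

# Inner product of normalized rows = cosine similarity of the original rows.
cosine = float(vectors[0] @ vectors[1])
print(vectors[0], cosine)
```

The same normalization has to be applied to the query vectors at search time, otherwise the inner-product scores are no longer cosine similarities.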
the strategy of creating a few small indices limits the memory usage during adding and (if using the special merge-on-disk function) can completely cap the memory used by autofaiss in general, making it possible to create arbitrarily big indices with a fixed amount of RAM
let's use that strategy not only for pyspark mode, but even for the normal mode
adding N indices to normal mode should also be possible by reusing the code from distributed
some info at https://github.com/facebookresearch/faiss/tree/main/benchs/distributed_ondisk and https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors and https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/OnDiskInvertedLists.cpp
Hi! Thank you for the great project! Unfortunately I'm experiencing some issues, which could be caused by Windows (10 Pro) and I'm not sure how to solve them.
I installed autofaiss with conda into a new env with Python 3.6. First, I had problems with import:
ImportError: DLL load failed while importing _swigfaiss: The specified module could not be found.
I solved that by first installing openblas, numpy and faiss from conda-forge:
conda create --name faiss_env python=3.6
conda activate faiss_env
conda install conda-forge::blas=*=openblas
conda install -c conda-forge numpy
conda install -c conda-forge faiss
pip install autofaiss
Then I tried to run the example from README, but I have encountered an error in embedding_reader:
~\.conda\envs\faiss_env\lib\site-packages\embedding_reader\get_file_list.py in _get_file_list(path, file_format, sort_result)
42 path = make_path_absolute(path)
43 fs, path_in_fs = fsspec.core.url_to_fs(path)
---> 44 prefix = path[: path.index(path_in_fs)]
ValueError: substring not found
I found out that the problem is in the fsspec.core.url_to_fs method, namely in the private method _strip_protocol on line 402 of fsspec\core.py:
urlpath = fs._strip_protocol(url)
This line changes backward slashes to forward slashes and therefore the substring path_in_fs is not found in the string path.
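The failure described above can be reproduced without fsspec. In this sketch the two path strings are made-up examples of what the variables might contain on Windows, and `split_prefix_normalized` is only one possible workaround, not the fix adopted by embedding_reader.

```python
def split_prefix(path, path_in_fs):
    """Same slicing as the failing line: raises ValueError if the substring is missing."""
    return path[: path.index(path_in_fs)]


def split_prefix_normalized(path, path_in_fs):
    """Compare both strings with forward slashes so the substring can be found."""
    normalized_path = path.replace("\\", "/")
    normalized_in_fs = path_in_fs.replace("\\", "/")
    return normalized_path[: normalized_path.index(normalized_in_fs)]


# Hypothetical values: the original path keeps backslashes,
# while the path returned by the file system layer uses forward slashes.
path = "file://C:\\Users\\USER\\embeddings"
path_in_fs = "C:/Users/USER/embeddings"

try:
    split_prefix(path, path_in_fs)
    failed = False
except ValueError:
    failed = True

prefix = split_prefix_normalized(path, path_in_fs)
print(failed, prefix)
```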
Now comes the incomprehensible part: when I changed the private method _strip_protocol to the general method strip_protocol (I only deleted the leading underscore), the ValueError disappeared and the function preserved backward slashes in the path... but then another error appeared:
RuntimeError: Error in __cdecl faiss::FileIOWriter::FileIOWriter(const char *) at D:\a\faiss-wheels\faiss-wheels\faiss\faiss\impl\io.cpp:98: Error: 'f' failed: could not open C:\Users\USER\AppData\Local\Temp\tmp2jqscc1t for writing: Permission denied
This seems to me like a problem with parallelization and I don't know how to solve it. I suppose that my fix for the ValueError was not the correct one and there is still some problem with the Windows implementation.
Can you give me some advice on how to find a solution to this?
Thanks!
I am not sure whether I misunderstand something or there is an error, but when building my index, autofaiss reports Train: 16.7 minutes, yet the whole run takes ~11 secs (Finished "Launching the whole pipeline" in 11.1440 secs)?
Using 16 omp threads (processes), consider increasing --nb_cores if you have more
Launching the whole pipeline 01/28/2022, 08:15:47
There are 4269 embeddings of dim 1024
Compute estimated construction time of the index 01/28/2022, 08:15:47
-> Train: 16.7 minutes
-> Add: 0.0 seconds
Total: 16.7 minutes
>>> Finished "Compute estimated construction time of the index" in 0.0000 secs
Checking that your have enough memory available to create the index 01/28/2022, 08:15:47
20.6MB of memory will be needed to build the index (more might be used if you have more)
>>> Finished "Checking that your have enough memory available to create the index" in 0.0009 secs
Selecting most promising index types given data characteristics 01/28/2022, 08:15:47
>>> Finished "Selecting most promising index types given data characteristics" in 0.0000 secs
Creating the index 01/28/2022, 08:15:47
-> Instanciate the index HNSW15 01/28/2022, 08:15:47
>>> Finished "-> Instanciate the index HNSW15" in 0.0036 secs
The index size will be approximately 17.2MB
The memory available for adding the vectors is 7.0GB(total available - used by the index)
Will be using at most 1GB of ram for adding
-> Adding the vectors to the index 01/28/2022, 08:15:47
Using a batch size of 244140 (memory overhead 953.7MB)
100%|██████████| 1/1 [00:00<00:00, 74.53it/s]
>>> Finished "-> Adding the vectors to the index" in 0.1602 secs
>>> Finished "Creating the index" in 0.1647 secs
Computing best hyperparameters 01/28/2022, 08:15:47
>>> Finished "Computing best hyperparameters" in 3.3091 secs
The best hyperparameters are: efSearch=21
Compute fast metrics 01/28/2022, 08:15:50
2000
>>> Finished "Compute fast metrics" in 7.6499 secs
Saving the index on local disk 01/28/2022, 08:15:58
>>> Finished "Saving the index on local disk" in 0.0091 secs
Recap:
{'99p_search_speed_ms': 30.39110283832997,
'avg_search_speed_ms': 3.7983315605670214,
'compression ratio': 0.9678652870286923,
'index_key': 'HNSW15',
'index_param': 'efSearch=21',
'nb vectors': 4269,
'reconstruction error %': 0.0,
'size in bytes': 18066382,
'vectors dimension': 1024}
>>> Finished "Launching the whole pipeline" in 11.1440 secs
INFO:autofaiss: Computing best hyperparameters for index faiss_titles.faiss 05/05/2022, 07:16:53
WARNING:autofaiss:The maximum nearest neighbors coverage is 10.65% for this index. It means that when requesting 20 nearest neighbors, the average number of retrieved neighbors will be 2. The program will try to find the best hyperparameters to reach 95% of this max coverage at least, and then will optimize the search time for this target. The index search speed could be higher than the requested max search speed.
What can we do to prevent this?
This happened with "OPQ768_768,IVF262144_HNSW32,PQ768x8" -> bad max coverage
With the index_key "OPQ768_768,IVF262144_HNSW32,PQ768x4fsr", everything was ok. The vectors were just a bit too compressed.
My d is 768.
Thank you
it would significantly decrease the 8-byte overhead of each item
Storing 2^63 items in an index is not possible
Hello,
I'm currently running a workflow in argo which is generating several embedding files, in parallel, based on a database search.
If no data was found, the workflow returns an empty numpy file:
np.save(os.path.join(output, "features", filename), np.empty(0, np.float32))
Sadly, build_index is not capable of handling those files:
Using 4 omp threads (processes), consider increasing --nb_cores if you have more
Launching the whole pipeline 04/08/2022, 09:54:53
Reading total number of vectors and dimension 04/08/2022, 09:54:53
0%| | 0/16 [00:00<?, ?it/s]
19%|█▉ | 3/16 [00:00<00:00, 29.92it/s]
56%|█████▋ | 9/16 [00:00<00:00, 87.73it/s]
>>> Finished "Reading total number of vectors and dimension" in 0.1517 secs
>>> Finished "Launching the whole pipeline" in 0.1517 secs
Traceback (most recent call last):
File "/usr/local/bin/autofaiss", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 395, in main
fire.Fire({"build_index": build_index, "tune_index": tune_index, "score_index": score_index})
File "/usr/local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/usr/local/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 143, in build_index
nb_vectors, vec_dim = read_total_nb_vectors_and_dim(
File "/usr/local/lib/python3.8/site-packages/autofaiss/readers/embeddings_iterators.py", line 258, in read_total_nb_vectors_and_dim
for c in p.imap_unordered(file_to_line_count, file_paths):
File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 868, in next
raise value
File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.8/site-packages/autofaiss/readers/embeddings_iterators.py", line 252, in file_to_line_count
return matrix_reader.get_row_count()
File "/usr/local/lib/python3.8/site-packages/autofaiss/readers/embeddings_iterators.py", line 101, in get_row_count
return self.get_shape()[0]
It would be great if it could handle this, either by just showing a warning in the logs or via a flag to allow it.
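In the meantime, a caller-side workaround could be to filter out the empty files before building the index. The helper below is hypothetical and uses only NumPy; it loads each file fully for simplicity, whereas inspecting only the header would be cheaper for large files.

```python
import os
import tempfile

import numpy as np


def non_empty_embedding_files(paths):
    """Keep only .npy files that hold a 2-D array with at least one row."""
    kept = []
    for path in paths:
        array = np.load(path)  # simple but loads the whole file
        if array.ndim == 2 and array.shape[0] > 0:
            kept.append(path)
    return kept


with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.npy")
    full_path = os.path.join(tmp, "full.npy")
    # Same kind of empty file as the one produced by the workflow in the issue.
    np.save(empty_path, np.empty(0, np.float32))
    np.save(full_path, np.ones((2, 4), dtype=np.float32))
    kept = non_empty_embedding_files([empty_path, full_path])
    kept_names = [os.path.basename(p) for p in kept]

print(kept_names)
```

The kept files can then be copied or linked into a clean folder that is passed to build_index.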