dmllr / fast-lopq

Fast C++ implementation of https://github.com/yahoo/lopq: Locally Optimized Product Quantization (LOPQ) model and searcher for approximate nearest neighbor search of high dimensional data.

License: Apache License 2.0

CMake 35.83% C++ 64.17%
lopq cpp17 cpp blaze cmake nearest-neighbor-search product-quantization clustering

fast-lopq's Introduction

Locally Optimized Product Quantization

Objectives

This is a C++ port of the searcher component of the Locally Optimized Product Quantization (LOPQ) code, designed for CPU. The original code is located at https://github.com/yahoo/lopq.

The searcher requires a model that has been pretrained with the original Python or Spark code and serialized in Protobuf format. Such a model can then be deployed in, e.g., search backends for high-performance approximate nearest neighbor search.

Overview

Locally Optimized Product Quantization (LOPQ) [1] is a hierarchical quantization algorithm that produces codes of configurable length for data points. These codes are efficient representations of the original vector and can be used in a variety of ways depending on the application, including as hashes that preserve locality, as a compressed vector from which an approximate vector in the data space can be reconstructed, and as a representation from which to compute an approximation of the Euclidean distance between points.

Training

Conceptually, the LOPQ quantization process can be broken into 4 phases. The training process also fits these phases to the data in the same order.

  1. The raw data vector is PCA'd to D dimensions (possibly the original dimensionality). This allows subsequent quantization to more efficiently represent the variation present in the data.
  2. The PCA'd data is then product quantized [2] by two k-means quantizers. This means that each vector is split into two subvectors each of dimension D / 2, and each of the two subspaces is quantized independently with a vocabulary of size V. Since the two quantizations occur independently, the dimensions of the vectors are permuted such that the total variance in each of the two subspaces is approximately equal, which allows the two vocabularies to be equally important in terms of capturing the total variance of the data. This results in a pair of cluster ids that we refer to as "coarse codes".
  3. The residuals of the data after coarse quantization are computed. The residuals are then locally projected independently for each coarse cluster. This projection is another application of PCA and dimension permutation on the residuals, and it is "local" in the sense that there is a different projection for each cluster in each of the two coarse vocabularies. These local rotations make the next and final step, another application of product quantization, very efficient in capturing the variance of the residuals.
  4. The locally projected data is then product quantized a final time by M subquantizers, resulting in M "fine codes". Usually the vocabulary for each of these subquantizers will be a power of 2 for effective storage in a search index. With vocabularies of size 256, the fine codes for each indexed vector will require M bytes to store in the index.
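The variance balancing described in step 2 can be illustrated with a small greedy sketch. It assumes the per-dimension variances after PCA are already known, and it assigns each dimension, largest variance first, to whichever half currently holds less total variance, while keeping both halves at D / 2 dimensions. The name `balance_dimensions` and the greedy strategy are assumptions made for illustration; the original training code may balance the subspaces differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns a permutation of dimension indices such that
// the first half and the second half carry roughly equal total variance.
std::vector<std::size_t> balance_dimensions(const std::vector<double>& variances) {
    const std::size_t d = variances.size();
    assert(d % 2 == 0 && "D must be even to split into two subvectors");

    // Visit dimensions in order of decreasing variance.
    std::vector<std::size_t> order(d);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return variances[a] > variances[b]; });

    std::vector<std::size_t> half[2];
    double total[2] = {0.0, 0.0};
    for (std::size_t dim : order) {
        // Prefer the half with less accumulated variance, unless it is full.
        std::size_t pick = (total[0] <= total[1]) ? 0 : 1;
        if (half[pick].size() == d / 2) pick = 1 - pick;
        half[pick].push_back(dim);
        total[pick] += variances[dim];
    }

    // Concatenate: the first D / 2 entries form subspace 0, the rest subspace 1.
    std::vector<std::size_t> permutation = half[0];
    permutation.insert(permutation.end(), half[1].begin(), half[1].end());
    return permutation;
}
```

For variances 8, 4, 2, 1 this sketch places dimensions 0 and 3 in the first subspace (total 9) and dimensions 1 and 2 in the second (total 6), which is the closest split the greedy rule finds under the equal-size constraint.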

The final LOPQ code for a vector is a (coarse codes, fine codes) pair, e.g. ((3, 2), (14, 164, 83, 49, 185, 29, 196, 250)).
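The product quantization used in steps 2 and 4 can be sketched as follows: the vector is split into equal subvectors, and each subvector is replaced by the index of its nearest centroid. This is a minimal sketch of the plain quantization step only; it leaves out PCA, the dimension permutation, and the local projections. The names `product_quantize` and `Codebooks`, and the one-byte code type (which assumes vocabularies of at most 256 entries), are illustrative assumptions and not the API of this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using Vector = std::vector<float>;
// codebooks[m][k] is centroid k of subquantizer m (hypothetical layout).
using Codebooks = std::vector<std::vector<Vector>>;

// Squared Euclidean distance between x[offset, offset + c.size()) and centroid c.
float squared_distance(const Vector& x, std::size_t offset, const Vector& c) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < c.size(); ++i) {
        const float diff = x[offset + i] - c[i];
        sum += diff * diff;
    }
    return sum;
}

// Splits x into codebooks.size() equal subvectors and returns, for each one,
// the index of its nearest centroid.
std::vector<std::uint8_t> product_quantize(const Vector& x, const Codebooks& codebooks) {
    const std::size_t m = codebooks.size();
    assert(m > 0 && x.size() % m == 0);
    const std::size_t sub_dim = x.size() / m;

    std::vector<std::uint8_t> codes(m);
    for (std::size_t s = 0; s < m; ++s) {
        // Each code must fit in one byte, so a vocabulary holds at most 256 centroids.
        assert(!codebooks[s].empty() && codebooks[s].size() <= 256);
        float best = std::numeric_limits<float>::max();
        for (std::size_t k = 0; k < codebooks[s].size(); ++k) {
            const float dist = squared_distance(x, s * sub_dim, codebooks[s][k]);
            if (dist < best) {
                best = dist;
                codes[s] = static_cast<std::uint8_t>(k);
            }
        }
    }
    return codes;
}
```

With two codebooks this yields a pair like the coarse codes above; with M codebooks applied to the locally projected residual it yields the M fine codes.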

Training code is not included in this repository. Please use the original Python code for the training process.

Nearest Neighbor Search

A nearest neighbor index can be built from these LOPQ codes by indexing each document into its corresponding coarse code bucket. That is, each pair of coarse codes (which we refer to as a "cell") will index a bucket of the vectors quantizing to that cell.
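As an illustration of this bucket layout, here is a minimal in-memory sketch that maps each cell to the items quantizing to it. The types `Cell`, `IndexedItem`, and `CellIndex` are hypothetical names chosen for this example; they are not taken from this repository, and a production index would likely use a more compact storage scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A cell is a pair of coarse codes, e.g. (3, 2).
using Cell = std::pair<std::size_t, std::size_t>;

struct IndexedItem {
    std::string id;                        // document identifier
    std::vector<std::uint8_t> fine_codes;  // M fine codes, one byte each
};

// Hypothetical in-memory index: each cell owns a bucket of items.
class CellIndex {
public:
    void add(const Cell& cell, IndexedItem item) {
        // Every indexed item carries its fine codes for later ranking.
        assert(!item.fine_codes.empty());
        buckets_[cell].push_back(std::move(item));
    }

    // Returns the bucket for a cell, or nullptr if no item quantizes to it.
    const std::vector<IndexedItem>* bucket(const Cell& cell) const {
        const auto it = buckets_.find(cell);
        return it == buckets_.end() ? nullptr : &it->second;
    }

    // Number of non-empty cells.
    std::size_t cell_count() const { return buckets_.size(); }

private:
    std::map<Cell, std::vector<IndexedItem>> buckets_;
};
```

Only the coarse codes decide which bucket an item lands in; the fine codes are stored alongside the item and are not needed until the ranking phase.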

At query time, an incoming query vector undergoes substantially the same process. First, the query is split into coarse subvectors and the distance to each coarse centroid is computed. These distances can be used to efficiently compute a priority-ordered sequence of cells [3] such that cells later in the sequence are less likely to have near neighbors of the query than earlier cells. The items in cell buckets are retrieved in this order until some desired quota has been met.
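The priority-ordered cell sequence can be sketched with a priority queue, in the spirit of the multi-sequence traversal from [3]. The sketch assumes the cost of a cell is the sum of the two coarse distances, sorts each distance list, and expands cells from the cheapest outwards. It uses a visited set for simplicity instead of the stricter push rule of the original algorithm, and the names `sorted_indices` and `ordered_cells` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <set>
#include <tuple>
#include <utility>
#include <vector>

using Cell = std::pair<std::size_t, std::size_t>;

// Returns the indices of `dist` sorted by increasing distance.
std::vector<std::size_t> sorted_indices(const std::vector<float>& dist) {
    std::vector<std::size_t> idx(dist.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(),
                     [&](std::size_t a, std::size_t b) { return dist[a] < dist[b]; });
    return idx;
}

// Returns up to `limit` cells ordered by increasing dist0[i] + dist1[j].
std::vector<Cell> ordered_cells(const std::vector<float>& dist0,
                                const std::vector<float>& dist1,
                                std::size_t limit) {
    assert(!dist0.empty() && !dist1.empty());
    const std::vector<std::size_t> order0 = sorted_indices(dist0);
    const std::vector<std::size_t> order1 = sorted_indices(dist1);

    // Entries are (cost, rank in order0, rank in order1); smallest cost on top.
    using Entry = std::tuple<float, std::size_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> frontier;
    std::set<std::pair<std::size_t, std::size_t>> seen;

    auto push = [&](std::size_t a, std::size_t b) {
        if (a < order0.size() && b < order1.size() && seen.insert({a, b}).second) {
            frontier.emplace(dist0[order0[a]] + dist1[order1[b]], a, b);
        }
    };

    std::vector<Cell> cells;
    push(0, 0);  // the cheapest cell pairs the nearest centroid of each vocabulary
    while (!frontier.empty() && cells.size() < limit) {
        const Entry top = frontier.top();
        frontier.pop();
        const std::size_t a = std::get<1>(top);
        const std::size_t b = std::get<2>(top);
        cells.emplace_back(order0[a], order1[b]);
        // The two successors of a popped cell become new candidates.
        push(a + 1, b);
        push(a, b + 1);
    }
    return cells;
}
```

A caller would walk the returned cells in order, fetching each cell's bucket until the desired quota of candidates has been collected, so only a small part of the V x V grid is ever touched.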

After this retrieval phase, the fine codes are used to rank by approximate Euclidean distance. The query is projected into each local space and the distance to each indexed item is estimated as the sum of the squared distances of the query subvectors to the corresponding subquantizer centroids indexed by the fine codes.
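The distance estimate described above can be sketched as a table lookup. The sketch assumes the query has already been projected into the relevant local space; it precomputes the squared distance from each query subvector to every subquantizer centroid, then sums one table entry per fine code. The names `build_distance_table` and `estimate_distance`, and the data layout, are assumptions for illustration rather than this repository's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vector = std::vector<float>;
// subquantizers[m][k] is centroid k of subquantizer m (hypothetical layout).
using Subquantizers = std::vector<std::vector<Vector>>;
// table[m][k] is the squared distance from query subvector m to centroid k.
using DistanceTable = std::vector<std::vector<float>>;

// Builds the lookup table for a query already projected into a local space.
DistanceTable build_distance_table(const Vector& projected_query,
                                   const Subquantizers& subquantizers) {
    const std::size_t m = subquantizers.size();
    assert(m > 0 && projected_query.size() % m == 0);
    const std::size_t sub_dim = projected_query.size() / m;

    DistanceTable table(m);
    for (std::size_t s = 0; s < m; ++s) {
        for (const Vector& centroid : subquantizers[s]) {
            assert(centroid.size() == sub_dim);
            float sum = 0.0f;
            for (std::size_t i = 0; i < sub_dim; ++i) {
                const float diff = projected_query[s * sub_dim + i] - centroid[i];
                sum += diff * diff;
            }
            table[s].push_back(sum);
        }
    }
    return table;
}

// Estimates the squared distance to one indexed item from its fine codes.
float estimate_distance(const DistanceTable& table,
                        const std::vector<std::uint8_t>& fine_codes) {
    assert(fine_codes.size() == table.size());
    float total = 0.0f;
    for (std::size_t s = 0; s < table.size(); ++s) {
        total += table[s][fine_codes[s]];
    }
    return total;
}
```

Because the table is built once and then reused for every item that shares the same local projection, ranking an item costs only M lookups and additions.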

NN search with LOPQ is highly scalable and has excellent properties in terms of both index storage requirements and query-time latencies when implemented well.

References

More information and performance benchmarks can be found at http://image.ntua.gr/iva/research/lopq/.

The original Python implementation can be found at https://github.com/yahoo/lopq.

  1. Y. Kalantidis and Y. Avrithis. Locally Optimized Product Quantization for Approximate Nearest Neighbor Search. CVPR 2014.
  2. H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. PAMI, 33(1), 2011.
  3. A. Babenko and V. Lempitsky. The inverted multi-index. CVPR 2012.

License

Code licensed under the Apache License, Version 2.0. See the LICENSE file for terms.

fast-lopq's Issues

save index method

Can you please share the save_index method from Python? I am not seeing it in the lopq repo.
Or could you explain the idea?
Another question, about meta: why do we need this meta for filtering?
Thanks

HELP! Cmake error

I am looking for a C++ implementation of LOPQ rather than the Python one.
Then I found fast-lopq.

I'm having a hard time building this using CMake.

Is there a way to solve this?

Selecting Windows SDK version 10.0.18362.0 to target Windows 10.0.19041.
Configuring done

CMake Error at cmake/ext-protobuf.cmake:175 (add_library):
Cannot find source file:
ext/protobuf/src/google/protobuf/compiler/code_generator.cc
Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
.hpp .hxx .in .txx
Call Stack (most recent call first):
CMakeLists.txt:18 (include)

CMake Error at cmake/ext-protobuf.cmake:5 (add_library):
Cannot find source file:
ext/protobuf/src/google/protobuf/any_lite.cc
Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
.hpp .hxx .in .txx
Call Stack (most recent call first):
CMakeLists.txt:18 (include)

CMake Error at cmake/ext-protobuf.cmake:282 (add_executable):
Cannot find source file:
ext/protobuf/src/google/protobuf/compiler/main.cc
Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
.hpp .hxx .in .txx
Call Stack (most recent call first):
CMakeLists.txt:18 (include)

CMake Error at cmake/ext-protobuf.cmake:175 (add_library):
No SOURCES given to target: libprotoc
Call Stack (most recent call first):
CMakeLists.txt:18 (include)

CMake Error at cmake/ext-protobuf.cmake:5 (add_library):
No SOURCES given to target: libprotobuf
Call Stack (most recent call first):
CMakeLists.txt:18 (include)

CMake Error at cmake/ext-protobuf.cmake:282 (add_executable):
No SOURCES given to target: protoc
Call Stack (most recent call first):
CMakeLists.txt:18 (include)
