
josehu07 / diskann


This project forked from microsoft/diskann


Scalable graph based indices for approximate nearest neighbor search

License: Other

Shell 0.57% C++ 96.18% C 0.03% CMake 2.09% Dockerfile 0.03% Python 1.10%

diskann's Introduction

DiskANN with TensorStore Backend

UW-Madison CS744, Fall 2022

TensorStoreANN

Benefits of using TensorStore as the index storage backend:

  • Shareable index files across multiple array formats with a uniform API
  • Asynchronous I/O for high-throughput access
  • Automatic handling of data caching
  • Controlled concurrent I/O with remote storage backend
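As a rough illustration of the uniform API and cache handling listed above, the sketch below builds a TensorStore spec for one zarr array that lives either on local disk or behind an HTTP server. The helper name `make_zarr_spec`, its parameters, and the example values are assumptions made for illustration; they are not taken from this repo's code, which may configure TensorStore differently. The spec keys (`driver`, `kvstore`, `base_url`, `context.cache_pool.total_bytes_limit`) follow the TensorStore spec documentation.

```python
def make_zarr_spec(location, remote=False, cache_bytes=0):
    """Build a TensorStore spec (a plain dict) for one zarr array.

    location:    local directory, or full URL of the array when remote=True
    remote:      read through the HTTP key-value store instead of local files
    cache_bytes: size limit of the in-memory cache pool (0 disables caching)

    Hypothetical helper: the names and defaults here are illustrative only.
    """
    # A zarr array is a directory of chunk files, so keep a trailing slash.
    if not location.endswith("/"):
        location += "/"
    if remote:
        kvstore = {"driver": "http", "base_url": location}
    else:
        kvstore = {"driver": "file", "path": location}
    return {
        "driver": "zarr",
        "kvstore": kvstore,
        # The cache pool is a context resource shared by arrays opened with it.
        "context": {"cache_pool": {"total_bytes_limit": cache_bytes}},
    }
```

With the `tensorstore` Python package installed, `tensorstore.open(spec).result()` should open the array, and `array[0:10].read()` returns a future, so several reads can be in flight before any of them is waited on. The same calling code works for the local and remote specs, which is the point of the uniform API.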

Build

On a CloudLab Ubuntu 20.04 machine:

  • Install necessary DiskANN dependencies (see original README below)
  • Install the gcc suite, version 10.x or later, and set it as the default compiler
  • Install cmake version >= 3.24
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)

Note that an Internet connection is required for the build: the CMake configuration uses CMake's FetchContent module to download tensorstore from our forked GitHub repo, along with its dependencies, over the network.

Run

For help messages:

./scripts/run.py [subcommand] -h

Parse Sift dataset fvecs into fbin format:

./scripts/run.py to_fbin --sift_base /mnt/ssd/data/sift-small/siftsmall --dataset /mnt/ssd/data/sift-tiny/sifttiny [--max_npts 1000]
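For readers unfamiliar with the two layouts, the following is a minimal sketch of what such a conversion involves. It is not the code behind `to_fbin`; the function name `fvecs_to_fbin` is hypothetical. It assumes the usual definitions: an fvecs file stores each vector as a little-endian int32 dimension followed by that many float32 values, and an fbin file stores an int32 point count and an int32 dimension followed by all vectors packed back to back.

```python
import struct


def fvecs_to_fbin(fvecs_path, fbin_path, max_npts=None):
    """Convert an .fvecs file to the .fbin layout; returns (npts, dim).

    Hypothetical helper for illustration, not the repo's implementation.
    """
    npts, dim = 0, None
    with open(fvecs_path, "rb") as src, open(fbin_path, "wb") as dst:
        # Reserve the 8-byte header; it is patched once npts is known.
        dst.write(struct.pack("<ii", 0, 0))
        while max_npts is None or npts < max_npts:
            head = src.read(4)
            if not head:
                break  # clean end of file
            if len(head) != 4:
                raise ValueError("truncated dimension field")
            (d,) = struct.unpack("<i", head)
            if d <= 0 or (dim is not None and d != dim):
                raise ValueError("invalid or inconsistent dimension: %d" % d)
            dim = d
            payload = src.read(4 * d)
            if len(payload) != 4 * d:
                raise ValueError("truncated vector")
            dst.write(payload)  # float32 values are copied byte for byte
            npts += 1
        # Go back and write the real header.
        dst.seek(0)
        dst.write(struct.pack("<ii", npts, dim or 0))
    return npts, dim or 0
```

The `max_npts` argument here mirrors the idea of the `--max_npts` option above: stop after a fixed number of points to produce a smaller dataset.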

Build the on-disk index from the learning input (this may take a long time):

./scripts/run.py build --dataset /mnt/ssd/data/sift-tiny/sifttiny

Convert on-disk index to zarr format tensors:

./scripts/run.py convert --dataset /mnt/ssd/data/sift-tiny/sifttiny

Run queries with different parameters:

./scripts/run.py query --dataset /mnt/ssd/data/sift-tiny/sifttiny [--k_depth 10] [--npts_to_cache 100] [--use_ts] [--ts_async] [-L 10 50 100]
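When comparing backends, it can help to generate the query invocations rather than type them by hand. The sketch below only assembles command lines from the flags shown above; the function name `query_commands` and the choice of three modes (original reader, TensorStore, TensorStore with async I/O) are assumptions for illustration, including the assumption that `--ts_async` is meant to be combined with `--use_ts`.

```python
import itertools
import shlex

# Backend modes to compare; assumed combinations, see the note above.
MODES = ((), ("--use_ts",), ("--use_ts", "--ts_async"))


def query_commands(dataset, k_depths=(10,), list_sizes=(10, 50, 100), modes=MODES):
    """Return one run.py query command (as an argv list) per (k_depth, mode)."""
    commands = []
    for k, mode in itertools.product(k_depths, modes):
        cmd = ["./scripts/run.py", "query", "--dataset", dataset,
               "--k_depth", str(k), *mode,
               "-L", *[str(size) for size in list_sizes]]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in query_commands("/mnt/ssd/data/sift-tiny/sifttiny"):
        # Print instead of executing, so the sweep can be reviewed first.
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run` as is; printing first makes it easy to check the sweep before spending time on it.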

Automated wrapper for run.py:

./scripts/run_tests.sh /mnt/ssd/data/gist/gist /mnt/ssd/result/gist
# the first argument is a path prefix of `*_learn.fbin`
# the second argument is the log directory

Plot bar graphs from the data generated by the run.py wrapper:

./scripts/plot.py /mnt/ssd/result/gist /mnt/ssd/result/gist/plots

To run TensorStore against a remote HTTP server, set up another node (assume its IP address is 10.10.1.2) and launch an HTTP server on it:

# at the parent directory of gist/
python3 -m http.server  # listens on port 8000 by default

Then, on the original node, run the script with the remote address specified:

./scripts/run.py query --dataset /mnt/ssd/data/gist/gist --k_depth 10 --list_sizes 10 --use_ts --use_remote http://10.10.1.2:8000/gist/gist

This loads the queries from local disk and reads the zarr tensors through TensorStore from the HTTP server at http://10.10.1.2:8000.
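If the remote run fails, a quick first step is to confirm that the query node can reach the HTTP server at all. The sketch below uses only the Python standard library; the function name `is_reachable` is hypothetical, and bypassing any configured proxy is an assumption that fits a private experiment network such as 10.10.1.x.

```python
import urllib.error
import urllib.request

# Ignore proxy settings: the server is assumed to sit on a private network.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP GET on `url` succeeds with a 2xx status."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers connection errors, HTTP error statuses, and malformed URLs.
        return False


if __name__ == "__main__":
    # The server root returns a directory listing under python3 -m http.server.
    print(is_reachable("http://10.10.1.2:8000/"))
```

This checks connectivity only; it says nothing about whether the zarr tensors are laid out where the query script expects them.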

TODO List

  • Converter from disk index to zarr tensors
  • Search path tensorstore reader integration
  • Allow turning on/off async I/O patterns for comparison
  • Allow turning on/off tensorstore cache pool for comparison (currently sees no effect, needs further study)
  • Using a remote storage backend

DiskANN - Original README

The goal of the project is to build scalable, performant, streaming and cost-effective approximate nearest neighbor search algorithms for trillion-scale vector search. This release has the code from the DiskANN paper published in NeurIPS 2019, the streaming DiskANN paper and improvements. This code reuses and builds upon some of the code for the NSG algorithm.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

See guidelines for contributing to this project.

Linux build:

Install the following packages through apt-get

sudo apt install make cmake g++ libaio-dev libgoogle-perftools-dev clang-format libboost-all-dev

Install Intel MKL

Ubuntu 20.04

sudo apt install libmkl-full-dev

Earlier versions of Ubuntu

Install Intel MKL either by downloading the oneAPI MKL installer or using apt (we tested with build 2019.4-070 and 2022.1.2.146).

# OneAPI MKL Installer
wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18487/l_BaseKit_p_2022.1.2.146.sh
sudo sh l_BaseKit_p_2022.1.2.146.sh -a --components intel.oneapi.lin.mkl.devel --action install --eula accept -s

Build

mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release .. && make -j 

Windows build:

The Windows version has been tested with Enterprise editions of Visual Studio 2022, 2019 and 2017. It should work with the Community and Professional editions as well without any changes.

Prerequisites:

  • CMake 3.15+ (available in Visual Studio 2019+ or from https://cmake.org)
  • NuGet.exe (install from https://www.nuget.org/downloads)
    • The build script will use NuGet to get MKL, OpenMP and Boost packages.
  • DiskANN git repository checked out together with submodules. To check out submodules after git clone:
git submodule init
git submodule update
  • Environment variables:
    • [optional] If you would like to override the Boost library listed in windows/packages.config.in, set BOOST_ROOT to your Boost folder.

Build steps:

  • Open the "x64 Native Tools Command Prompt for VS 2019" (or corresponding version) and change to DiskANN folder
  • Create a "build" directory inside it
  • Change to the "build" directory and run
cmake ..

OR for Visual Studio 2017 and earlier:

<full-path-to-installed-cmake>\cmake ..
  • This will create a diskann.sln solution. Open it from Visual Studio and build either the Release or Debug configuration.
    • Alternatively, use MSBuild:
msbuild.exe diskann.sln /m /nologo /t:Build /p:Configuration="Release" /property:Platform="x64"
    • This will also build the gperftools submodule for the libtcmalloc_minimal dependency.
  • Generated binaries are stored in the x64/Release or x64/Debug directories.

Usage:

Please see the following pages on using the compiled code:

diskann's People

Contributors

chenhao-ye, daxpryce, dengcai78, harsha-simhadri, hliu18, jigaoluo, josehu07, kiwichicken, ltan1ms, microsoft-github-operations[bot], microsoftopensource, philipbadams, rakri, shanewil, shikharj, theantony
