
jasonjmcghee / portable-hnsw

76 stars · 3 watching · 2 forks · 110.47 MB

What if an HNSW index was just a file, and you could serve it from a CDN, and search it directly in the browser?

Home Page: https://jasonjmcghee.github.io/portable-hnsw/

License: MIT License

Python 41.91% HTML 48.10% CSS 9.99%
hnsw knn portable similarity-search

portable-hnsw's Introduction

Portable HNSW

To build your own index:

poetry install
poetry run python build_index.py <path to text file> [output folder]

Or jump into the code for more complex use cases.

Then push it to a GitHub repo and enable GitHub Pages. You can add or edit the index.html, or test it by pasting the link to the folder into the "path" input on the home page.

Note: rangehttpserver (run with python -m RangeHTTPServer) works well as a simple local server that supports the range requests DuckDB needs for parquet + large indices.


So what's going on here?

Yeah - fair question.

So I had this idea.

What if an HNSW index (hierarchical navigable small world graphs - a good way to enable searching for stuff by their underlying meaning) was just a file, and you could serve it from a CDN, and search it directly in the browser?

And what if you didn't need to load the entire thing in memory, so you could search a massive index without being RAM rich?

That would be cool.

A vector store without a server...

So yeah. Here's a proof of concept.


There's a Python file called build_index.py that builds an index using a custom HNSW implementation that can be serialized to a couple of parquet files.

There are very likely bugs and performance problems. But it's within an order of magnitude or two of hnswlib, which was fast enough that my development cycle wasn't impacted by repeatedly re-indexing the same files while building the search and front-end bits. I welcome pull requests that fix the problems and make it halfway reasonable.

Then I wrote a webpage that uses transformers.js, DuckDB, and some SQL to read the parquet files, search them (an approximation of HNSW nearest-neighbor search), and retrieve the associated text.
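The core of that search, stripped of the DuckDB and transformers.js plumbing, is a greedy walk over the graph toward the query. A minimal single-layer sketch in Python (the actual search descends through multiple layers and keeps a candidate list, and the site runs the equivalent logic in JS):

```python
# Greedy best-first graph search with cosine distance: repeatedly move to
# whichever neighbor of the current node is closest to the query.
import numpy as np

def greedy_search(vectors, neighbors, query, entry=0):
    """vectors: (n, d) array; neighbors: dict node -> list of neighbor ids.
    Returns (best_node, best_distance)."""
    def dist(i):
        v = vectors[i]
        return 1.0 - np.dot(v, query) / (np.linalg.norm(v) * np.linalg.norm(query))
    current, best = entry, dist(entry)
    improved = True
    while improved:
        improved = False
        for n in neighbors[current]:
            d = dist(n)
            if d < best:
                current, best = n, d
                improved = True
    return current, best
```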

A big part of the original idea was how this could scale to massive indices.

So, I also tested using parquet range requests, retrieving only what we need from the parquet file, and it worked! But since the index is only about 100MB, and each range request adds overhead, loading it all into memory was about twice as fast. Still, it means you could have a 1TB index and it would (theoretically) still work, which is pretty crazy.

You can try this yourself by swapping out the nodes.parquet bits in the SQL for read_parquet('${path}/nodes.parquet'). And the same with edges. DuckDB takes care of the rest.


Anyway, would love feedback and welcome contributions.

It was a fun project!

portable-hnsw's People

Contributors

jasonjmcghee


Forkers

maiquangtuan id-2

portable-hnsw's Issues

You should be able to update existing indexes

Currently you have to rebuild indexes any time you want to update them. That's too bad!

A naive solution shouldn't be too difficult...

  • add loading of indices (read the parquet files, set the fields, initialize the data)

  • when adding elements, increase max_elements, reinitialize the data matrix to the new size, copy over the existing elements, and insert the new items.
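The grow-and-copy step from the second bullet is straightforward in numpy; a sketch (the real index class and field names may differ):

```python
# Reallocate the (max_elements, dim) data matrix to a larger size,
# copying existing rows and leaving the new rows zeroed for future inserts.
import numpy as np

def grow(data, new_max_elements):
    old_n, dim = data.shape
    if new_max_elements <= old_n:
        return data
    grown = np.zeros((new_max_elements, dim), dtype=data.dtype)
    grown[:old_n] = data
    return grown
```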

WASM

We should consider porting the core front-end logic to WebAssembly to improve search time.

The distance function and moving data from DuckDB to JS are both expensive.

Both could likely benefit from WASM.

Note: executing the distance function in duckdb is another potential approach.
