
SemanticFinder - frontend-only live semantic search with transformers.js

Home Page: https://do-me.github.io/SemanticFinder/

License: MIT License


semanticfinder's Introduction

SemanticFinder

Frontend-only live semantic search and chat-with-your-documents built on transformers.js

Semantic search right in your browser! It calculates embeddings and cosine similarity client-side, without any server-side inference, using transformers.js and the latest SOTA embedding models from Hugging Face.
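As a rough illustration, the core idea can be sketched in a few lines with transformers.js (a minimal sketch, not the project's actual code; the model name, pooling options and helper function are just one reasonable choice):

import { pipeline } from '@xenova/transformers';

// Load a feature-extraction model once; it is downloaded from the Hugging Face Hub
// on first use and cached by the browser afterwards.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', { quantized: true });

// Mean-pooled, normalized embedding for the search query.
const output = await extractor('How big do hedgehogs get?', { pooling: 'mean', normalize: true });
const queryEmbedding = Array.from(output.data);

// Cosine similarity between two embeddings; with normalized vectors this is just the dot product.
function cosineSimilarity(a, b) {
    let dot = 0, magA = 0, magB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}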

Models

All transformers.js-compatible feature-extraction models are supported. A sortable, daily-updated list is available, and you can download the compatible-models table as xlsx, csv, json, parquet, or html here: https://github.com/do-me/trending-huggingface-models/.

Catalogue

You can use super fast pre-indexed examples for really large books like the Bible or Les Misérables with hundreds of pages and search the content in less than 2 seconds 🚀. Try one of these and convince yourself:

| filesize | textTitle | textAuthor | textYear | textLanguage | URL | modelName | quantized | splitParam | splitType | characters | chunks | wordsToAvoidAll | wordsToCheckAll | wordsToAvoidAny | wordsToCheckAny | exportDecimals | lines | textNotes | textSourceURL | filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4.78 | Das Kapital | Karl Marx | 1867 | de | https://do-me.github.io/SemanticFinder/?hf=Das_Kapital_c1a84fba | Xenova/multilingual-e5-small | True | 80 | Words | 2003807 | 3164 | | | | | 5 | 28673 | | https://ia601605.us.archive.org/13/items/KarlMarxDasKapitalpdf/KAPITAL1.pdf | Das_Kapital_c1a84fba.json.gz |
| 2.58 | Divina Commedia | Dante | 1321 | it | https://do-me.github.io/SemanticFinder/?hf=Divina_Commedia_d5a0fa67 | Xenova/multilingual-e5-base | True | 50 | Words | 383782 | 1179 | | | | | 5 | 6225 | | http://www.letteratura-italiana.com/pdf/divina%20commedia/08%20Inferno%20in%20versione%20italiana.pdf | Divina_Commedia_d5a0fa67.json.gz |
| 11.92 | Don Quijote | Miguel de Cervantes | 1605 | es | https://do-me.github.io/SemanticFinder/?hf=Don_Quijote_14a0b44 | Xenova/multilingual-e5-base | True | 25 | Words | 1047150 | 7186 | | | | | 4 | 12005 | | https://parnaseo.uv.es/lemir/revista/revista19/textos/quijote_1.pdf | Don_Quijote_14a0b44.json.gz |
| 0.06 | Hansel and Gretel | Brothers Grimm | 1812 | en | https://do-me.github.io/SemanticFinder/?hf=Hansel_and_Gretel_4de079eb | TaylorAI/gte-tiny | True | 100 | Chars | 5304 | 55 | | | | | 5 | 9 | | https://www.grimmstories.com/en/grimm_fairy-tales/hansel_and_gretel | Hansel_and_Gretel_4de079eb.json.gz |
| 1.74 | IPCC Report 2023 | IPCC | 2023 | en | https://do-me.github.io/SemanticFinder/?hf=IPCC_Report_2023_2b260928 | Supabase/bge-small-en | True | 200 | Chars | 307811 | 1566 | | | | | 5 | 3230 | state of knowledge of climate change | https://report.ipcc.ch/ar6syr/pdf/IPCC_AR6_SYR_LongerReport.pdf | IPCC_Report_2023_2b260928.json.gz |
| 25.56 | King James Bible | None | | en | https://do-me.github.io/SemanticFinder/?hf=King_James_Bible_24f6dc4c | TaylorAI/gte-tiny | True | 200 | Chars | 4556163 | 23056 | | | | | 5 | 80496 | | https://www.holybooks.com/wp-content/uploads/2010/05/The-Holy-Bible-King-James-Version.pdf | King_James_Bible_24f6dc4c.json.gz |
| 11.45 | King James Bible | None | | en | https://do-me.github.io/SemanticFinder/?hf=King_James_Bible_6434a78d | TaylorAI/gte-tiny | True | 200 | Chars | 4556163 | 23056 | | | | | 2 | 80496 | | https://www.holybooks.com/wp-content/uploads/2010/05/The-Holy-Bible-King-James-Version.pdf | King_James_Bible_6434a78d.json.gz |
| 39.32 | Les Misérables | Victor Hugo | 1862 | fr | https://do-me.github.io/SemanticFinder/?hf=Les_Misérables_2239df51 | Xenova/multilingual-e5-base | True | 25 | Words | 3236941 | 19463 | | | | | 5 | 74491 | All five acts included | https://beq.ebooksgratuits.com/vents/Hugo-miserables-1.pdf | Les_Misérables_2239df51.json.gz |
| 0.46 | REGULATION (EU) 2023/138 | European Commission | 2022 | en | https://do-me.github.io/SemanticFinder/?hf=REGULATION_(EU)_2023_138_c00e7ff6 | Supabase/bge-small-en | True | 25 | Words | 76809 | 424 | | | | | 5 | 1323 | | https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32023R0138&qid=1704492501351 | REGULATION_(EU)_2023_138_c00e7ff6.json.gz |
| 0.07 | Universal Declaration of Human Rights | United Nations | 1948 | en | https://do-me.github.io/SemanticFinder/?hf=Universal_Declaration_of_Human_Rights_0a7da79a | TaylorAI/gte-tiny | True | \nArticle | Regex | 8623 | 63 | | | | | 5 | 109 | 30 articles | https://www.un.org/en/about-us/universal-declaration-of-human-rights | Universal_Declaration_of_Human_Rights_0a7da79a.json.gz |

Import & Export

You can create indices yourself with one or two clicks and save them. If it's something private, keep it to yourself; if it's a classic book or something you think others might be interested in, consider a PR on the Hugging Face repo or get in touch with us. Book requests are happily met if you provide a good source link we can copy and paste from. Simply open an issue here titled [Book Request] or similar, or contact us.

It goes without saying that no discriminatory content will be tolerated.

Installation

Clone the repository and install dependencies with

npm install

Then run with

npm run start

If you want to build instead, run

npm run build

Afterwards, you'll find the index.html, main.css and bundle.js in dist.

Browser extension

Download the Chrome extension from the Chrome Web Store and pin it. Right-click the extension icon for options:

  • choose distiluse-base-multilingual-cased-v2 for multilingual usage (default is English-only)
  • set a higher number for min characters to split by for larger texts

Local build

If you want to build the browser extension locally, clone the repo, cd into the extension directory, then:

  • npm install
  • npm run build for a static build or
  • npm run dev for the auto-refreshing development version
  • go to Chrome extension settings with chrome://extensions
  • select Load Unpacked and choose the build folder
  • pin the extension in Chrome so you can access it easily. If it doesn't work for you, feel free to open an issue.

Speed

Tested on the entire book of Moby Dick: 660,000 characters, ~13,000 lines or ~111,000 words. Initial embedding generation takes 1-2 minutes on my old i7-8550U CPU with 1,000 characters as segment size. Subsequent queries take only ~2 seconds! If you want to query much larger texts or keep an entire library of books indexed, use a proper vector database instead.

Features

You can customize everything!

  • Input text & search term(s)
  • Hybrid search (semantic search & full-text search)
  • Segment length (the bigger the segments, the faster the indexing; the smaller, the slower)
  • Highlight colors (currently hard-coded)
  • The number of highlights is based on the threshold value: the lower the threshold, the more results.
  • Live updates
  • Easy integration of other ML-models thanks to transformers.js
  • Data privacy-friendly - your input text data is not sent to a server, it stays in your browser!

Usage ideas

  • Basic search through anything, like your personal notes (my initial motivation by the way, a huge notes.txt file I couldn't handle anymore)
  • Remember poem analysis in school? Often you look for possible leitmotifs or recurring categories, like food in Hänsel & Gretel

Future ideas

  • One could package everything nicely and use it e.g. instead of JavaScript search engines such as Lunr.js (also being used in mkdocs-material).
  • Integration in mkdocs (mkdocs-material), experimental (see the sketch after this list):
    • when building the docs, slice all .md files into chunks (length defined in mkdocs.yaml). Chunks should be fairly large (>800 characters) for lower response times. It's also possible to build n indices: first a coarse index (maybe per document/.md file, if the used model supports the length) and then a refined one for the document chunks
    • build the index by calculating the embeddings for all docs/chunks
    • when a user queries the docs, a switch can toggle between (fast) standard full-text search (atm with lunr.js) and experimental semantic search
    • if the latter is toggled, the client loads the model (all-MiniLM-L6-v2 is ~30 MB)
    • like in SemanticFinder, the embedding is created client-side and the cosine similarity calculated
    • the high-scored results are returned just like with lunr.js, so the user shouldn't even notice a difference in the UI
  • Electron- or browser-based apps could be augmented with semantic search, e.g. VS Code, Atom or mobile apps.
  • Integration in personal wikis such as Obsidian, tiddlywiki etc. would save you the tedious tagging/keywords/categorisation work or could at least improve your structure further
  • Search your own browser history (thanks @Snapdeus)
  • Integration in chat apps
  • Allow PDF-uploads (conversion from PDF to text)
  • Integrate with Speech-to-Text whisper model from transformers.js to allow audio uploads.
  • Thanks to CodeMirror one could even use syntax highlighting for programming languages such as Python, JavaScript etc.
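A hedged sketch of the query-time switch from the mkdocs idea above; docIndex (a lunr.js index), chunkEmbeddings, embedQuery and cosineSimilarity are hypothetical helpers here, not existing mkdocs-material or SemanticFinder APIs:

// Toggle between classic full-text search and client-side semantic search.
async function searchDocs(query, useSemanticSearch) {
    if (!useSemanticSearch) {
        // Standard full-text search, e.g. via the existing lunr.js index.
        return docIndex.search(query);
    }
    // Semantic path: embed the query in the browser and rank the
    // pre-built chunk embeddings by cosine similarity.
    const queryEmbedding = await embedQuery(query);
    return chunkEmbeddings
        .map(({ id, embedding }) => ({ ref: id, score: cosineSimilarity(queryEmbedding, embedding) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, 10);
}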

Logic

Transformers.js is doing all the heavy lifting of tokenizing the input and running the model. Without it, this demo would have been impossible.

Input

  • Text, as much as your browser can handle! The demo uses a part of "Hänsel & Gretel" but it can handle hundreds of PDF pages
  • A search term or phrase
  • The number of characters the text should be segmented into
  • A similarity threshold value. Results with a lower similarity score won't be displayed.

Output

  • Three highlighted string segments; the darker the highlight, the higher the similarity score.

Pipeline

  1. All scripts are loaded. The model is loaded once from Hugging Face and is cached in the browser afterwards.
  2. A user inputs some text and a search term or phrase.
  3. Depending on the approximate length to consider (unit = characters), the text is split into segments. Words themselves are never split, which is why the length is only approximate.
  4. The search term embedding is created.
  5. For each segment of the text, the embedding is created.
  6. Meanwhile, the cosine similarity is calculated between every segment embedding and the search term embedding. It's written to a dictionary with the segment as key and the score as value.
  7. For every iteration, the progress bar and the highlighted sections are updated in real-time depending on the highest scores in the array.
  8. The embeddings are cached in the dictionary so that subsequent queries are quite fast. The calculation of the cosine similarity is fairly speedy in comparison to the embedding generation.
  9. Only if the user changes the segment length do the embeddings need to be recalculated.
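An illustrative sketch of steps 3-8 (not the actual implementation; embed is a placeholder for the transformers.js call and cosineSimilarity is the function sketched earlier):

// Split on word boundaries into roughly `approxLen`-character segments.
function splitIntoSegments(text, approxLen) {
    const segments = [];
    let current = '';
    for (const word of text.split(/\s+/)) {
        if (current.length + word.length + 1 > approxLen && current) {
            segments.push(current);
            current = '';
        }
        current += (current ? ' ' : '') + word;
    }
    if (current) segments.push(current);
    return segments;
}

const embeddingCache = new Map(); // segment text -> embedding

// Embed each segment once, cache it, and score it against the query embedding.
async function scoreSegments(segments, queryEmbedding, embed) {
    const scores = {};
    for (const segment of segments) {
        if (!embeddingCache.has(segment)) {
            embeddingCache.set(segment, await embed(segment));
        }
        scores[segment] = cosineSimilarity(queryEmbedding, embeddingCache.get(segment));
    }
    return scores;
}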

Collaboration

PRs welcome!

To Dos (no prioritization)

  • similarity score cutoff/threshold
  • add option for more highlights (e.g. all above certain score)
  • add stop button
  • MaterialUI for input fields or proper labels
  • create a demo without CDNs
  • separate one html properly in html, js, css
  • add npm installation
  • option for loading embeddings from file or generally allow sharing embeddings in some way
  • simplify chunking function so the original text can be loaded without issues
  • improve the color range
  • rewrite the cosine similarity function in Rust, port to WASM and load as a module for possible speedup (experimental)
  • UI overhaul
  • polish code
    • jQuery/vanilla JS mixed
    • clean up functions
    • add more comments
  • add possible use cases
  • package as a standalone application (maybe with custom model choice; to be downloaded once from HF hub, then saved locally)
  • possible integration as example in transformers.js homepage

Star History

Star History Chart

Gource Map


Gource image created with:

gource -1280x720 --title "SemanticFinder" --seconds-per-day 0.03 --auto-skip-seconds 0.03 --bloom-intensity 0.5 --max-user-speed 500 --highlight-dirs --multi-sampling --highlight-colour 00FF00  

semanticfinder's People

Contributors

catmaniscatlord, do-me, lizozom, varunneal


semanticfinder's Issues

Highlight conflicts in Extension

There are some conflicts with the highlight CSS class on certain websites (including GitHub) when using the extension, which can make a page pretty unusable. I believe it's this file specifically: content.css

With the SemanticFinder extension behaving normally, note the highlight class name (screenshot: highlight1).

Result of changing highlight to un-highlight (screenshot: unhighlight1).

This can also be confirmed by going to a page where the yellow highlight is applied, then disabling the extension and reloading the page. It happens on a decent number of sites across the web because the highlight class name is fairly common, and apparently very common with code blocks and syntax highlighters.

A fix, I think, would be to just change highlight to something more uniquely named.

That being said, this is a great extension! It's really a nice tool for searching through pages for semantic similarity, and it's really fast. Thanks for making it available.

Use TypeScript

As part of my own work with this project, I converted the semantic search part of the project to TypeScript.
Is this something you would be interested in merging?
If so, I could make another PR :-)

[Feature Concept] A collection of pre-indexed texts

Intro

This is a feature concept for a public collection of indexed texts.

The main idea is that there is one (or more) space(s) with publicly available bodies of text and their embeddings. This is especially interesting for

  • large bodies of text
  • medium-sized bodies of text of high public interest

Examples: Bible, Torah, Quran, public docs like the IPCC report, studies or similar

E.g. indexing the King James Bible with 200 chars and a small model takes around 20 mins. Instead, loading a pre-indexed file into SemanticFinder would take a few seconds, and the cosine distance calculation, sorting etc. take only a second.

For small bodies of text (poems, one-pager docs etc.) it is not really relevant, as the initial indexing would be fast anyway.

Idea

My idea for the moment is a separate GitHub repo for the pre-indexed files containing the text, embeddings & settings

  • pro: separate from main repo for clear separation of concerns, the app vs. data repo (possibly allowing for other repos or places where the files could be stored)
  • con: GitHub file size limit of 50 MB; maybe keeping a few examples in the same repo would be good as a demonstration?

Not quite sure where else these files could be stored for free. If there is some kind of public interest, maybe looking for a sponsor would be an option?

Initially the 50 MB file size limit might even be enough, considering e.g. that the ~700 pages of the Bible are 4.35 MB uncompressed as plain text and ~1.32 MB gzipped (ultra setting with 7zip). Text is absolutely negligible from a file size point of view.

Considering embeddings, the largest test I ran was 134k embeddings in ~38 MB with 2-decimal precision, yielding good results.

In the case of the Bible, indexing it with 200 chars results in only ~23k embeddings, well under the above threshold, at probably ~7 MB.

Not quite sure which book/document would be much bigger.

Tasks

Considering that there is already a working minimal PoC, it shouldn't be too hard to update the logic for the current version of the app. However, in order to be sustainable, I think it's worth thinking about it more carefully:

  • Define a file specification. The simplest I can think of would be a JSON in the following format:
{
"title": "King James Bible", 
"meta": {"chunk_type": "chars", "chunk_size": 100, "model_name": "Xenova/bge-small", ...},
"embeddings": [
  {"text": "twenty thousand drams of gold, and two thousand pounds of",
  "embedding": [0.234, 0.242, 0.345, 0.131, ..., 0.234] }, 
  {...}
  ]
}

Concatenating all the text with spaces would then reconstruct the original text. This approach might have some disadvantages though, and it might be better to hold the original text as one node and then use indices like this:

{
"title": "King James Bible", 
"meta": {"chunk_type": "chars", "chunk_size": 100, "model_name": "Xenova/bge-small", ...},
"text": "all the bible text goes here .... plenty of chars! ",
"embeddings": [
{"index": [0, 99], "embedding": [0.234, 0.242, 0.345, 0.131, ..., 0.234] },
{"index": [100, 199], "embedding": [0.234, 0.242, 0.345, 0.131, ..., 0.234] },
{"index": [200, 299], "embedding": [0.234, 0.242, 0.345, 0.131, ..., 0.234] },
...
]
}

This would allow for double indexation, e.g. [0,100] alongside [0,9], [10,19], ..., so one could have coarse and fine-grained embeddings in the same file.
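A small sketch of how a chunk's text could be recovered from such index pairs (assuming the end index is inclusive, as in the example above):

// Recover a chunk's text from the shared "text" node via its [start, end] indices.
function chunkText(doc, chunk) {
    const [start, end] = chunk.index;
    return doc.text.slice(start, end + 1);
}

// chunkText(doc, { "index": [0, 99], ... }) would return the first 100 characters.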

Also, one could store the text and the indices separately and first load the text once, then load the first index for coarse search. Then, during the first search, already load a finer index in the background, and so on. UX could be very nice this way.

A gzipped JSON already offers quite a good compression and it's easy to use with pako.js.
However, using a modern format like parquet would offer other benefits like speed and further compression. E.g. with https://loaders.gl/docs/modules/parquet/api-reference/parquet-loader it could be loaded fairly easily.
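For example, loading such a gzipped JSON index client-side could look roughly like this (a sketch using pako's ungzip; the filename is one of the catalogue examples above):

import pako from 'pako';

// Fetch a pre-indexed .json.gz file and inflate it in the browser.
async function loadIndex(url) {
    const buffer = await (await fetch(url)).arrayBuffer();
    const json = pako.ungzip(new Uint8Array(buffer), { to: 'string' });
    return JSON.parse(json);
}

// const index = await loadIndex('King_James_Bible_24f6dc4c.json.gz');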

I would love to develop some kind of standardized format with e.g. the main open source vector DB organizations to make it interoperable. E.g. it would be so cool to send such a standardized file around and convert a "semantic JSON" file from the cloud directly to e.g. a SemanticFinder/Qdrant/Milvus/Pinecone collection and vice versa! I will ping some folks and hear their opinion about it.

For example one could simply copy and paste a text in SemanticFinder and do the indexation there with a bit of trial and error (playing around with the settings, chunking sizes etc.). When you find the right settings just hit "export to Qdrant/Milvus/Pinecone" and you're done.

  • Where to host these files, GitHub? Better free/sponsored options?
  • Choose some diverse and interesting examples (e.g. large religious, scientific, historic books etc.)
  • [Optional] It might require a rewrite of the core logic of SemanticFinder. Working with indices instead of the actual text in a dict might be better suited for this use case.
  • Implement the logic as a PoC with easy import/export functionality
  • For reduced file size: explore product quantization, e.g. with https://github.com/Lv-291/wasm-bhtsne

I would love to hear some opinions about this! (@VarunNSrivastava @lizozom)

CodeMirror marker handling

Right now there is code like:

SemanticFinder/index.html, lines 177 to 182 in 36f69d8:

function removeHighlights() {
    for (let marker of markers) {
        marker.clear();
    }
    markers = [];
}

Which can be simplified to:

function removeHighlights() {
    editor.getAllMarks().forEach(_ => _.clear());
}

This also removes the manual marker-tracking array and the marker-pushing code.

Just pointing it out, but maybe you plan something else with the markers array?

JSDoc and type-checking

Hey Dominik,

once you have separate files, it is easy to add type-checking.

Example JSDoc looks like:

/**
 * @param {number} len 
 * @param {number} overlap - Only used if respectWords === false
 * @param {boolean} [respectWords=true]
 * @returns {string[]}
 */
function splitOverlappingSubstrings(len, overlap, respectWords=true) {

You can enable type checking via jsconfig.json:

{
  "compilerOptions": {
    "checkJs": true,
    "strict": true
  }
}

(assuming you are using VS Code or any other IDE that respects this)

Design logo/icon

If anyone feels creative, it would be nice to have some kind of logo/icon for the project!
The original version is a little - let's say - rudimentary ;)


Advanced options: load any model from HF or allow local models

Right now, the model from HF is hard-coded and supports only English. While it is just one line of code and one could easily recompile the project, it would be nicer to have advanced options in the GUI for custom models, either from HF or local ones, e.g. to support other languages too.

However, usability shouldn't be sacrificed, so I'd stick to all-MiniLM-L6-v2 as the default because it works great for English and is super small. Under the advanced options section there should be either a drop-down or a combined text input field for a custom URL or for file uploads (local models).
Not sure what's best though: the latter seems more versatile and would clutter the UI less than having two different inputs (drop-down and text box), but a drop-down would be more convenient for the average user and would offer some space for notes (link to model explanation, supported languages etc.).
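A rough sketch of what the two options could look like with transformers.js (assuming its env settings for local models; model names and paths are placeholders, not what SemanticFinder currently does):

import { env, pipeline } from '@xenova/transformers';

// Option 1: any feature-extraction model from the Hugging Face Hub, passed by name.
const multilingual = await pipeline('feature-extraction', 'Xenova/multilingual-e5-small');

// Option 2: serve the model files locally instead of fetching them from the Hub.
env.allowRemoteModels = false;
env.localModelPath = '/models/';
const local = await pipeline('feature-extraction', 'my-local-model');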

Proposal: "Advanced features/options" Controls & UI Change

As mentioned in this comment, we can probably add a menu for "advanced features". One of these controls should be how we split:

  • split by # tokens/chars
  • split by sentence
  • split by clause

We should be able to hide these controls as well as the "# chars" and "threshold" parameter in the advanced features menu.

Additionally, I think it would be best to move the submit button onto the same line as "query".


If this proposal seems like a good idea I'll make the relevant PR.

Performance Improvements

Orama

I just found Orama, a dependency-free TS-based vector DB, which could be used instead of a simple JSON object.

I didn't find anything about performance yet, so I guess we should run our own tests and see whether performance improvements or simplified features like data import/export make it worth it. @VarunNSrivastava if you already have any opinions here, let me know!

Other

Besides, I noted that we could almost double the speed of the cosine similarity function we currently use: we recalculate the magnitude of the user query embedding for every iteration/comparison instead of calculating it once, persisting it, and reusing it.
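A sketch of that optimization (illustrative only, not the current SemanticFinder code):

// Compute the query magnitude once per search...
function magnitude(v) {
    let sum = 0;
    for (let i = 0; i < v.length; i++) sum += v[i] * v[i];
    return Math.sqrt(sum);
}

// ...and reuse it for every segment comparison.
function cosineSimilarityPrecomputed(query, queryMagnitude, segment) {
    let dot = 0, segSum = 0;
    for (let i = 0; i < query.length; i++) {
        dot += query[i] * segment[i];
        segSum += segment[i] * segment[i];
    }
    return dot / (queryMagnitude * Math.sqrt(segSum));
}

// const qMag = magnitude(queryEmbedding); // once
// const score = cosineSimilarityPrecomputed(queryEmbedding, qMag, segmentEmbedding);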

Embed content on page load

Since we can offload the work to a separate thread, we could consider embedding the content once when it loads.
Then the search operation becomes trivial and streaming the results is not necessary.

This would look something like this:

window.onload = async () => {
   await init(model);
   const content = getTextFromSomeMainEl();
   // Renamed so the chunk variable does not shadow the splitText() function.
   const chunks = splitText(content, strategy);
   const contentEmbedding = await embedContent(chunks);
};

(If the user chooses to change the split strategy, we would re-run this logic.)

Then when searching, this only becomes an issue of:

  1. embedding the query
  2. running cosine similarity on the embedding map

What do you think?

(Eventual) GPU Acceleration

GPU Acceleration of transformers is possible, but it is hacky.

It requires an unmerged-PR version of transformers.js that relies on a patched version of onnxruntime-node.

Xenova plans on merging this PR only after onnxruntime has official support for GPU Acceleration. In the meantime, this change could be implemented, potentially as an advanced "experimental" feature.
