MoodMuse

License

An app for discovering poetry using embedding-based semantic retrieval

What is semantic search and why do we want it?

Demo

MoodMuse: Demo app

Features

  • Open-ended discovery of poetry based on emotions, themes, objects or settings.
  • Efficient Approximate Nearest Neighbors (ANN) search using NGT

Overview

The app happened because I wanted to understand semantic search. I figured out the basics using the millawell/wikipedia_field_of_science dataset, but wanted to make something that would be fun to use myself and maybe share with friends. So I decided to make something that helps me find better poetry.

Data

~16,000 English poems scraped from poetryfoundation.org

Embeddings and data can be accessed on Google Drive

Modelling notes

Starting from the MTEB leaderboard and the models listed in the Sentence-Transformers documentation, I tested about 10-15 different models.

Contrary to intuition, larger language models didn't necessarily have better embeddings. This worked out great because the larger models also take much longer to embed and create much larger embeddings.

Embedding-as-a-service platforms like OpenAI are fast, but those embeddings were not great. The results from these larger models tend to have a much vaguer connection to the query than is ideal; some vagueness is good, too much isn't. And embedding large swaths of text and holding it in a vector DB somewhere is much tougher with these services.

The models that are trained for asymmetric retrieval were inferior to the ones trained on symmetric search. This too is counterintuitive. all-mpnet-base-v2 was the best sentence-transformer model, although BAAI/bge-base-en and thenlper/gte-base were also good.
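As a rough illustration of how these comparisons were run, here is a minimal sketch that embeds a few hand-written queries with each of the models named above and prints the nearest poems. The queries and poem snippets are placeholder examples, not the actual evaluation set.

```python
# Rough sketch of the model shoot-out: embed a few hand-written "mood" queries
# with each candidate model and inspect the nearest poems. The queries and
# poem snippets below are placeholders, not the actual evaluation set.
from sentence_transformers import SentenceTransformer, util

candidates = [
    "sentence-transformers/all-mpnet-base-v2",
    "BAAI/bge-base-en",
    "thenlper/gte-base",
]
queries = ["grief after losing someone", "the smell of rain on hot earth"]
poems = [
    "Do not go gentle into that good night ...",
    "Because I could not stop for Death ...",
]

for name in candidates:
    model = SentenceTransformer(name)
    poem_emb = model.encode(poems, normalize_embeddings=True)
    query_emb = model.encode(queries, normalize_embeddings=True)
    # hits[i] is a list of {"corpus_id": ..., "score": ...} for queries[i]
    hits = util.semantic_search(query_emb, poem_emb, top_k=2)
    for query, query_hits in zip(queries, hits):
        print(name, "|", query, "->", [poems[h["corpus_id"]][:30] for h in query_hits])
```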

The main problem with these models is that the max_seq_length is generally much smaller than the text that needs to be embedded. This makes for a great representation of the first 300-500 or so characters and then no representation of the rest of the text. To solve this, I tried chunking the text and max-pooling the chunk embeddings, which definitely improved the results, but I wanted more.
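A minimal sketch of that chunk-and-max-pool workaround, assuming simple character-based chunking (the chunk size and model choice here are illustrative):

```python
# Chunk-and-max-pool workaround for a short max_seq_length: split a long poem
# into chunks, embed each chunk, and take the element-wise max across chunk
# embeddings as the poem's vector. The 1000-character chunk size is arbitrary.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def embed_long_text(text: str, chunk_chars: int = 1000) -> np.ndarray:
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)] or [text]
    chunk_embs = model.encode(chunks, normalize_embeddings=True)  # (n_chunks, dim)
    pooled = chunk_embs.max(axis=0)                               # element-wise max-pool
    return pooled / np.linalg.norm(pooled)                        # re-normalize for cosine search
```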

Further searching led to jinaai/jina-embeddings-v2-base-en, which was the best-performing embedding model. These folks have figured out a way to ingest up to 8192 tokens using ALiBi. They also have a fine-tuning library that looks very interesting, and they seem like a good alternative to the OpenAIs/Anthropics of the world.
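Loading the Jina model through sentence-transformers requires trust_remote_code; a minimal sketch of embedding full poems with it (the poems list is a placeholder for the real corpus):

```python
# Embedding full poems with the Jina long-context model; trust_remote_code is
# needed because the model ships custom modelling code.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)
model.max_seq_length = 8192  # ALiBi lets the model handle long inputs

poems = ["full text of poem 1 ...", "full text of poem 2 ..."]  # placeholder corpus
corpus_embeddings = model.encode(
    poems,
    normalize_embeddings=True,  # unit vectors, so cosine similarity == dot product
    show_progress_bar=True,
)
```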

Sentence-Transformers recommends using a reranking model. I tried a few, and while they do marginally improve the results, the improvement was not enough to justify the extra work.
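For reference, a minimal sketch of the reranking step that was tried, using one of the cross-encoder models from the Sentence-Transformers docs (the specific model here is illustrative):

```python
# Reranking sketch: score each (query, poem) pair with a cross-encoder and
# re-sort the ANN candidates by that score. Ultimately dropped from the app.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]
```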

Indexing and retrieval

Following the Pinecone guide and the ANN Benchmarks site, I tried out Neighborhood Graph and Tree (NGT), FAISS, and HNSW extensively on multiple datasets. I found that on smaller datasets NGT and FAISS work best, and on larger datasets the difference between the three is negligible. This could be because I didn't try out large enough datasets. The differences are small and some hyperparameter tuning could improve things. I implemented NGT in the app because I like Japan and I don't like Facebook.
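A minimal sketch of building and querying the NGT index with ngtpy; the embeddings file and the stand-in query are illustrative, and the dimension comes from whatever model produced the embeddings:

```python
# Building and querying the NGT index with ngtpy. The embeddings file is a
# hypothetical path; the stand-in query would be a real query embedding in the app.
import ngtpy
import numpy as np

embeddings = np.load("poem_embeddings.npy")  # shape (n_poems, dim)

ngtpy.create("poems.ngt", embeddings.shape[1], distance_type="Cosine")
index = ngtpy.Index("poems.ngt")
index.batch_insert(embeddings)
index.save()

query_embedding = embeddings[0]                  # stand-in for an embedded user query
results = index.search(query_embedding, size=5)  # -> [(object_id, distance), ...]
```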

Tech stack/Process

  1. Embed the corpus with jina-embeddings-v2-base-en
  2. Index the embeddings using NGT
  3. Embed query using the same model
  4. Search NGT index using query embedding, retrieving based on cosine similarity
  5. Look up top results in a pandas dataframe that has the text of the poems (don't judge me, it's just 50MB and a db is too much work)
  6. Serve the top 5 hits using an Anvil app
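A condensed sketch of the query-time path (steps 3-5), assuming the index and dataframe built in the earlier steps; file and column names are illustrative:

```python
# Query-time path (steps 3-5): embed the query with the same model, search the
# NGT index, and look the hits up in the poems dataframe. File and column
# names are illustrative.
import ngtpy
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)
index = ngtpy.Index("poems.ngt")
poems_df = pd.read_parquet("poems.parquet")  # one row per poem, same order as inserted into the index

def search(query: str, top_k: int = 5) -> pd.DataFrame:
    query_emb = model.encode(query, normalize_embeddings=True)
    hits = index.search(query_emb, size=top_k)       # [(object_id, distance), ...]
    rows = [object_id - 1 for object_id, _ in hits]  # NGT object IDs are 1-based
    return poems_df.iloc[rows][["title", "author", "text"]]
```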

Resources

The app takes great inspiration from the excellent Vicki Boykis, who, around the same time that I began puttering around with semantic search, was doing the same and shared her findings in great detail. Her app for finding books by vibes, Viberary, is excellent, and her research on this subject was a major source of information.

Pinecone has a great online book on NLP for semantic search

The Sentence-Transformers documentation and GitHub repo are filled with great instructions and examples on how to train, embed, retrieve, etc. This site was open all the time for the last few months.

Wattenberg, et al., "How to Use t-SNE Effectively", Distill, 2016. http://doi.org/10.23915/distill.00002

Interesting papers

Mengzhao Wang, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. 2021. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proc. VLDB Endow. 14, 11 (July 2021), 1964–1978. https://doi.org/10.14778/3476249.3476255

Pretrained Transformers for Text Ranking: BERT and Beyond (Yates et al., NAACL 2021)
