kejun / mdb-search

This project forked from dvsander/mdb-search

Example application querying data in different ways

Home Page: http://mdb-search-lb-1487448759.eu-west-3.elb.amazonaws.com/

License: GNU General Public License v3.0

Shell 3.09% Python 39.44% HTML 57.47%

mdb-search's Introduction

Database search, Relevance search, Semantic embeddings search with MongoDB

TL;DR: A hacked-together web app with a MongoDB Atlas backend using different search queries.

Skip to the live demo (no guarantees)

Introduction

Offering a great search experience in an application can be difficult, but it does not need to be.

This application combines several search techniques available in MongoDB on an operational dataset of movies. MongoDB is a popular document database known for its transactional and analytical capabilities on structured and semi-structured data stored in a JSON-like format. Relevance search and semantic vector search are available on the same platform and through the same query language, so adding them does not require a separate system. As a vector database, MongoDB can also store vector embeddings (high-dimensional vectors) of unstructured data such as text, images, or audio, which makes it possible to find and retrieve similar objects quickly. The techniques used here are:

  • transactional database search (MongoDB),
  • relevance search with MongoDB Atlas Search (Lucene),
  • semantic search with MongoDB Atlas Vector Search based on embeddings for text (text-embedding-ada-002),
  • semantic search with MongoDB Atlas Vector Search based on embeddings for images (clip-ViT-B-32).

Atlas Search provides relevance search and scoring based on open-source Lucene indexes. Here, I use it to search for relevant movies with language support and typo correction.

[Image: Relevance search]
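A relevance query of this kind can be expressed as an aggregation pipeline that starts with a `$search` stage. The sketch below only builds the pipeline as plain Python data. The function name, the wildcard path, the `maxEdits` value, and the projected fields are assumptions for illustration, not the app's actual query.

```python
def build_relevance_pipeline(query, limit=10, index="default"):
    """Build an Atlas Search pipeline with typo-tolerant text matching.

    Hypothetical helper: the field names and fuzzy settings are assumptions.
    """
    return [
        {
            "$search": {
                "index": index,
                "text": {
                    "query": query,
                    # Search every dynamically indexed field.
                    "path": {"wildcard": "*"},
                    # Tolerate one character edit per term (typo correction).
                    "fuzzy": {"maxEdits": 1},
                },
            }
        },
        {"$limit": limit},
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                # Expose the relevance score computed by Atlas Search.
                "score": {"$meta": "searchScore"},
            }
        },
    ]
```

With pymongo, a pipeline like this would typically be passed to `collection.aggregate(pipeline)`.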

Each movie's text plot is run through OpenAI's embedding API, and the resulting text-embedding-ada-002 embeddings are stored in MongoDB. The user's prompt is embedded the same way and used to query the vector index for similar content. You can search on your own input, or run a similarity search based on an existing movie's plot.

[Image: Semantic Text search]
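Once the prompt has been embedded, the lookup itself is again an aggregation pipeline. The README does not state which query operator the app uses; the sketch below assumes the `$vectorSearch` stage and builds the pipeline as plain Python data. The helper name, the `num_candidates` default, and the projected fields are hypothetical, and the call to the embedding API is left out.

```python
def build_vector_pipeline(query_vector, path="plot_embedding", limit=10,
                          num_candidates=100, index="default"):
    """Build a $vectorSearch pipeline for an already-computed embedding.

    Hypothetical helper: assumes the app queries with $vectorSearch.
    """
    if not query_vector:
        raise ValueError("query_vector must not be empty")
    if num_candidates < limit:
        raise ValueError("num_candidates must be >= limit")
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,  # field holding the stored embeddings
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,  # neighbors considered
                "limit": limit,  # documents returned
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                # Expose the similarity score of each match.
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

For the "similar to an existing movie" case, the stored embedding of that movie would be passed as `query_vector` instead of a freshly embedded prompt.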

Each movie's poster image is encoded by clip-ViT-B-32, and those image embeddings are stored in MongoDB. The user can find movies whose poster images are similar to their query.

[Image: Semantic Image search]

The document structure looks as follows. The fields, nested objects, and arrays with operational data are shown in blue; these are queried with database search and Atlas Search relevance search. This project adds the fields in yellow: a base64 representation of the movie poster, ada OpenAI text embeddings, and CLIP image embeddings, all queried with Atlas Vector Search.

[Image: Document Structure]

Set-up environment

You need python3 and pip.

python3 --version
python3 -m ensurepip --upgrade
pip3 install -r requirements.txt

You need a MongoDB Atlas cluster. This can be a free cluster, created on cloud.mongodb.com. Ensure that database access and network access allow you to connect to the database. Note that free clusters have size and performance limitations; feel free to run this on a small paid cluster with a lot more data.

You need to set some local environment variables; these can go in a local .env file:

MDB_CONN=<YOUR MongoDB Atlas connection string>
DB="sample_mflix"
COLL="embedded_movies"
OPENAI_API_KEY=<YOUR OpenAI API key>
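As a minimal sketch of how these variables might be read at startup, the snippet below checks that all four are present and fails early with a clear message otherwise. The helper name `load_settings` is hypothetical, and the app itself may load the .env file differently (for example with a package such as python-dotenv).

```python
import os

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("MDB_CONN", "DB", "COLL", "OPENAI_API_KEY")


def load_settings(env=None):
    """Return the required settings, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {key: env[key] for key in REQUIRED_KEYS}
```

Calling `load_settings()` with no argument reads from the process environment; passing a dict is handy for testing.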

Preparing the data

Clone the mdb-search-data repo.

In there you are offered 2 options: restoring from backup or generating the embeddings yourself locally.

Enabling the relevance search and vector search in MongoDB Atlas

In Atlas, open the Search tab in the cluster view and enter the following JSON configuration. Use the default index name and make sure to create the index on the embedded_movies collection. This is the magic that enables dynamic full-text search on all fields, as well as the vector search indexes. No data copy needed :o

{
    "mappings": {
        "dynamic": true,
        "fields": {
            "plot_embedding": {
                "dimensions": 1536,
                "similarity": "cosine",
                "type": "knnVector"
            },
            "poster_embedding": {
                "dimensions": 512,
                "similarity": "cosine",
                "type": "knnVector"
            }
        }
    }
}
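To make the `dimensions` and `similarity` settings concrete, here is a small pure-Python sketch of what they imply: every stored or queried vector must have the indexed length, and closeness is measured by cosine similarity. The helper names are hypothetical, and Atlas computes similarity inside the index rather than in application code.

```python
import math

# Dimensions taken from the index definition above.
EXPECTED_DIMENSIONS = {"plot_embedding": 1536, "poster_embedding": 512}


def check_dimensions(field, vector):
    """Raise if a vector does not match the indexed dimension for a field."""
    expected = EXPECTED_DIMENSIONS[field]
    if len(vector) != expected:
        raise ValueError(
            f"{field} expects {expected} dimensions, got {len(vector)}"
        )


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, which is why a text embedding (1536 dimensions) cannot be compared against a poster embedding (512 dimensions).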

Time to run it

This is a Flask Python3 web-app.

Start the Flask app like this:

flask --app app run

Or, using the helper, just run it with python like this:

python app.py

You can access the web app at http://localhost:5000.

You can now:

  • use the full text search from the input field to find 'any' random set of movies with relevance search
  • use the OpenAI text embeddings search to find movies similar to the sentiment of the text you enter, sounds exotic!
  • click the button on one of the movies and see 'similar movie posters', see what happens :)

Trust the ML and the embedding model. Can you guess why these pictures are similar?

mdb-search's People

Contributors

dvsander
