sse: semantic search on the edge

Query your documents by semantic similarity.

Usage

sse crawl <project> <dir> [--include pattern] [--exclude pattern]
    - Crawls the given project directory, including or excluding files that match certain patterns.

sse query <project> <query> [--limit number]
    - Runs a query on a specific project.

sse help
    - Displays this help message.
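The command layout above can be sketched with Python's standard argparse module. This is only an illustration of the documented interface, not the project's actual main.py: the function name build_parser and the choice of argparse are assumptions, and only the subcommands and flags listed above are used.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented usage (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="sse", description="sse: semantic search on the edge"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    # sse crawl <project> <dir> [--include pattern] [--exclude pattern]
    crawl = subparsers.add_parser("crawl", help="Crawl a project directory")
    crawl.add_argument("project")
    crawl.add_argument("dir")
    crawl.add_argument("--include", metavar="pattern", default=None)
    crawl.add_argument("--exclude", metavar="pattern", default=None)

    # sse query <project> <query> [--limit number]
    query = subparsers.add_parser("query", help="Run a query on a project")
    query.add_argument("project")
    query.add_argument("query")
    query.add_argument("--limit", metavar="number", type=int, default=None)

    # sse help
    subparsers.add_parser("help", help="Display the help message")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this sketch, parsing ["query", "hammurabi", "how to render an avatar", "--limit", "3"] yields a namespace whose command is "query" and whose limit is the integer 3.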

Example response

sse query hammurabi 'how to render an avatar'
  src/lib/babylon/avatars/AvatarRenderer.ts [distance 12.820]
  src/lib/babylon/avatar-rendering-system.ts [distance 13.001]
  src/lib/babylon/avatars/adr-65/customizations.ts [distance 13.209]

Setup

Warning: this project is in a very early alpha stage.

  1. Create a PostgreSQL database with the pgvector extension, for example by running the statements below. Run CREATE EXTENSION while connected to the new datastore database, because PostgreSQL enables extensions per database:
     CREATE ROLE datastore WITH LOGIN PASSWORD 'password' CREATEDB;
     CREATE DATABASE datastore;
     GRANT ALL PRIVILEGES ON DATABASE datastore TO datastore;
     CREATE EXTENSION IF NOT EXISTS vector;
  2. Modify sse/load_settings.py to your liking.
  3. Install dependencies: pip install psycopg2 requests flask
  4. Alias main.py as sse on your PATH.

Indexing folders

Usage: sse crawl <project> <dir>, for example, sse crawl my-project ~/Notes. To index documents, sse runs the following algorithm:

let $SETTINGS be the result of parsing JSON from the file ~/.config/sse.json or the Unix environment variable $SSE_CONFIG
let $DB be a connection to the PostgreSQL database defined in $SETTINGS
create tables "source_embeddings", "file_info", and "chunk_embedding" if they do not exist in $DB
let $SRC be the files under <dir> that pass the --include and --exclude patterns

for each $FILE in $SRC:
  let $CONTENTS be the contents of the $FILE
  let $SHASUM be the sha256 sum of $CONTENTS, base32 encoded
  let $ABS_PATH be the absolute path of $FILE
  let $MODEL_ID be the embedding model id defined in $SETTINGS
  let $CHUNKS be $CONTENTS split into chunks whose size is determined by the model's tokenizer
  let $DATE be the current timestamp
  store $SHASUM, $MODEL_ID, $DATE in table "source_embeddings"
  store unique id, $ABS_PATH, $SHASUM, $DATE in table "file_info"
  for each $CHUNK in $CHUNKS:
    let $CHUNK_HASH be the sha256 sum of $CHUNK, base32 encoded
    let $CHUNK_EMBEDDING be the result of running the embedding model on $CHUNK
    store unique id, $MODEL_ID, $CHUNK_HASH, $CHUNK, $CHUNK_EMBEDDING, $SHASUM, the index of $CHUNK in $CHUNKS, $DATE in table "chunk_embedding"
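The per-file part of the algorithm above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names (sha256_base32, split_into_chunks, fake_embedding, build_records) are hypothetical, the whitespace splitter stands in for the model's tokenizer, the embedding function is a deterministic placeholder rather than a real model, and the rows are returned as tuples instead of being written to PostgreSQL (the unique ids are left to the database).

```python
import base64
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_base32(data: bytes) -> str:
    """Return the SHA-256 digest of data, base32 encoded."""
    return base64.b32encode(hashlib.sha256(data).digest()).decode("ascii")


def split_into_chunks(text: str, chunk_size: int = 200) -> list:
    """Assumed stand-in for the model's tokenizer: split on whitespace and
    group the words into chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def fake_embedding(chunk: str, dimensions: int = 8) -> list:
    """Placeholder for the embedding model: deterministic floats derived
    from the chunk's hash, NOT a semantic embedding."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dimensions]]


def build_records(path: Path, model_id: str, chunk_size: int = 200) -> dict:
    """Compute the rows the algorithm stores for one file, keyed by table."""
    contents = path.read_bytes()
    shasum = sha256_base32(contents)
    date = datetime.now(timezone.utc)
    text = contents.decode("utf-8", errors="replace")
    chunks = split_into_chunks(text, chunk_size)
    return {
        "source_embeddings": (shasum, model_id, date),
        "file_info": (str(path.resolve()), shasum, date),
        "chunk_embedding": [
            (
                model_id,
                sha256_base32(chunk.encode("utf-8")),
                chunk,
                fake_embedding(chunk),
                shasum,
                index,  # position of the chunk within the file
                date,
            )
            for index, chunk in enumerate(chunks)
        ],
    }
```

Because both the file and each chunk are identified by a content hash, a crawler built this way could in principle compare hashes to skip unchanged content. Whether sse does so is not stated in the algorithm above.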

Dependencies
