tensorlakeai / indexify

A realtime indexing and structured extraction engine for unstructured data, for building Generative AI applications

Home Page: https://getindexify.ai

License: Apache License 2.0

Rust 89.18% Dockerfile 0.22% Shell 0.10% Makefile 0.74% Python 4.62% TypeScript 4.60% Ruby 0.07% HTML 0.27% CSS 0.19%
llm machine-learning retrieval

indexify's People

Contributors

ak-gautam, akira, bamdadd, braedennorris, burzinpatel, catsby, delip, diptanu, jackbackes, khshah6, kitrak-rev, lucasjacks0n, mohit-raghavendra, nirantk, oleksii-shyman, rakshith-ravi, rylandg, shabani1, stangirala, tushar5526, vidhyaarvind, vijay2win, yenicelik


indexify's Issues

Improve Search Results API

  • Provide the ID of the content whose chunk is being returned
  • Add a /repository/get_text API to retrieve the text for a given content ID
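
A rough client-side sketch of how these two changes might be used together; the search payload shape, the content_id field name, and the placement of the get_text endpoint are all assumptions, not the final API:

import requests

BASE = "http://localhost:8900"

# Hypothetical request/response shapes, not the final API.
results = requests.post(
    f"{BASE}/repositories/default/search",
    json={"index": "text_embeddings", "query": "best basketball player", "k": 3},
).json()

for chunk in results.get("results", []):
    content_id = chunk["content_id"]  # the ID of the content the chunk was cut from
    # Proposed get_text API: fetch the full text back for that content ID.
    full_text = requests.get(
        f"{BASE}/repositories/default/get_text",
        params={"content_id": content_id},
    ).json()
    print(content_id, str(full_text)[:80])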

Better error for wrong extractor name

We should return an error saying extractor name {} not found when a user sends an extraction request. Currently the coordinator throws a generic error that is not easy to understand.

Add error to return type of extractor

Extractors may (gracefully) fail. In these cases, they should probably return an error type.
If they don't return an error, we have two options:
(1) They return an empty object: in this case it looks as if the data has been processed, which would be false.
(2) The extractor panics: in this case another set of extractors will retry extracting the initial data, and we will be busy for no reason.
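
A minimal Python sketch of the proposed error return; ExtractedChunk and ExtractorError are made-up names for illustration, not indexify types:

from dataclasses import dataclass
from typing import List, Union


@dataclass
class ExtractedChunk:
    text: str


@dataclass
class ExtractorError:
    message: str


class MyExtractor:
    def extract(self, text: str) -> Union[List[ExtractedChunk], ExtractorError]:
        try:
            return [ExtractedChunk(text=line) for line in text.splitlines() if line]
        except Exception as e:
            # Graceful failure: an explicit error is distinguishable from an
            # empty (but "successful") result, and avoids a panic that would
            # cause other executors to retry the same content for no reason.
            return ExtractorError(message=str(e))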

Error on `docker-compose up`

When going through the Getting Started guide, there is an error at the docker-compose up step.

docker-compose up
...
...
embedding-extractor-1  | 2024-01-26T21:03:58.514255Z  INFO indexify::executor_server: registering executor with coordinator at address 172.21.0.2:8950
embedding-extractor-1  | 2024-01-26T21:03:58.555997Z ERROR indexify::executor_server: unable to register : unable to register executor: Status { code: Unimplemented, message: "grpc-status header missing, mapped from HTTP status code 404", metadata: MetadataMap { headers: {"content-length": "0", "date": "Fri, 26 Jan 2024 21:03:58 GMT"} }, source: None }
embedding-extractor-1  | 2024-01-26T21:04:03.513452Z  INFO indexify::executor_server: registering executor with coordinator at address 172.21.0.2:8950
embedding-extractor-1  | 2024-01-26T21:04:03.514135Z ERROR indexify::executor_server: unable to register : unable to register executor: Status { code: Unimplemented, message: "grpc-status header missing, mapped from HTTP status code 404", metadata: MetadataMap { headers: {"content-length": "0", "date": "Fri, 26 Jan 2024 21:04:03 GMT"} }, source: None }
embedding-extractor-1  | 2024-01-26T21:04:08.513652Z  INFO indexify::executor_server: registering executor with coordinator at address 172.21.0.2:8950
embedding-extractor-1  | 2024-01-26T21:04:08.514342Z ERROR indexify::executor_server: unable to register : unable to register executor: Status { code: Unimplemented, message: "grpc-status header missing, mapped from HTTP status code 404", metadata: MetadataMap { headers: {"content-length": "0", "date": "Fri, 26 Jan 2024 21:04:08 GMT"} }, source: None }
embedding-extractor-1  | 2024-01-26T21:04:13.514375Z  INFO indexify::executor_server: registering executor with coordinator at address 172.21.0.2:8950
embedding-extractor-1  | 2024-01-26T21:04:13.515315Z ERROR indexify::executor_server: unable to register : unable to register executor: Status { code: Unimplemented, message: "grpc-status header missing, mapped from HTTP status code 404", metadata: MetadataMap { headers: {"content-length": "0", "date": "Fri, 26 Jan 2024 21:04:13 GMT"} }, source: None }

Make `extractor package` more verbose

cargo run extractor package should be more verbose. Example error logs when installing dependencies:

Error: process "/bin/sh -c pip3 install --no-input tf2onnx git+https://github.com/huggingface/transformers.git accelerate pymupdf python-Levenshtein nltk install deepdoctection[pt] python-poppler pdf2image huggingface-hub ctransformers[cuda]" did not complete successfully: exit code: 1

and

starting indexify packager, version: git branch: david/invoice-extractor - sha:191ad85da46b537fffe5f3bc0883dd28aacd3cad
process "/bin/sh -c apt-get install -y  tesseract-ocr libpoppler-dev poppler-utils tesseract-ocr-eng libtesseract-eng" did not complete successfully: exit code: 100
Error: process "/bin/sh -c apt-get install -y  tesseract-ocr libpoppler-dev poppler-utils tesseract-ocr-eng libtesseract-eng" did not complete successfully: exit code: 100

So the user has to manually figure out which package is causing the issue.

Make testing extractors easier

New extractors can be tested by running the server and binding them to a repository. This is cumbersome because the feedback cycle is not fast.

Solution -
Create a sub-command under the indexify binary to run just an extractor

indexify executor run-extractor --python-module foobar.ExtractorClassName --content text-from-which-we-are-extracting --params `{"foo": "bar"}`

The above example allows a developer who is working on a new extractor in foobar.ExtractorClassName to load the extractor in their terminal, pass some text to it, and also pass the extractor's input parameters encoded as JSON.

In the future when we add support for images, videos, or any other blobs, we could enhance this to pass the pointer to files.
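
A minimal sketch of what such a sub-command could do under the hood, assuming extractors are addressable as module.ClassName and expose an extract(content, params) method (both assumptions for illustration):

import importlib
import json


def run_extractor(python_module: str, content: str, params_json: str = "{}"):
    # e.g. python_module = "foobar.ExtractorClassName"
    module_path, class_name = python_module.rsplit(".", 1)
    extractor_cls = getattr(importlib.import_module(module_path), class_name)
    extractor = extractor_cls()
    params = json.loads(params_json)
    # Assumes the extractor exposes extract(content, params); returning the
    # result to the terminal gives the developer fast feedback without a server.
    return extractor.extract(content, params)


print(run_extractor("foobar.ExtractorClassName", "text-from-which-we-are-extracting", '{"foo": "bar"}'))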

Error running extractors in developer guide

When developing locally, the Developing Indexify guide has some syntax for running extractors:

git clone https://github.com/tensorlakeai/indexify-extractors.git
indexify extractor  start --coordinator-addr localhost:8950 -c /path/to/indexify-extractors/embedding-extractors/minilm-l6/indexify.yaml

This worked (mostly) early yesterday (01/24/2024), but broke after #255. Specifically, the -c option was removed in #255, and now we fail a check in src/extractor/py_extractors.rs around lines 114-117:

let tokens: Vec<&str> = extractor_path.split(':').collect();
if tokens.len() != 2 {
    return Err(anyhow!("invalid extractor path: {}", extractor_path));
}

Before #255, when running an extractor with the -c flag, the resulting tokens variable looked like this (using println!("tokens: {:?}", tokens)):

tokens: ["minilm_l6_embedding", "MiniLML6Extractor"]

Now that -c is removed, the developer documentation needs to be updated to show how to run an extractor separately, unless that's no longer needed? I think it still is, though.

I tried various forms of --extractor-path but never seemed to get it right. Whatever I put there ends up in the tokens variable, but since I'm only specifying one token, the tokens.len() != 2 check always fails.

Bindings should return index names

Extractor bindings trigger extractors when content is added to Indexify. When a binding is added to the system, we should return the names of the indexes it will create, so developers can learn the names without having to list the indexes.
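
A hypothetical client-side view of the change; the import path, method signature, and return shape below are all assumptions, not the real SDK:

from indexify import IndexifyClient  # assumed import path

client = IndexifyClient()
repo = client.get_repository("default")
# Assumed behaviour: binding reports the indexes it will create, so the
# developer doesn't have to call a list-indexes API afterwards.
index_names = repo.bind_extractor("minilm-l6", index_name="embeddings")
print(index_names)  # e.g. ["embeddings"]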

Remove extractors_tests folder

Extractors are generally dockerized and not accessible through our local Python environment (except the occasional extractor kept for cargo test purposes). We can remove the extractors_tests directory as it is obsolete. We should figure out a new way to test extractors (docker run --rm <extractor> with simple test cases works well!).

Publish Indexify binaries for download

Indexify binaries need to be published as releases and made available for download. Write a GitHub workflow that we can trigger manually to build the binaries for now; in the future we can trigger it when we tag the main branch for new releases.

We need binaries for -

  1. Linux
  2. Windows
  3. Mac (M series only)

Support adding contents in the form of blobs

Indexify needs a mechanism to upload blobs such as PDFs, images, and video files, or to reference a link to a blob store where they are already stored.

The API on the server for uploading blobs -

POST
/repositories/{repository}/files

.. and when the blob is already on a blob store and Indexify can read from there -

POST
/repositories/{repository}/remote_files

The config of the server should accept a blob storage backend -

pub struct S3Config {
    region: String,
    bucket: String,
}

pub struct BlobStorageBackend {
    s3_backend: Option<S3Config>,
    file_system: Option<FSConfig>,
}

Under the hood we can create a table to track both files uploaded directly to Indexify and files we are going to download from remote storage services.

The executors should be passed the URLs of the blobs for extraction. The executor will download them directly and feed them into the extractors.
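
A hedged sketch of how a client might use the two proposed endpoints; the multipart field name and the JSON payload shape are guesses:

import requests

BASE = "http://localhost:8900"

# Direct upload of a local blob (field name "file" is a guess).
with open("invoice.pdf", "rb") as f:
    requests.post(f"{BASE}/repositories/default/files", files={"file": f})

# Registering a blob that already lives in a blob store Indexify can read from
# (payload shape is a guess).
requests.post(
    f"{BASE}/repositories/default/remote_files",
    json={"urls": ["s3://my-bucket/invoices/invoice.pdf"]},
)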

Update global state from content

Certain structured extraction tasks involve maintaining a global state as content is ingested, such as a summary of chat messages or email threads, or updated code snippets which are used by an agent or another extractor for structured extraction and planning. For example, Evaporate uses a two-step process for structured extraction: the second step prompts the LLM to synthesize code by giving it documents and the important attributes of those documents.

Make Indexify manage global state derived from ingested content, with sampling functions to filter content. Instead of only writing extracted information as rows or columns in the underlying storage, it should also have the concept of updating an artifact.

Extractor(Content, State) -> State^
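
An illustrative Python sketch of that shape, where the maintained artifact is a running chat summary rather than rows in a table; all names here are made up:

from dataclasses import dataclass


@dataclass
class State:
    summary: str = ""
    messages_seen: int = 0


def summarize(previous_summary: str, message: str) -> str:
    # Placeholder for an LLM call that folds the new message into the summary.
    return (previous_summary + " | " + message).strip(" |")


def chat_summary_extractor(content: str, state: State) -> State:
    # Returns the updated artifact instead of emitting rows/columns.
    return State(
        summary=summarize(state.summary, content),
        messages_seen=state.messages_seen + 1,
    )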

Improve the local extraction process

Currently, when a developer runs the extraction command for an extractor that has not been downloaded yet, they have to pull the container themselves; the indexify extract command doesn't do that. Secondly, the container can't be run twice unless the user manually deletes it with docker rm and invokes extract again.

  1. Indexify should download the container if it doesn't exist.
  2. Indexify should remove the old container before creating a new one.

Error in Getting Started docs

https://getindexify.io/getting_started/#data-repository

Following is the code for the python client.

repo = client.get_repository("default")
repo.add_documents([
    "Indexify is amazing!",
    "Indexify is a retrieval service for LLM agents!",
    "Kevin Durant is the best basketball player in the world."
])

Equivalent curl in the same doc

curl -v -X POST http://localhost:8900/repositories/default/add_texts \
-H "Content-Type: application/json" \
-d '{"documents": [ 
        {"text": "Indexify is amazing!"},
        {"text": "Indexify is a retrieval service for LLM agents!"}, 
        {"text": "Kevin Durant is the best basketball player in the world."}
    ]}'

The client library should convert the list of strings into equivalent dict objects with a text key, or the user should provide a list of dicts with text as the key.

repo = client.get_repository("default")
repo.add_documents([
    {"text": "Indexify is amazing!"},
    {"text": "Indexify is a retrieval service for LLM agents!"},
    {"text": "Kevin Durant is the best basketball player in the world."},
])

Extractors should expose accepted content type

Extractors should expose the content types they accept in the API. That would allow us to avoid dispatching content to them when the content's MIME type doesn't match what they can handle. And if the user sends a MIME type that is not accepted, we should reject the request at the API boundary.
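
A small sketch of what this could look like on the extractor side; the accepted_content_types attribute and the dispatch check are assumptions:

class PdfTextExtractor:
    # Declared up front so the server can read it from the extractor's schema.
    accepted_content_types = ["application/pdf"]

    def extract(self, data: bytes):
        ...


def can_dispatch(extractor, mime_type: str) -> bool:
    # The coordinator would consult this before creating a task for an
    # executor; the API server would reject unsupported MIME types outright.
    return mime_type in getattr(extractor, "accepted_content_types", [])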

Add Data Transformers to Data Repository

Content is extracted when a developer binds an extractor to a data repository. As new content lands, the extractors are applied to it and the derived information is written to indexes.

Extractors are responsible for chunking content, for example splitting the text in a document before it is embedded. Certain extractors, like NER and embedding extractors, could share the same chunked content since the context length of their underlying models is limited. Currently these extractors duplicate the text-splitting work.

The solution would be to introduce a high-level transformer concept which can apply algorithms to content and store the intermediate representation, such as splitting text into smaller chunks, extracting log mel features from audio files (as most speech models use log mel features), applying filters to images, etc. The intermediate/processed content will live in buffers - a logical storage abstraction that will trigger the extractors when data lands in them.

So it will look something like -
Content -> Transformers -> Buffer -> Extractors -> Index (continuously)
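
A toy sketch of that pipeline, with a text-splitting transformer writing into a buffer that triggers every bound extractor once per chunk (all names invented for illustration):

from typing import Callable, List


def split_text(text: str, max_len: int = 200) -> List[str]:
    # A simple transformer: chunk once, so downstream extractors don't
    # each re-split the same content.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


class Buffer:
    def __init__(self):
        self.extractors: List[Callable[[str], None]] = []

    def bind(self, extractor: Callable[[str], None]) -> None:
        self.extractors.append(extractor)

    def write(self, chunks: List[str]) -> None:
        # Data landing in the buffer triggers the bound extractors.
        for chunk in chunks:
            for extractor in self.extractors:
                extractor(chunk)


buffer = Buffer()
buffer.bind(lambda chunk: print("embedding extractor got:", chunk[:40]))
buffer.bind(lambda chunk: print("NER extractor got:", chunk[:40]))
buffer.write(split_text("Content -> Transformers -> Buffer -> Extractors -> Index"))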

Add NER extractor

Add an extractor that adds support for Named Entity Recognition on ingested documents. The user experience is that when documents are added to the service, Indexify asynchronously goes through them, finds the named entities, and stores them for later retrieval.

Some use cases -

  • A personal assistant agent converses with a user and stores messages from the user in memory, such as "my address is 1 Hacker Way, Menlo Park, CA".
  • At a later stage, when the user asks to deliver some goods to their address, the agent can retrieve the user's address from Indexify instead of asking for it again.

Implementation

  1. Write an extractor which receives text and produces a list of entities.
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    name: str
    value: str


class BertEntityExtractor:
    def __init__(self):
        # Load model - https://huggingface.co/dslim/bert-base-NER
        ...

    def extract(self, text: str) -> List[Entity]:
        pass

This would go in the src_py folder, where all the python sources of the server live.

  2. Write Rust bindings for the extractor
    Write an extractor like this - https://github.com/diptanu/indexify/blob/7920e3ab81717b5c39014eca026c1d6dd7866a71/src/extractors.rs#L20
    Load the Python Extractor module in Rust using PyO3, see example here - https://github.com/diptanu/indexify/blob/main/src/embeddings/sentence_transformers.rs#L9

  3. Write a unit test to show that things work.

  4. Once we have the NER extractor in Rust, we can integrate the extractor into the service to run asynchronously when new content is added. We can create a new issue for this, and figure out the UX once we have the model and extractors in place.

Attribute Extractor in the Server

A general class of extractors is going to read content and extract well-defined features from it. Examples -

  1. NER from text
  2. Intent of text
  3. Bounding boxes of images and object types
  4. Event detection in audio, such as gunshots or a person entering a scene
  5. Names of classes, functions, etc., in source code
    ... and so on

We would like these extractors to require very little additional work to hook up with the service. We could model such extractors as an AttributeExtractor in the server. The interface between the extractor and the service could be something like this -

pub struct Attribute {
    r#type: AttributeType,
    attribute: serde_json::Value, // this is a JSON value
    metadata: HashMap<String, serde_json::Value>,
}

pub struct AttributeExtractor {}

impl AttributeExtractor {
    pub fn new(module_name: &str) -> Result<AttributeExtractor> {
        // ... load the py module
    }

    // Stores attributes of content extracted by a python module into the datastore
    pub fn extract(&self, content: &str) -> Result<()> {
        let extraction_result = py_module.extract(content);
        self.repository.store(extraction_result)?;
        Ok(())
    }
}

The attribute here would be JSON, allowing any attribute extractor to produce its results in JSON form. We will have to do some additional work in the developer-facing API to unpack the JSON into proper objects such as NamedEntity, BoundingBox, AudioEvent, etc.

The python extractors could look something like this -

class ExtractionResult:
    extractor_type: ExtractorType
    payload: str  # json


class IntentExtractor:
    def __init__(self, model):
        self._model = model
        ...

    def extract(self, content: Any) -> ExtractionResult:
        intent = self._model.extract_intent(content)
        return ExtractionResult(extractor_type=AttributeType.Intent, payload=intent)

Allow users to mount a cache for models in extractors

The extractors currently download all of their models every time they start, which is wasteful.

Pass a --cache-dir to the indexify extractor command that is visible to the extractor Python lib, and allow extractors to use that directory if they wish to cache things outside the container.

Each extractor implementation can decide what to do with the cache, so we are not making decisions on their behalf beyond making the directory available.
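
A sketch of how an extractor might honour such a cache directory, assuming the flag is surfaced to the Python side via an environment variable (the variable name is invented):

import os
from pathlib import Path

# Hypothetical: --cache-dir is exposed to the extractor as INDEXIFY_CACHE_DIR.
cache_dir = Path(os.environ.get("INDEXIFY_CACHE_DIR", "/tmp/indexify-cache"))
cache_dir.mkdir(parents=True, exist_ok=True)

# The extractor can then point its model loader at the cache, e.g.
#   SentenceTransformer("all-MiniLM-L6-v2", cache_folder=str(cache_dir))
# so repeated container starts reuse the downloaded weights.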

Introduce Metrics-tracking / Telemetry

We have logging to stdout through tracing, which uses async tokio under the hood.
We want something similar that we can hook up to Prometheus or Grafana to show analyzed data.
For this we can use https://github.com/open-telemetry/opentelemetry-rust

Amongst others, the following items should be tracked (please edit this issue as more items come to mind), starting with common system-level information:

  • cpu utilization
  • api calls (per second)
  • how many extractors are running
  • how many executors are running
  • how often the scheduler is invoked
  • Counters on API requests
  • Latency of API requests
  • Latency and counters of DB calls
  • Latency and counters of extractor calls in the executor

Filter content to extract from by an extractor by sampling a percentage of all the content in a repository

Sometimes it's useful to extract information only from a subset of the repository, to run small-scale experiments or for code synthesis for extraction before the extracted code is run against all the documents in the repository.

ExtractorBinding should have a sampling-function-based filter which allows extracting from N% of all the documents. As new documents come in, the binding should trigger extraction so as to maintain the watermark percentage.
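
A deterministic sampling-function sketch (not the binding API itself): hash the content ID into 100 buckets and keep roughly N% of documents, so the same document always gets the same decision as new content streams in:

import hashlib


def should_extract(content_id: str, sample_percent: float) -> bool:
    # Deterministic: the same content ID always lands in the same bucket,
    # so re-evaluating the binding never flip-flops on a document.
    bucket = int(hashlib.sha256(content_id.encode()).hexdigest(), 16) % 100
    return bucket < sample_percent


# An ExtractorBinding with sampling=10 would only trigger extraction where
# should_extract(content_id, 10) is True, maintaining the watermark as new
# documents arrive.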

Pending work for extractors

  • Download an extractor if it's not available locally
  • Remove all logs other than the extractor output from stdout. Write logs to stderr, and if the user passes a --verbose flag, then also write them to stdout
  • Make indexify extractor set the python path automatically when starting locally

Create a Filter object

It's not very clear what filters and exclude mean here; here is a suggestion -

We use a Filter object to contain a list of filters - could be equality or un-equality filters. A FilterBuilder can help in building the filters.

class FilterBuilder:
    def eq(self, key, value) -> "FilterBuilder":
        ...

    def neq(self, key, value) -> "FilterBuilder":
        ...

    def build(self) -> Filter:
        ...


def bind_extractor(self, extractor_name: str, index_name: str, filters: Filter):
    ...

Example -

filter = FilterBuilder().eq("topic", "universe").neq("topic", "ocean").build()
bind_extractor("dpr", "text_embeddings", filter)

Originally posted by @diptanu in #98 (comment)
