tensorlakeai / indexify

A realtime indexing and structured extraction engine for unstructured data, for building Generative AI applications

Home Page: https://getindexify.ai

License: Apache License 2.0

Rust 89.18% Dockerfile 0.22% Shell 0.10% Makefile 0.74% Python 4.62% TypeScript 4.60% Ruby 0.07% HTML 0.27% CSS 0.19%
llm machine-learning retrieval

indexify's People

Contributors

ak-gautam, akira, bamdadd, braedennorris, burzinpatel, catsby, delip, diptanu, jackbackes, khshah6, kitrak-rev, lucasjacks0n, mohit-raghavendra, nirantk, oleksii-shyman, rakshith-ravi, rylandg, shabani1, stangirala, tushar5526, vidhyaarvind, vijay2win, yenicelik


indexify's Issues

Improve Search Results API

  • Provide the ID of the content whose chunk is being returned
  • Add a /repository/get_text API to retrieve the text for a given content ID
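
A rough client-side sketch of how these two changes might be used together; the search payload shape, the content_id field name, and the placement of the get_text endpoint are all assumptions, not the final API:

import requests

BASE = "http://localhost:8900"

# Hypothetical request/response shapes, not the final API.
results = requests.post(
    f"{BASE}/repositories/default/search",
    json={"index": "text_embeddings", "query": "best basketball player", "k": 3},
).json()

for chunk in results.get("results", []):
    content_id = chunk["content_id"]  # the ID of the content the chunk was cut from
    # Proposed get_text API: fetch the full text back for that content ID.
    full_text = requests.get(
        f"{BASE}/repositories/default/get_text",
        params={"content_id": content_id},
    ).json()
    print(content_id, str(full_text)[:80])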

Better error for wrong extractor name

We should return an error saying extractor name {} not found when a user sends an extraction request. Currently the coordinator throws a generic error that is not easy to understand.

Add error to return type of extractor

Extractors may (gracefully) fail. In these cases, they should probably return an error type.
If they don't return an error, we have two options:
(1) They return an empty object: in this case it looks as if the data has been processed, which would be false.
(2) The extractor panics: in this case another set of extractors will retry extracting the initial data, and we will be busy for no reason.
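
A minimal Python sketch of the proposed error return; ExtractedChunk and ExtractorError are made-up names for illustration, not indexify types:

from dataclasses import dataclass
from typing import List, Union


@dataclass
class ExtractedChunk:
    text: str


@dataclass
class ExtractorError:
    message: str


class MyExtractor:
    def extract(self, text: str) -> Union[List[ExtractedChunk], ExtractorError]:
        try:
            return [ExtractedChunk(text=line) for line in text.splitlines() if line]
        except Exception as e:
            # Graceful failure: an explicit error is distinguishable from an
            # empty (but "successful") result, and avoids a panic that would
            # cause other executors to retry the same content for no reason.
            return ExtractorError(message=str(e))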

Error on `docker-compose up`

When going through the Getting Started guide, there is an error at the docker-compose up step.

docker-compose up
...
...
embedding-extractor-1  | 2024-01-26T21:03:58.514255Z  INFO indexify::executor_server: registering executor with coordinator at address 172.21.0.2:8950
embedding-extractor-1  | 2024-01-26T21:03:58.555997Z ERROR indexify::executor_server: unable to register : unable to register executor: Status { code: Unimplemented, message: "grpc-status header missing, mapped from HTTP status code 404", metadata: MetadataMap { headers: {"content-length": "0", "date": "Fri, 26 Jan 2024 21:03:58 GMT"} }, source: None }
embedding-extractor-1  | 2024-01-26T21:04:03.513452Z  INFO indexify::executor_server: registering executor with coordinator at address 172.21.0.2:8950
embedding-extractor-1  | 2024-01-26T21:04:03.514135Z ERROR indexify::executor_server: unable to register : unable to register executor: Status { code: Unimplemented, message: "grpc-status header missing, mapped from HTTP status code 404", metadata: MetadataMap { headers: {"content-length": "0", "date": "Fri, 26 Jan 2024 21:04:03 GMT"} }, source: None }
embedding-extractor-1  | 2024-01-26T21:04:08.513652Z  INFO indexify::executor_server: registering executor with coordinator at address 172.21.0.2:8950
embedding-extractor-1  | 2024-01-26T21:04:08.514342Z ERROR indexify::executor_server: unable to register : unable to register executor: Status { code: Unimplemented, message: "grpc-status header missing, mapped from HTTP status code 404", metadata: MetadataMap { headers: {"content-length": "0", "date": "Fri, 26 Jan 2024 21:04:08 GMT"} }, source: None }
embedding-extractor-1  | 2024-01-26T21:04:13.514375Z  INFO indexify::executor_server: registering executor with coordinator at address 172.21.0.2:8950
embedding-extractor-1  | 2024-01-26T21:04:13.515315Z ERROR indexify::executor_server: unable to register : unable to register executor: Status { code: Unimplemented, message: "grpc-status header missing, mapped from HTTP status code 404", metadata: MetadataMap { headers: {"content-length": "0", "date": "Fri, 26 Jan 2024 21:04:13 GMT"} }, source: None }

Make `extractor package` more verbose

cargo run extractor package should be more verbose. Example error logs when installing dependencies:

Error: process "/bin/sh -c pip3 install --no-input tf2onnx git+https://github.com/huggingface/transformers.git accelerate pymupdf python-Levenshtein nltk install deepdoctection[pt] python-poppler pdf2image huggingface-hub ctransformers[cuda]" did not complete successfully: exit code: 1

and

starting indexify packager, version: git branch: david/invoice-extractor - sha:191ad85da46b537fffe5f3bc0883dd28aacd3cad
process "/bin/sh -c apt-get install -y  tesseract-ocr libpoppler-dev poppler-utils tesseract-ocr-eng libtesseract-eng" did not complete successfully: exit code: 100
Error: process "/bin/sh -c apt-get install -y  tesseract-ocr libpoppler-dev poppler-utils tesseract-ocr-eng libtesseract-eng" did not complete successfully: exit code: 100

So the user has to manually figure out which package is causing the issue.

Make testing extractors easier

New extractors can be tested by running the server and binding them to a repository. This is cumbersome because the feedback cycle is not fast.

Solution -
Create a sub-command under the indexify binary to run just an extractor

indexify executor run-extractor --python-module foobar.ExtractorClassName --content text-from-which-we-are-extracting --params `{"foo": "bar"}`

The above example allows a developer who is working on a new extractor in foobar.ExtractorClassName to load the extractor in their terminal, pass some text to it, and also pass the extractor's input parameters encoded as JSON.

In the future when we add support for images, videos, or any other blobs, we could enhance this to pass the pointer to files.
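
A minimal sketch of what such a sub-command could do under the hood, assuming extractors are addressable as module.ClassName and expose an extract(content, params) method (both assumptions for illustration):

import importlib
import json


def run_extractor(python_module: str, content: str, params_json: str = "{}"):
    # e.g. python_module = "foobar.ExtractorClassName"
    module_path, class_name = python_module.rsplit(".", 1)
    extractor_cls = getattr(importlib.import_module(module_path), class_name)
    extractor = extractor_cls()
    params = json.loads(params_json)
    # Assumes the extractor exposes extract(content, params); returning the
    # result to the terminal gives the developer fast feedback without a server.
    return extractor.extract(content, params)


print(run_extractor("foobar.ExtractorClassName", "text-from-which-we-are-extracting", '{"foo": "bar"}'))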

Error running extractors in developer guide

When developing locally, the Developing Indexify guide has some syntax for running extractors:

git clone https://github.com/tensorlakeai/indexify-extractors.git
indexify extractor  start --coordinator-addr localhost:8950 -c /path/to/indexify-extractors/embedding-extractors/minilm-l6/indexify.yaml

This worked (mostly) early yesterday (01/24/2024), but broke after #255. Specifically, the -c option was removed in #255, and now we fail a check in src/extractor/py_extractors.rs around lines 114-117:

let tokens: Vec<&str> = extractor_path.split(':').collect();
if tokens.len() != 2 {
    return Err(anyhow!("invalid extractor path: {}", extractor_path));
}

Before #255, when running an extractor with the -c flag, the resulting tokens variable looked like this (using println!("tokens: {:?}", tokens)):

tokens: ["minilm_l6_embedding", "MiniLML6Extractor"]

Now that -c is removed, the developer documentation needs to be updated to show how to run an extractor separately, unless that's no longer needed? I think it still is, though.

I tried various forms of --extractor-path but never seemed to get it right. Whatever I put there ends up in the tokens variable, but since I'm only specifying one token, the tokens.len() != 2 check always fails.

Bindings should return index names

Extractor bindings trigger extractors when content is added to Indexify. When a binding is added to the system, we should return the names of the indexes it will create, so developers can learn the names without having to list the indexes.
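
A hypothetical client-side view of the change; the import path, method signature, and return shape below are all assumptions, not the real SDK:

from indexify import IndexifyClient  # assumed import path

client = IndexifyClient()
repo = client.get_repository("default")
# Assumed behaviour: binding reports the indexes it will create, so the
# developer doesn't have to call a list-indexes API afterwards.
index_names = repo.bind_extractor("minilm-l6", index_name="embeddings")
print(index_names)  # e.g. ["embeddings"]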

Remove extractors_tests folder

Extractors are generally dockerized and not accessible through our local Python environment (except the occasional extractor kept for cargo test purposes). We can remove the extractors_tests directory as it is obsolete. We should figure out a new way to test extractors (docker run --rm <extractor> with simple test cases works well!).

Publish Indexify binaries for download

Indexify binaries need to be published as releases and made available for download. Write a GitHub workflow that we can trigger manually to build the binaries for now; in the future we can trigger it when we tag the main branch for new releases.

We need binaries for -

  1. Linux
  2. Windows
  3. Mac (M series only)

Support adding contents in the form of blobs

Indexify needs a mechanism to upload blobs such as PDFs, images, and video files, or to reference a link to a blob store where they are already stored.

The API on the server for uploading blobs -

POST
/repositories/{repository}/files

.. and when the blob is already on a blob store and Indexify can read from there -

POST
/repositories/{repository}/remote_files

The config of the server should accept a blob storage backend -

pub struct S3Config {
    region: String,
    bucket: String,
}

pub struct BlobStorageBackend {
    s3_backend: Option<S3Config>,
    file_system: Option<FSConfig>,
}

Under the hood we can create a table to track both files uploaded directly to Indexify and files we are going to download from remote storage services.

The executors should be passed the URLs of the blobs for extraction. The executor will download them directly and feed them into the extractors.
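
A hedged sketch of how a client might use the two proposed endpoints; the multipart field name and the JSON payload shape are guesses:

import requests

BASE = "http://localhost:8900"

# Direct upload of a local blob (field name "file" is a guess).
with open("invoice.pdf", "rb") as f:
    requests.post(f"{BASE}/repositories/default/files", files={"file": f})

# Registering a blob that already lives in a blob store Indexify can read from
# (payload shape is a guess).
requests.post(
    f"{BASE}/repositories/default/remote_files",
    json={"urls": ["s3://my-bucket/invoices/invoice.pdf"]},
)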

Update global state from content

Certain structured extraction tasks involve maintaining a global state as content is ingested, such as a summary of chat messages or email threads, or updated code snippets which are used by an agent or another extractor for structured extraction and planning. For example, Evaporate uses a two-step process for structured extraction: the second step prompts the LLM to synthesize code by giving it documents and the important attributes of those documents.

Make Indexify manage global state derived from ingested content, with sampling functions to filter content. Instead of only writing extracted information as rows or columns in the underlying storage, it should also have the concept of updating an artifact.

Extractor(Content, State) -> State^
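
An illustrative Python sketch of that shape, where the maintained artifact is a running chat summary rather than rows in a table; all names here are made up:

from dataclasses import dataclass


@dataclass
class State:
    summary: str = ""
    messages_seen: int = 0


def summarize(previous_summary: str, message: str) -> str:
    # Placeholder for an LLM call that folds the new message into the summary.
    return (previous_summary + " | " + message).strip(" |")


def chat_summary_extractor(content: str, state: State) -> State:
    # Returns the updated artifact instead of emitting rows/columns.
    return State(
        summary=summarize(state.summary, content),
        messages_seen=state.messages_seen + 1,
    )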

Improve the local extraction process

Currently, when a developer runs the extraction command for an extractor that has not been downloaded yet, they have to pull the container themselves; the indexify extract command doesn't do that. Secondly, the container can't be run twice unless the user manually deletes it with docker rm and invokes extract again.

  1. Indexify should download the container if it doesn't exist.
  2. Indexify should remove the old container before creating a new one.

Error in Getting Started docs

https://getindexify.io/getting_started/#data-repository

Following is the code for the python client.

repo = client.get_repository("default")
repo.add_documents([
    "Indexify is amazing!",
    "Indexify is a retrieval service for LLM agents!",
    "Kevin Durant is the best basketball player in the world."
])

Equivalent curl in the same doc

curl -v -X POST http://localhost:8900/repositories/default/add_texts \
-H "Content-Type: application/json" \
-d '{"documents": [ 
        {"text": "Indexify is amazing!"},
        {"text": "Indexify is a retrieval service for LLM agents!"}, 
        {"text": "Kevin Durant is the best basketball player in the world."}
    ]}'

The client library should convert the list of strings into equivalent dict objects with a text key, or the user should provide a list of dicts with text as the key.

repo = client.get_repository("default")
repo.add_documents([
    {"text": "Indexify is amazing!"},
    {"text": "Indexify is a retrieval service for LLM agents!"},
    {"text": "Kevin Durant is the best basketball player in the world."},
])

Extractors should expose accepted content type

Extractors should expose the content types they accept in the API. That would allow us to avoid dispatching content to them when the content's MIME type doesn't match what they can handle. And if the user sends a MIME type that is not accepted, we should reject the request at the API boundary.
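
A small sketch of what this could look like on the extractor side; the accepted_content_types attribute and the dispatch check are assumptions:

class PdfTextExtractor:
    # Declared up front so the server can read it from the extractor's schema.
    accepted_content_types = ["application/pdf"]

    def extract(self, data: bytes):
        ...


def can_dispatch(extractor, mime_type: str) -> bool:
    # The coordinator would consult this before creating a task for an
    # executor; the API server would reject unsupported MIME types outright.
    return mime_type in getattr(extractor, "accepted_content_types", [])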

Add Data Transformers to Data Repository

Content is extracted when a developer binds an extractor to a data repository. As new content lands, the extractors are applied to it and the derived information is written to indexes.

Extractors are responsible for chunking content, for example splitting the text in a document before it is embedded. Certain extractors, like NER and embedding extractors, could share the same chunked content since the context length of their underlying models is limited. Currently these extractors duplicate the text-splitting work.

The solution would be to introduce a high-level transformer concept which can apply algorithms to content and store the intermediate representation, such as splitting text into smaller chunks, extracting log mel features from audio files (as most speech models use log mel features), applying filters to images, etc. The intermediate/processed content will live in buffers - a logical storage abstraction that will trigger the extractors when data lands in them.

So it will look something like -
Content -> Transformers -> Buffer -> Extractors -> Index (continuously)
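
A toy sketch of that pipeline, with a text-splitting transformer writing into a buffer that triggers every bound extractor once per chunk (all names invented for illustration):

from typing import Callable, List


def split_text(text: str, max_len: int = 200) -> List[str]:
    # A simple transformer: chunk once, so downstream extractors don't
    # each re-split the same content.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


class Buffer:
    def __init__(self):
        self.extractors: List[Callable[[str], None]] = []

    def bind(self, extractor: Callable[[str], None]) -> None:
        self.extractors.append(extractor)

    def write(self, chunks: List[str]) -> None:
        # Data landing in the buffer triggers the bound extractors.
        for chunk in chunks:
            for extractor in self.extractors:
                extractor(chunk)


buffer = Buffer()
buffer.bind(lambda chunk: print("embedding extractor got:", chunk[:40]))
buffer.bind(lambda chunk: print("NER extractor got:", chunk[:40]))
buffer.write(split_text("Content -> Transformers -> Buffer -> Extractors -> Index"))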

Add NER extractor

Add an extractor that adds support for Named Entity Recognition on ingested documents. The user experience is that when documents are added to the service, Indexify asynchronously goes through them, finds the named entities, and stores them for later retrieval.

Some use cases -

  • A personal assistant agent converses with a user and stores messages from the user in memory, such as "my address is 1 Hacker Way, Menlo Park, CA".
  • At a later stage, when the user asks to deliver some goods to their address, the agent can retrieve the user's address from Indexify instead of asking for it again.

Implementation

  1. Write an extractor which receives text and produces a list of entities.
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    name: str
    value: str


class BertEntityExtractor:
    def __init__(self):
        # Load model - https://huggingface.co/dslim/bert-base-NER
        ...

    def extract(self, text: str) -> List[Entity]:
        pass

This would go in the src_py folder, where all the python sources of the server live.

  2. Write Rust bindings for the extractor
    Write an extractor like this - https://github.com/diptanu/indexify/blob/7920e3ab81717b5c39014eca026c1d6dd7866a71/src/extractors.rs#L20
    Load the Python Extractor module in Rust using PyO3, see example here - https://github.com/diptanu/indexify/blob/main/src/embeddings/sentence_transformers.rs#L9

  3. Write a unit test to show that things work.

  4. Once we have the NER extractor in Rust, we can integrate the extractor into the service to run asynchronously when new content is added. We can create a new issue for this, and figure out the UX once we have the model and extractors in place.

Attribute Extractor in the Server

A general class of extractors is going to read content and extract well-defined features from it. Examples -

  1. NER from text
  2. Intent of text
  3. Bounding boxes of images and object types
  4. Event detection in audio, such as gunshots or a person entering a scene
  5. Names of classes, functions, etc., in source code
    ... and so on

We would like these extractors to require very little additional work to hook up with the service. We could model such extractors as an AttributeExtractor in the server. The interface between the extractor and the service could be something like this -

pub struct Attribute {
    r#type: AttributeType,
    attribute: serde_json::Value, // this is a JSON value
    metadata: HashMap<String, serde_json::Value>,
}

pub struct AttributeExtractor {}

impl AttributeExtractor {
    pub fn new(module_name: &str) -> Result<AttributeExtractor> {
        // ... load the py module
    }

    // Stores attributes of content extracted by a python module into the datastore
    pub fn extract(&self, content: &str) -> Result<()> {
        let extraction_result = py_module.extract(content);
        self.repository.store(extraction_result)?;
        Ok(())
    }
}

The attribute here would be JSON, allowing any attribute extractor to produce its results in JSON form. We will have to do some additional work in the developer-facing API to unpack the JSON into proper objects such as NamedEntity, BoundingBox, AudioEvent, etc.

The python extractors could look something like this -

class ExtractionResult:
    extractor_type: ExtractorType
    payload: str  # json


class IntentExtractor:
    def __init__(self, model):
        self._model = model
        ...

    def extract(self, content: Any) -> ExtractionResult:
        intent = self._model.extract_intent(content)
        return ExtractionResult(extractor_type=AttributeType.Intent, payload=intent)

Allow users to mount a cache for models in extractors

The extractors currently download all of their models every time they start, which is wasteful.

Pass a --cache-dir to the indexify extractor command that is visible to the extractor Python lib, and allow extractors to use that directory if they wish to cache things outside the container.

Each extractor implementation can decide what to do with the cache, so we are not making decisions on their behalf beyond making the directory available.
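
A sketch of how an extractor might honour such a cache directory, assuming the flag is surfaced to the Python side via an environment variable (the variable name is invented):

import os
from pathlib import Path

# Hypothetical: --cache-dir is exposed to the extractor as INDEXIFY_CACHE_DIR.
cache_dir = Path(os.environ.get("INDEXIFY_CACHE_DIR", "/tmp/indexify-cache"))
cache_dir.mkdir(parents=True, exist_ok=True)

# The extractor can then point its model loader at the cache, e.g.
#   SentenceTransformer("all-MiniLM-L6-v2", cache_folder=str(cache_dir))
# so repeated container starts reuse the downloaded weights.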

Introduce Metrics-tracking / Telemetry

We have logging to stdout through tracing, which uses async tokio under the hood.
We want something similar that we can hook up to Prometheus or Grafana to show analyzed data.
For this we can use https://github.com/open-telemetry/opentelemetry-rust

Amongst others, the following items should be tracked (please edit this issue as more items come to mind), starting with common system-level information:

  • cpu utilization
  • api calls (per second)
  • how many extractors are running
  • how many executors are running
  • how often the scheduler is invoked
  • Counters on API requests
  • Latency of API requests
  • Latency and counters of DB calls
  • Latency and counters of extractor calls in the executor

Filter content to extract from by an extractor by sampling a percentage of all the content in a repository

Sometimes it's useful to extract information only from a subset of the repository, to run small-scale experiments or for code synthesis for extraction before the extracted code is run against all the documents in the repository.

ExtractorBinding should have a sampling-function-based filter which allows extracting from N% of all the documents. As new documents come in, the binding should trigger extraction so as to maintain the watermark percentage.
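
A deterministic sampling-function sketch (not the binding API itself): hash the content ID into 100 buckets and keep roughly N% of documents, so the same document always gets the same decision as new content streams in:

import hashlib


def should_extract(content_id: str, sample_percent: float) -> bool:
    # Deterministic: the same content ID always lands in the same bucket,
    # so re-evaluating the binding never flip-flops on a document.
    bucket = int(hashlib.sha256(content_id.encode()).hexdigest(), 16) % 100
    return bucket < sample_percent


# An ExtractorBinding with sampling=10 would only trigger extraction where
# should_extract(content_id, 10) is True, maintaining the watermark as new
# documents arrive.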

Pending work for extractors

  • Download an extractor if it's not available locally
  • Remove all logs other than the extractor output from stdout. Write logs to stderr, and if the user passes a --verbose flag, then also write them to stdout
  • Make indexify extractor set the python path automatically when starting locally

Create a Filter object

It's not very clear what filters and exclude mean here; here is a suggestion -

We use a Filter object to contain a list of filters - could be equality or un-equality filters. A FilterBuilder can help in building the filters.

class FilterBuilder:
    def eq(self, key, value) -> "FilterBuilder":
        ...

    def neq(self, key, value) -> "FilterBuilder":
        ...

    def build(self) -> Filter:
        ...


def bind_extractor(self, extractor_name: str, index_name: str, filters: Filter):
    ...

Example -

filter = FilterBuilder().eq("topic", "universe").neq("topic", "ocean").build()
bind_extractor("dpr", "text_embeddings", filter)

Originally posted by @diptanu in #98 (comment)
