
neuml / txtai


💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Home Page: https://neuml.github.io/txtai

License: Apache License 2.0

Languages: Python 99.39%, Makefile 0.24%, Dockerfile 0.37%

Topics: python, search, machine-learning, nlp, semantic-search, neural-search, vector-search, txtai, llm, vector-database

txtai's Introduction

All-in-one embeddings database


txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.


Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling, retrieval augmented generation and more.

Embeddings databases can stand on their own and/or serve as a powerful knowledge source for large language model (LLM) prompts.
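
As a quick illustration of vector search with SQL, here is a minimal sketch. It assumes content storage is enabled so that text and scores can be selected; everything else follows the standard Embeddings API.

import txtai

# Minimal sketch: vector search combined with SQL (assumes content storage is enabled)
embeddings = txtai.Embeddings(content=True)
embeddings.index(["US tops 5 million confirmed virus cases",
                  "Maine man wins $1M from $25 lottery ticket"])

# similar() runs a semantic query, the rest is standard SQL filtering
embeddings.search("SELECT id, text, score FROM txtai WHERE similar('feel good story') AND score >= 0.1")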

Summary of txtai features:

  • 🔎 Vector search with SQL, object storage, topic modeling, graph analysis and multimodal indexing
  • 📄 Create embeddings for text, documents, audio, images and video
  • 💡 Pipelines powered by language models that run LLM prompts, question-answering, labeling, transcription, translation, summarization and more
  • ↪️️ Workflows to join pipelines together and aggregate business logic. txtai processes can be simple microservices or multi-model workflows.
  • ⚙️ Build with Python or YAML. API bindings available for JavaScript, Java, Rust and Go.
  • ☁️ Run local or scale out with container orchestration

txtai is built with Python 3.8+, Hugging Face Transformers, Sentence Transformers and FastAPI. txtai is open-source under an Apache 2.0 license.

Interested in an easy and secure way to run hosted txtai applications? Then join the txtai.cloud preview to learn more.

Why txtai?


New vector databases, LLM frameworks and everything in between are sprouting up daily. Why build with txtai?

  • Up and running in minutes with pip or Docker
# Get started in a couple lines
import txtai

embeddings = txtai.Embeddings()
embeddings.index(["Correct", "Not what we hoped"])
embeddings.search("positive", 1)
#[(0, 0.29862046241760254)]
  • Built-in API makes it easy to develop applications using your programming language of choice
# app.yml
embeddings:
    path: sentence-transformers/all-MiniLM-L6-v2
CONFIG=app.yml uvicorn "txtai.api:app"
curl -X GET "http://localhost:8000/search?query=positive"
  • Run local - no need to ship data off to disparate remote services
  • Work with micromodels all the way up to large language models (LLMs)
  • Low footprint - install additional dependencies and scale up when needed
  • Learn by example - notebooks cover all available functionality

Use Cases

The following sections introduce common txtai use cases. A comprehensive set of over 50 example notebooks and applications is also available.

Semantic Search

Build semantic/similarity/vector/neural search applications.


Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.
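
For example, the sketch below indexes a couple of sentences and finds the best match by meaning rather than by keyword overlap:

import txtai

# Minimal semantic search sketch
embeddings = txtai.Embeddings()
embeddings.index(["Canada's last fully intact ice shelf has suddenly collapsed",
                  "Maine man wins $1M from $25 lottery ticket"])

# No keyword overlap with the indexed text - the match is by meaning
embeddings.search("climate change", 1)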


Get started with the following examples.

Notebook Description
Introducing txtai ▶️ Overview of the functionality provided by txtai Open In Colab
Similarity search with images Embed images and text into the same space for search Open In Colab
Build a QA database Question matching with semantic search Open In Colab
Semantic Graphs Explore topics, data connectivity and run network analysis Open In Colab

LLM Orchestration

LLM chains, retrieval augmented generation (RAG), chat with your data, pipelines and workflows that interface with large language models (LLMs).

Chains

Integrate LLM chains (known as workflows in txtai), multiple LLM agents and self-critique.
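
A minimal sketch of such a chain is shown below. It assumes the LLM pipeline and the Mistral 7B OpenOrca model path recommended later in this README; any supported generation backend can be swapped in.

from txtai.pipeline import LLM
from txtai.workflow import Task, Workflow

# Sketch: a two-step chain - build a prompt, then run it through an LLM (model path is an assumption)
llm = LLM("Open-Orca/Mistral-7B-OpenOrca")

prompt = Task(lambda inputs: [f"Summarize the following text in one sentence:\n{text}" for text in inputs])
generate = Task(lambda prompts: [llm(p) for p in prompts])

workflow = Workflow([prompt, generate])
list(workflow(["txtai is an all-in-one embeddings database for semantic search and LLM orchestration."]))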


See below to learn more.

Notebook Description
Prompt templates and task chains Build model prompts and connect tasks together with workflows Open In Colab
Integrate LLM frameworks Integrate llama.cpp, LiteLLM and custom generation frameworks Open In Colab
Build knowledge graphs with LLMs Build knowledge graphs with LLM-driven entity extraction Open In Colab

Retrieval augmented generation

Retrieval augmented generation (RAG) reduces the risk of LLM hallucinations by constraining the output with a knowledge base as context. RAG is commonly used to "chat with your data".


A novel feature of txtai is that it can provide both an answer and source citation.
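
The sketch below shows the basic retrieve-then-generate pattern. It assumes the LLM pipeline and a generic prompt format; the notebooks below cover the full approach, including citations.

import txtai
from txtai.pipeline import LLM

# Sketch: constrain the LLM with context retrieved from an embeddings index
embeddings = txtai.Embeddings(content=True)
embeddings.index(["txtai is an all-in-one embeddings database",
                  "Embeddings databases combine vector indexes, graph networks and relational databases"])

# Model path is an assumption - any supported LLM can be used
llm = LLM("Open-Orca/Mistral-7B-OpenOrca")

question = "What is txtai?"
context = "\n".join(row["text"] for row in embeddings.search(question, 3))

llm(f"Answer the question using only the context below.\nQuestion: {question}\nContext: {context}")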

Notebook Description
Build RAG pipelines with txtai Guide on retrieval augmented generation including how to create citations Open In Colab
Advanced RAG with graph path traversal Graph path traversal to collect complex sets of data for advanced RAG Open In Colab
Advanced RAG with guided generation Retrieval Augmented and Guided Generation Open In Colab

Language Model Workflows

Language model workflows, also known as semantic workflows, connect language models together to build intelligent applications.


While LLMs are powerful, there are plenty of smaller, more specialized models that work better and faster for specific tasks. This includes models for extractive question-answering, automatic summarization, text-to-speech, transcription and translation.
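
As a sketch, the workflow below joins two such specialized pipelines, summarization followed by translation (default models are loaded since no paths are given):

from txtai.pipeline import Summary, Translation
from txtai.workflow import Task, Workflow

# Sketch: summarize text, then translate the summary to French
summary = Summary()
translate = Translation()

workflow = Workflow([Task(summary), Task(lambda x: translate(x, "fr"))])
list(workflow(["txtai is an all-in-one embeddings database for semantic search, "
               "LLM orchestration and language model workflows."]))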

Notebook Description
Run pipeline workflows ▶️ Simple yet powerful constructs to efficiently process data Open In Colab
Building abstractive text summaries Run abstractive text summarization Open In Colab
Transcribe audio to text Convert audio files to text Open In Colab
Translate text between languages Streamline machine translation and language detection Open In Colab

Installation


The easiest way to install is via pip and PyPI.

pip install txtai

Python 3.8+ is supported. Using a Python virtual environment is recommended.

See the detailed install instructions for more information covering optional dependencies, environment specific prerequisites, installing from source, conda support and how to run with containers.

Model guide


See the table below for the current recommended models. These models all allow commercial use and offer a blend of speed and performance.

Component Model(s)
Embeddings all-MiniLM-L6-v2
Image Captions BLIP
Labels - Zero Shot BART-Large-MNLI
Labels - Fixed Fine-tune with training pipeline
Large Language Model (LLM) Mistral 7B OpenOrca
Summarization DistilBART
Text-to-Speech ESPnet JETS
Transcription Whisper
Translation OPUS Model Series

Models can be loaded as either a path from the Hugging Face Hub or a local directory. Model paths are optional, defaults are loaded when not specified. For tasks with no recommended model, txtai uses the default models as shown in the Hugging Face Tasks guide.
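
For example, an embeddings instance can name a model explicitly or rely on the default (the explicit path below is the recommended model from the table above):

import txtai

# Explicit model path from the Hugging Face Hub (or a local directory)
embeddings = txtai.Embeddings(path="sentence-transformers/all-MiniLM-L6-v2")

# No path specified - the default vector model is loaded
default = txtai.Embeddings()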

See the following links to learn more.

Powered by txtai

The following applications are powered by txtai.


Application Description
txtchat Retrieval Augmented Generation (RAG) powered search
paperai Semantic search and workflows for medical/scientific papers
codequestion Semantic search for developers
tldrstory Semantic search for headlines and story text

In addition to this list, there are also many other open-source projects, published research and closed proprietary/commercial projects that have built on txtai in production.

Further Reading


Documentation

Full documentation on txtai including configuration settings for embeddings, pipelines, workflows, API and a FAQ with common questions/issues is available.

Contributing

For those who would like to contribute to txtai, please see this guide.

txtai's Issues

txtai gives Illegal instruction (core dumped)

Hi, I have successfully installed txtai on my Linux server. When I run Python and do from txtai.embeddings import Embeddings, it terminates the Python process and gives an Illegal instruction (core dumped) error. Following are the details of my Linux server. Can anybody help me figure out the problem and fix it? Thanks.

Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
Number of cores: 40
RAM: 126GB
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]

Support string ids

Currently, faiss add_with_ids is used to store ids. This assumes that ids are 64-bit ints. With the addition of Annoy, which only supports sequential ids, and hnswlib, an id map should be created in the Embeddings instance.
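
A hypothetical sketch of such an id map (names are illustrative only):

# Hypothetical sketch: external ids mapped to sequential internal ANN ids
ids = ["doc-a", "doc-b", "doc-c"]

# Internal index positions are 0..n-1; keep a lookup back to the original ids
idmap = dict(enumerate(ids))

# Resolve ANN results of (internal id, score) back to the original string ids
results = [(idmap[uid], score) for uid, score in [(2, 0.83), (0, 0.47)]]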

GPT2 and T5 model

Hi,
I'm trying to use either the gpt2 or t5-3b models with txtai (as it is mentioned in one of the notebooks that any model listed on the Hugging Face Hub would work), but I receive several errors:

ERROR:transformers.tokenization_utils_base:Using pad_token, but it is not set yet.
Traceback (most recent call last):
File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 176, in encode
sentence_features = self.get_sentence_features(text, longest_seq)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 219, in get_sentence_features
return self._first_module().get_sentence_features(*features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 61, in get_sentence_features
return self.tokenizer.prepare_for_model(tokens, max_length=pad_seq_length, pad_to_max_length=True, return_tensors='pt', truncation=True)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2021, in prepare_for_model
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1529, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

or for T5:

File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
out_features = self.forward(features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 25, in forward
output_states = self.auto_model(**features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 965, in forward
decoder_outputs = self.decoder(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 684, in forward
raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")

What am I missing?

Thanks!

Add zero-shot based similarity pipeline

Currently, the embeddings model is used for calculating similarity. The Labels model, backed by Hugging Face's zero-shot classifier, has shown an impressive level of accuracy in labeling text.

Evaluate if this pipeline can be used to perform similarity comparisons. In this case, the input sections would be a list of documents and candidate labels would be the query.
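
A hedged sketch of the idea using the Hugging Face zero-shot classification pipeline directly, with the query passed as the candidate label:

from transformers import pipeline

# Sketch: score documents against a query with zero-shot classification
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

documents = ["Maine man wins $1M from $25 lottery ticket",
             "US tops 5 million confirmed virus cases"]

# Each document is scored against the query acting as the single candidate label
results = classifier(documents, ["feel good story"])
scores = [(result["sequence"], result["scores"][0]) for result in results]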

Split extractor embedding query and QA calls

Currently, extractor.py has a single method to run an embeddings query and then run extractive QA over those results.

This should be split into two separate methods, which allows external callers to just run the embeddings search without executing the QA extraction. This will allow downstream systems more flexibility in working with the extractor process.

Upgrade to Faiss 1.6.4

Faiss 1.6.4 supports Windows. This upgrade will help simplify the code base on all platforms.

Docker run timeout when downloading embeddings files

So, I'm very new to this but I have been able to put something together with txtai. Thanks for that! Very interesting stuff.

I built a new index and saved it, then worked up a simple flask app to load it and interface to it.

However, the initial embeddings line in there has to download the models from the net, and this causes a timeout when trying to fire up the resulting container in Docker. Is there a way to pre-download these files and then point to them rather than having it try to load them? It seems to do this on my local machine, but I cannot find where they are or how to reference them.

app.py

import os, json, requests
import urllib.request
from flask import Flask, abort, request, jsonify
from flask import Response
from flask_cors import CORS
from flask_restful import Resource, Api
from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
embeddings.load("index")

app = Flask(__name__)

cors = CORS(app, resources={r"/*": {"origins": "*"}})
app.config['CORS_HEADERS'] = 'Content-Type'

@app.route("/q1",  methods=['GET'])
def search():
    q = request.args.get('q')
    results = embeddings.search(q, 10)
    data = {}  # build json from the result set
    for r in results:
        uid = r[0]
        score = r[1]
        data[str(uid)] = score
        print('score:{} -- {}'.format(score, uid))
    j = json.dumps(data)
    return j

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))

Dockerfile

# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.7

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Install production dependencies.
RUN pip install Flask gunicorn
RUN pip install flask_restful
RUN pip install flask-cors
RUN pip install numpy
RUN pip install txtai

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
#CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 app:app
CMD exec gunicorn --bind :8080 --workers 1 --threads 8 app:app

Word Embedding question 2

Hi,

In the tutorial Part 3: Build an Embeddings index from a data source, at the part where the word vectors are built, I checked the txt file that was generated. I realized that the vector representations are for letters and not for words. Is this on purpose? Correct me if I'm wrong, but I think the list of words should be there with the 300-dimension vectors.

Kind regards,
mrJezy

Enhance API to fully support all txtai functionality

Currently, the API supports a subset of functionality in the embeddings module. Fully support embeddings and add methods for QA extraction and labeling.

This will enable network-based implementations of txtai in other programming languages.

Add batch search

First of all, thanks a lot for your work, it's really great!
In order to use the search on lots of documents, it would be great if we could:

  • Search batches of elements: embeddings.search(queries, top_k), which would return a list of top_k results for each query in queries
  • Use the fastest library to do so (I use the hnsw implementation of nmslib, which provides batch search and a great implementation of hnsw)

Keep up the good work!
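
For reference, the batchsearch method listed in the v2 API proposal later on this page follows this shape; a minimal sketch:

import txtai

embeddings = txtai.Embeddings()
embeddings.index(["US tops 5 million confirmed virus cases",
                  "Maine man wins $1M from $25 lottery ticket"])

# One list of (id, score) results per query
results = embeddings.batchsearch(["feel good story", "health"], 1)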

tokenizer.py

hi,

For English, the tokenizer works well, for example:
tokens = [token.strip(string.punctuation) for token in text.lower().split()]
and
return [token for token in tokens if re.match(r"^\d*[a-z][-.0-9:_a-z]{1,}$", token) and token not in Tokenizer.STOP_WORDS]
But for other languages, for example Chinese, it does the wrong thing. Could you please revise this for different languages, or give me some advice?

Thanks!

Add API tests

The functionality provided via the txtai API has increased significantly. Improve test coverage in that area.

Q&A Extractor Sample Code Not Functioning As Expected

I have run the following sample code for the extractor to perform Q&A on OS X, but the results are None:

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad")
sections = ["Giants hit 3 HRs to down Dodgers",
            "Giants 5 Dodgers 4 final",
            "Dodgers drop Game 2 against the Giants, 5-4",
            "Blue Jays 2 Red Sox 1 final",
            "Red Sox lost to the Blue Jays, 2-1",
            "Blue Jays at Red Sox is over. Score: 2-1",
            "Phillies win over the Braves, 5-0",
            "Phillies 5 Braves 0 final",
            "Final: Braves lose to the Phillies in the series opener, 5-0",
            "Final score: Flyers 4 Lightning 1",
            "Flyers 4 Lightning 1 final",
            "Flyers win 4-1"]
sections = [(uid, section) for uid, section in enumerate(sections)]
questions = ["What team won the game?", "What was score?"]
execute = lambda query: extractor(sections, [(question, query, question, False) for question in questions])
for query in ["Red Sox - Blue Jays", "Phillies - Braves", "Dodgers - Giants", "Flyers - Lightning"]:
    print("----", query, "----")
    for answer in execute(query):
        print(answer)
    print()

Results:

---- Red Sox - Blue Jays ----
('What team won the game?', None)
('What was score?', None)

---- Phillies - Braves ----
('What team won the game?', None)
('What was score?', None)

---- Dodgers - Giants ----
('What team won the game?', None)
('What was score?', None)

---- Flyers - Lightning ----
('What team won the game?', None)
('What was score?', None)

Make API definitions consistent

With the additional functionality added to txtai over the last few releases, the API definitions have gotten somewhat inconsistent. This issue will address that and make many of the return types across modules consistent. The changes are breaking in many cases and will require a bump of the major version of txtai to v2.

The current Python API definitions for v1 are:

Current Python API v1

  • embeddings.search("query text")
    return [(id, score)] sort score desc

  • embeddings.similarity("query text", documents)
    return [score]

  • embeddings.add(documents)
    embeddings.index()

  • embeddings.transform("text")
    return [float]

  • extractor(sections, queue)
    return [(name, answer)]

  • labels("text", ["label1"])
    return [(label, score)] sort score desc

The new method templates and return types are below.

New Python API v2

  • embeddings.search("query text")
    return [(id, score)] sort score desc

  • embeddings.batchsearch(["query text1", "query text2"])
    return [[(id, score)] sort score desc]

  • embeddings.add(documents)
    embeddings.index()

  • embeddings.similarity("query text", texts)
    return [(id, score)] sort score desc

  • embeddings.batchsimilarity(["query text1", "query text2"], texts)
    return [[(id, score)] sort score desc]

  • embeddings.transform("text")
    return [float]

  • embeddings.batchtransform(["text1", "text2"])
    return [[float]]

  • extractor(queue, texts)
    return [(name, answer)]

  • labels("text", ["label1"])
    return [(id, score)] sort score desc

  • labels(["text1", "text2"], ["label1"])
    return [[(id, score)] sort score desc]

  • similarity("query text", texts)
    return [(id, score)] sort score desc

  • batchsimilarity(["query text1", "query text2"], texts)
    return [[(id, score)] sort score desc]

External v2 API Calls

The API methods also need to have corresponding changes.

Given that JSON doesn't support tuples and some languages can't easily map arrays/tuples to objects, the return types are mapped from tuples to JSON objects. For example, instead of (id, score) the API will return {"id": value, "score": value}.
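
For illustration, the mapping described above looks like this:

# Python API returns tuples
results = [(0, 0.30), (4, 0.22)]

# Service API returns JSON objects instead: [{"id": 0, "score": 0.30}, ...]
mapped = [{"id": uid, "score": score} for uid, score in results]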

The API also has the following differences with the native Python API.

  • extract uses the Extractor pipeline which is a callable object in Python.
  • label/batchlabel uses the Labels pipeline which is a callable object in Python that supports both string and list input.
  • similarity/batchsimilarity uses the Similarity pipeline which is a callable object in Python that supports both string and list input.

The following list shows how the API methods will look through language binding libraries.

  • embeddings.search("query text")
    embeddings.batchsearch(["query text1", "query text2"])

  • embeddings.add(documents)
    embeddings.index()

  • embeddings.similarity("query text", texts)
    embeddings.batchsimilarity(["query text1", "query text2"], texts)

  • embeddings.transform("text")
    embeddings.batchtransform(["text1", "text2"])

  • extractor.extract(questions, texts)

  • labels.label("text", ["label1"])
    labels.batchlabel(["text1", "text2"], ["label1"])

  • similarity.similarity("query text", texts)
    similarity.batchsimilarity(["query text1", "query text2"], texts)

All methods should operate on batches

For search, similarity, extractive qa and labels, all methods should operate on batches for the best performance.

  • Extractive QA already supports this.
  • Search, similarity and labels should work with batches. Separate methods (if necessary) can be retained to provide existing functionality for a single record.

Can we have a CONTRIBUTING.md for a quick guide?

It can have sections such as:

  • To contribute a feature/fix
  • How can you help
  • Getting Started
  • Formatting and Linting rules
  • Connect on Slack etc. to get help or for issues

P.S. I am new to NeuML and found this an interesting initiative. Would love to contribute!

Remove build script workaround

The combination of pip 20.3.x, transformers 4.x and sentence-transformers 0.3.9 has caused build errors not related to txtai.

Remove the following lines from build.yml once they are resolved upstream.

# Remove hardcoding to 20.2.4
pip install -U pip==20.2.4 wheel coverage coverall

Integrate FastAPI for model serving

Add pattern for serving models. API should be driven by a configuration yaml file listing the model name and path.

API endpoints:

  • /$model/search?q=value
    • Runs a search against model for query q
  • /$model/similarity?t1=text&t2=text
    • Compares t1 and t2 for similarity using model
  • /$model/embedding?t=text
    • Builds a sentence embeddings vector for text stored in t
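
A minimal sketch of the proposed pattern with FastAPI is shown below; it is illustrative only (configuration loading is omitted and the shipped txtai API may differ):

from fastapi import FastAPI
import txtai

app = FastAPI()

# Sketch: one embeddings instance per configured model name (configuration loading omitted)
models = {"default": txtai.Embeddings()}

@app.get("/{model}/search")
def search(model: str, q: str):
    # Runs a search against the named model for query q
    return [{"id": uid, "score": score} for uid, score in models[model].search(q, 10)]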

Migrate from Travis CI to GitHub Actions

travis-ci.org builds are frequently backlogged more than an hour, which doesn't work for continuous development. Migrate to GitHub actions.

Once successful, revoke all third-party app access.

Fix build warnings with hnswlib

hnswlib requires numpy and is failing on the build of the wheel. The build process is falling back to the legacy build which will be removed in pip 21.x.


ValueError: Wrong shape for input_ids (shape torch.Size([6])) or attention_mask (shape torch.Size([6]))

This is the original code from Introducing txtai.py:

import numpy as np

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of best section that best matches query
    uid = np.argmax(embeddings.similarity(query, sections))

    print("%-20s %s" % (query, sections[uid]))

A problem occurs in Colab when executing the following line of code:
uid = np.argmax(embeddings.similarity(query, sections))

It shows: "ValueError: Wrong shape for input_ids (shape torch.Size([6])) or attention_mask (shape torch.Size([6]))"

The problem didn't occur a few days ago.

Using huggingface's datasets library as key part of the pipeline

I implemented a similar customizable indexing + retrieval pipeline. Hugging Face's datasets library (previously named nlp) allows one to vectorize and index huge datasets without having to worry about RAM. It uses Apache Arrow for memory-mapped, zero deserialization cost dataframes to do this. It also supports easy integration with FAISS and Elasticsearch.

Key advantages of making this the key part of the pipeline are as follows.

  1. An interface to a memory-mapped dataframe which is fast. This makes running a neural model on the data, saving it and caching it very easy.
  2. The datasets library already provides access to tons of datasets. Refer to https://huggingface.co/datasets/viewer/. It allows adding new datasets, making it a good choice for distributing datasets which users of txtai would rely upon.
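
A hedged sketch of the integration described above, using the datasets library's built-in FAISS support:

from datasets import Dataset
from sentence_transformers import SentenceTransformer

# Sketch: memory-mapped dataset with a FAISS index over sentence embeddings
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
data = Dataset.from_dict({"text": ["US tops 5 million confirmed virus cases",
                                   "Maine man wins $1M from $25 lottery ticket"]})

# Compute and store an embedding per row, then index the column with FAISS
data = data.map(lambda row: {"embeddings": model.encode(row["text"])})
data.add_faiss_index(column="embeddings")

# Nearest neighbor lookup through the FAISS index
scores, samples = data.get_nearest_examples("embeddings", model.encode("feel good story"), k=1)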

Refactor pipeline component

Currently, the pipeline component has logic to work around a performance issue in Transformers < 4.0. This performance issue has been resolved. Refactor this component to directly use the pipeline component.

Also consolidate labels methods into the pipeline module.

Update transformers requirement to latest

Currently, transformers is fixed to 3.0.2 due to an issue with sentence-transformers.

Once sentence-transformers v0.3.6 is released, which will support 3.1.x, update setup.py accordingly.

Word Embedding Question

I am trying to run Example 1 with word embeddings using the following code:

from txtai.embeddings import Embeddings
import numpy as np

embeddings = Embeddings({"path": "word-vectors/GoogleNews-vectors-negative300.magnitude",
                         "storevectors": True,
                         "scoring": "bm25",
                         "pca": 3,
                         "quantize": True})

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of best section that best matches query
    candidates = embeddings.similarity(query, sections)
    uid = np.argmax(candidates)

    print("%-20s %s" % (query, sections[uid]))

But I am getting the following error:

Traceback (most recent call last):
File "2.py", line 24, in
candidates = embeddings.similarity(query, sections)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 228, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 179, in transform
embedding = self.model.transform(document)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/vectors.py", line 155, in transform
weights = self.scoring.weights(document) if self.scoring else None
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 133, in weights
weights.append(self.score(freq, idf, length))
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 217, in score
k = self.k1 * ((1 - self.b) + self.b * length / self.avgdl)
ZeroDivisionError: float division by zero

Do I need to do something with the word embeddings before I can use them for similarity search?

Unable to install txtai, below is the error. I have installed C++ build tools

File "c:\users\gaussfer\anaconda3\lib\distutils\command\build_ext.py", line 340, in run
    self.build_extensions()
  File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 50, in build_extensions
    self._remove_flag('-Wstrict-prototypes')
  File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 58, in _remove_flag
    compiler = self.compiler.compiler
AttributeError: 'MSVCCompiler' object has no attribute 'compiler'
----------------------------------------

ERROR: Command errored out with exit status 1: 'c:\users\gaussfer\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"'; file='"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\Gaussfer\AppData\Local\Temp\pip-record-wnt_ovu1\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\gaussfer\anaconda3\Include\faiss-gpu' Check the logs for full command output.

Language and Locale

Dear committers,

I would like to use txtai for a search query purpose, but currently my content is not in English. Are there parameters that can be provided to improve the results based on language and locale?

Thanks,

Add batch indexing for transformer indices

Currently, sentence-transformer based indices are indexing documents one at a time. Calls to sentence-transformers should be batched together to decrease indexing time.
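
For reference, sentence-transformers already supports batched encoding, which is the behavior requested here (sketch):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode many documents in a single call; batch_size controls the encoding batch
vectors = model.encode(["document 1", "document 2", "document 3"], batch_size=32)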
