
neuml / txtai


💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Home Page: https://neuml.github.io/txtai

License: Apache License 2.0

Languages: Python 99.39%, Makefile 0.24%, Dockerfile 0.37%

Topics: python, search, machine-learning, nlp, semantic-search, neural-search, vector-search, txtai, llm, vector-database

txtai's Introduction

All-in-one embeddings database


txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.


Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling, retrieval augmented generation and more.

Embeddings databases can stand on their own and/or serve as a powerful knowledge source for large language model (LLM) prompts.
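
As a quick illustration of vector search with SQL, here is a minimal sketch. It assumes content storage is enabled so that text and scores can be selected; everything else follows the standard Embeddings API.

import txtai

# Minimal sketch: vector search combined with SQL (assumes content storage is enabled)
embeddings = txtai.Embeddings(content=True)
embeddings.index(["US tops 5 million confirmed virus cases",
                  "Maine man wins $1M from $25 lottery ticket"])

# similar() runs a semantic query, the rest is standard SQL filtering
embeddings.search("SELECT id, text, score FROM txtai WHERE similar('feel good story') AND score >= 0.1")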

Summary of txtai features:

  • 🔎 Vector search with SQL, object storage, topic modeling, graph analysis and multimodal indexing
  • 📄 Create embeddings for text, documents, audio, images and video
  • 💡 Pipelines powered by language models that run LLM prompts, question-answering, labeling, transcription, translation, summarization and more
  • ↪️️ Workflows to join pipelines together and aggregate business logic. txtai processes can be simple microservices or multi-model workflows.
  • ⚙️ Build with Python or YAML. API bindings available for JavaScript, Java, Rust and Go.
  • ☁️ Run local or scale out with container orchestration

txtai is built with Python 3.8+, Hugging Face Transformers, Sentence Transformers and FastAPI. txtai is open-source under an Apache 2.0 license.

Interested in an easy and secure way to run hosted txtai applications? Then join the txtai.cloud preview to learn more.

Why txtai?


New vector databases, LLM frameworks and everything in between are sprouting up daily. Why build with txtai?

  • Up and running in minutes with pip or Docker
# Get started in a couple lines
import txtai

embeddings = txtai.Embeddings()
embeddings.index(["Correct", "Not what we hoped"])
embeddings.search("positive", 1)
#[(0, 0.29862046241760254)]
  • Built-in API makes it easy to develop applications using your programming language of choice
# app.yml
embeddings:
    path: sentence-transformers/all-MiniLM-L6-v2
CONFIG=app.yml uvicorn "txtai.api:app"
curl -X GET "http://localhost:8000/search?query=positive"
  • Run local - no need to ship data off to disparate remote services
  • Work with micromodels all the way up to large language models (LLMs)
  • Low footprint - install additional dependencies and scale up when needed
  • Learn by example - notebooks cover all available functionality

Use Cases

The following sections introduce common txtai use cases. A comprehensive set of over 50 example notebooks and applications is also available.

Semantic Search

Build semantic/similarity/vector/neural search applications.


Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.
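
For example, the sketch below indexes a couple of sentences and finds the best match by meaning rather than by keyword overlap:

import txtai

# Minimal semantic search sketch
embeddings = txtai.Embeddings()
embeddings.index(["Canada's last fully intact ice shelf has suddenly collapsed",
                  "Maine man wins $1M from $25 lottery ticket"])

# No keyword overlap with the indexed text - the match is by meaning
embeddings.search("climate change", 1)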


Get started with the following examples.

Notebook Description
Introducing txtai ▶️ Overview of the functionality provided by txtai Open In Colab
Similarity search with images Embed images and text into the same space for search Open In Colab
Build a QA database Question matching with semantic search Open In Colab
Semantic Graphs Explore topics, data connectivity and run network analysis Open In Colab

LLM Orchestration

LLM chains, retrieval augmented generation (RAG), chat with your data, pipelines and workflows that interface with large language models (LLMs).

Chains

Integrate LLM chains (known as workflows in txtai), multiple LLM agents and self-critique.
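
A minimal sketch of such a chain is shown below. It assumes the LLM pipeline and the Mistral 7B OpenOrca model path recommended later in this README; any supported generation backend can be swapped in.

from txtai.pipeline import LLM
from txtai.workflow import Task, Workflow

# Sketch: a two-step chain - build a prompt, then run it through an LLM (model path is an assumption)
llm = LLM("Open-Orca/Mistral-7B-OpenOrca")

prompt = Task(lambda inputs: [f"Summarize the following text in one sentence:\n{text}" for text in inputs])
generate = Task(lambda prompts: [llm(p) for p in prompts])

workflow = Workflow([prompt, generate])
list(workflow(["txtai is an all-in-one embeddings database for semantic search and LLM orchestration."]))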


See below to learn more.

Notebook Description
Prompt templates and task chains Build model prompts and connect tasks together with workflows Open In Colab
Integrate LLM frameworks Integrate llama.cpp, LiteLLM and custom generation frameworks Open In Colab
Build knowledge graphs with LLMs Build knowledge graphs with LLM-driven entity extraction Open In Colab

Retrieval augmented generation

Retrieval augmented generation (RAG) reduces the risk of LLM hallucinations by constraining the output with a knowledge base as context. RAG is commonly used to "chat with your data".


A novel feature of txtai is that it can provide both an answer and source citation.
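
The sketch below shows the basic retrieve-then-generate pattern. It assumes the LLM pipeline and a generic prompt format; the notebooks below cover the full approach, including citations.

import txtai
from txtai.pipeline import LLM

# Sketch: constrain the LLM with context retrieved from an embeddings index
embeddings = txtai.Embeddings(content=True)
embeddings.index(["txtai is an all-in-one embeddings database",
                  "Embeddings databases combine vector indexes, graph networks and relational databases"])

# Model path is an assumption - any supported LLM can be used
llm = LLM("Open-Orca/Mistral-7B-OpenOrca")

question = "What is txtai?"
context = "\n".join(row["text"] for row in embeddings.search(question, 3))

llm(f"Answer the question using only the context below.\nQuestion: {question}\nContext: {context}")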

Notebook Description
Build RAG pipelines with txtai Guide on retrieval augmented generation including how to create citations Open In Colab
Advanced RAG with graph path traversal Graph path traversal to collect complex sets of data for advanced RAG Open In Colab
Advanced RAG with guided generation Retrieval Augmented and Guided Generation Open In Colab

Language Model Workflows

Language model workflows, also known as semantic workflows, connect language models together to build intelligent applications.


While LLMs are powerful, there are plenty of smaller, more specialized models that work better and faster for specific tasks. This includes models for extractive question-answering, automatic summarization, text-to-speech, transcription and translation.
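
As a sketch, the workflow below joins two such specialized pipelines, summarization followed by translation (default models are loaded since no paths are given):

from txtai.pipeline import Summary, Translation
from txtai.workflow import Task, Workflow

# Sketch: summarize text, then translate the summary to French
summary = Summary()
translate = Translation()

workflow = Workflow([Task(summary), Task(lambda x: translate(x, "fr"))])
list(workflow(["txtai is an all-in-one embeddings database for semantic search, "
               "LLM orchestration and language model workflows."]))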

Notebook Description
Run pipeline workflows ▶️ Simple yet powerful constructs to efficiently process data Open In Colab
Building abstractive text summaries Run abstractive text summarization Open In Colab
Transcribe audio to text Convert audio files to text Open In Colab
Translate text between languages Streamline machine translation and language detection Open In Colab

Installation


The easiest way to install is via pip and PyPI.

pip install txtai

Python 3.8+ is supported. Using a Python virtual environment is recommended.

See the detailed install instructions for more information covering optional dependencies, environment specific prerequisites, installing from source, conda support and how to run with containers.

Model guide


See the table below for the current recommended models. These models all allow commercial use and offer a blend of speed and performance.

Component Model(s)
Embeddings all-MiniLM-L6-v2
Image Captions BLIP
Labels - Zero Shot BART-Large-MNLI
Labels - Fixed Fine-tune with training pipeline
Large Language Model (LLM) Mistral 7B OpenOrca
Summarization DistilBART
Text-to-Speech ESPnet JETS
Transcription Whisper
Translation OPUS Model Series

Models can be loaded as either a path from the Hugging Face Hub or a local directory. Model paths are optional, defaults are loaded when not specified. For tasks with no recommended model, txtai uses the default models as shown in the Hugging Face Tasks guide.
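
For example, an embeddings instance can name a model explicitly or rely on the default (the explicit path below is the recommended model from the table above):

import txtai

# Explicit model path from the Hugging Face Hub (or a local directory)
embeddings = txtai.Embeddings(path="sentence-transformers/all-MiniLM-L6-v2")

# No path specified - the default vector model is loaded
default = txtai.Embeddings()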

See the following links to learn more.

Powered by txtai

The following applications are powered by txtai.


Application Description
txtchat Retrieval Augmented Generation (RAG) powered search
paperai Semantic search and workflows for medical/scientific papers
codequestion Semantic search for developers
tldrstory Semantic search for headlines and story text

In addition to this list, there are also many other open-source projects, published research and closed proprietary/commercial projects that have built on txtai in production.

Further Reading


Documentation

Full documentation on txtai including configuration settings for embeddings, pipelines, workflows, API and a FAQ with common questions/issues is available.

Contributing

For those who would like to contribute to txtai, please see this guide.

txtai's Issues

txtai gives Illegal instruction (core dumped)

Hi, I have successfully installed txtai on my Linux server. When I run Python and do from txtai.embeddings import Embeddings, it terminates the Python process and gives an Illegal instruction (core dumped) error. Following are the details of my Linux server. Can anybody help me figure out the problem and fix it? Thanks.

Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
Number of cores: 40
RAM: 126GB
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]

Support string ids

Currently, faiss add_with_ids is used to store ids. This assumes that ids are 64-bit ints. With the addition of Annoy, which only supports sequential ids, and hnswlib, an id map should be created in the Embeddings instance.
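
A hypothetical sketch of such an id map (names are illustrative only):

# Hypothetical sketch: external ids mapped to sequential internal ANN ids
ids = ["doc-a", "doc-b", "doc-c"]

# Internal index positions are 0..n-1; keep a lookup back to the original ids
idmap = dict(enumerate(ids))

# Resolve ANN results of (internal id, score) back to the original string ids
results = [(idmap[uid], score) for uid, score in [(2, 0.83), (0, 0.47)]]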

GPT2 and T5 model

Hi,
I'm trying to use either the gpt2 or t5-3b models with txtai (as it is mentioned in one of the notebooks that any model listed on the Hugging Face Hub would work), but I receive several errors:

ERROR:transformers.tokenization_utils_base:Using pad_token, but it is not set yet.
Traceback (most recent call last):
File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 176, in encode
sentence_features = self.get_sentence_features(text, longest_seq)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 219, in get_sentence_features
return self._first_module().get_sentence_features(*features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 61, in get_sentence_features
return self.tokenizer.prepare_for_model(tokens, max_length=pad_seq_length, pad_to_max_length=True, return_tensors='pt', truncation=True)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2021, in prepare_for_model
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1529, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

or for T5:

File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
out_features = self.forward(features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 25, in forward
output_states = self.auto_model(**features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 965, in forward
decoder_outputs = self.decoder(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 684, in forward
raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")

What am I missing?

Thanks!

Add zero-shot based similarity pipeline

Currently, the embeddings model is used for calculating similarity. The Labels model, backed by Hugging Face's zero-shot classifier, has shown an impressive level of accuracy in labeling text.

Evaluate if this pipeline can be used to perform similarity comparisons. In this case, the input sections would be a list of documents and candidate labels would be the query.
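
A hedged sketch of the idea using the Hugging Face zero-shot classification pipeline directly, with the query passed as the candidate label:

from transformers import pipeline

# Sketch: score documents against a query with zero-shot classification
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

documents = ["Maine man wins $1M from $25 lottery ticket",
             "US tops 5 million confirmed virus cases"]

# Each document is scored against the query acting as the single candidate label
results = classifier(documents, ["feel good story"])
scores = [(result["sequence"], result["scores"][0]) for result in results]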

Split extractor embedding query and QA calls

Currently, extractor.py has a single method to run an embeddings query and then run extractive QA over those results.

This should be split into two separate methods, which allows external callers to just run the embeddings search without executing the QA extraction. This will allow downstream systems more flexibility in working with the extractor process.

Upgrade to Faiss 1.6.4

Faiss 1.6.4 supports Windows. This upgrade will help simplify the code base on all platforms.

Docker run timeout when downloading embeddings files

So, I'm very new to this but I have been able to put something together with txtai. Thanks for that! Very interesting stuff.

I built a new index and saved it, then worked up a simple flask app to load it and interface to it.

However, the initial embeddings line in there has to download the models from the net, and this causes a timeout when trying to fire up the resulting container in Docker. Is there a way to pre-download these files and then point to them rather than having it try to load them? It seems to do this on my local machine, but I cannot find where they are or how to reference them.

app.py

import os, json, requests
import urllib.request
from flask import Flask, abort, request, jsonify
from flask import Response
from flask_cors import CORS
from flask_restful import Resource, Api
from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
embeddings.load("index")

app = Flask(__name__)

cors = CORS(app, resources={r"/*": {"origins": "*"}})
app.config['CORS_HEADERS'] = 'Content-Type'

@app.route("/q1",  methods=['GET'])
def search():
    q = request.args.get('q')
    results = embeddings.search(q, 10)
    data = {}  # build json from the result set
    for r in results:
        uid = r[0]
        score = r[1]
        data[str(uid)] = score
        print('score:{} -- {}'.format(score, uid))
    j = json.dumps(data)
    return j

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))

Dockerfile

# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.7

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Install production dependencies.
RUN pip install Flask gunicorn
RUN pip install flask_restful
RUN pip install flask-cors
RUN pip install numpy
RUN pip install txtai

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
#CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 app:app
CMD exec gunicorn --bind :8080 --workers 1 --threads 8 app:app

Word Embedding question 2

Hi,

In the tutorial Part 3: Build an Embeddings index from a data source, at the part where the word vectors are built, I checked the txt file that was generated. I realized that the vector representations are for letters and not for words. Is this on purpose? Correct me if I'm wrong, but I think the list of words should be there with the 300-dimension vectors.

Kind regards,
mrJezy

Enhance API to fully support all txtai functionality

Currently, the API supports a subset of functionality in the embeddings module. Fully support embeddings and add methods for QA extraction and labeling.

This will enable network-based implementations of txtai in other programming languages.

Add batch search

First of all, thanks a lot for your work, it's really great!
In order to use the search on lots of documents, it would be great if we could:

  • Search batches of elements: embeddings.search(queries, top_k), which would return a list of top_k results for each query in queries
  • Use the fastest library to do so (I use the hnsw implementation of nmslib, which provides batch search and a great implementation of hnsw)

Keep up the good work!
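
For reference, the batchsearch method listed in the v2 API proposal later on this page follows this shape; a minimal sketch:

import txtai

embeddings = txtai.Embeddings()
embeddings.index(["US tops 5 million confirmed virus cases",
                  "Maine man wins $1M from $25 lottery ticket"])

# One list of (id, score) results per query
results = embeddings.batchsearch(["feel good story", "health"], 1)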

tokenizer.py

hi,

For English, the tokenizer works well, for example:
tokens = [token.strip(string.punctuation) for token in text.lower().split()]
and
return [token for token in tokens if re.match(r"^\d*[a-z][-.0-9:_a-z]{1,}$", token) and token not in Tokenizer.STOP_WORDS]
But for other languages, for example Chinese, it does the wrong thing. Could you please revise this for different languages, or give me some advice?

Thanks!

Add API tests

The functionality provided via the txtai API has increased significantly. Improve test coverage in that area.

Q&A Extractor Sample Code Not Functioning As Expected

I have run the following sample code for the extractor to perform Q&A on OS X, but the results are None:

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad")
sections = ["Giants hit 3 HRs to down Dodgers",
            "Giants 5 Dodgers 4 final",
            "Dodgers drop Game 2 against the Giants, 5-4",
            "Blue Jays 2 Red Sox 1 final",
            "Red Sox lost to the Blue Jays, 2-1",
            "Blue Jays at Red Sox is over. Score: 2-1",
            "Phillies win over the Braves, 5-0",
            "Phillies 5 Braves 0 final",
            "Final: Braves lose to the Phillies in the series opener, 5-0",
            "Final score: Flyers 4 Lightning 1",
            "Flyers 4 Lightning 1 final",
            "Flyers win 4-1"]
sections = [(uid, section) for uid, section in enumerate(sections)]
questions = ["What team won the game?", "What was score?"]
execute = lambda query: extractor(sections, [(question, query, question, False) for question in questions])
for query in ["Red Sox - Blue Jays", "Phillies - Braves", "Dodgers - Giants", "Flyers - Lightning"]:
    print("----", query, "----")
    for answer in execute(query):
        print(answer)
    print()

Results:

---- Red Sox - Blue Jays ----
('What team won the game?', None)
('What was score?', None)

---- Phillies - Braves ----
('What team won the game?', None)
('What was score?', None)

---- Dodgers - Giants ----
('What team won the game?', None)
('What was score?', None)

---- Flyers - Lightning ----
('What team won the game?', None)
('What was score?', None)

Make API definitions consistent

With the additional functionality added to txtai over the last few releases, the API definitions have gotten somewhat inconsistent. This issue will address that and make many of the return types across modules consistent. The changes are breaking in many cases and will require a bump of the major version of txtai to v2.

The current Python API definitions for v1 are:

Current Python API v1

  • embeddings.search("query text")
    return [(id, score)] sort score desc

  • embeddings.similarity("query text", documents)
    return [score]

  • embeddings.add(documents)
    embeddings.index()

  • embeddings.transform("text")
    return [float]

  • extractor(sections, queue)
    return [(name, answer)]

  • labels("text", ["label1"])
    return [(label, score)] sort score desc

The new method templates and return types are below.

New Python API v2

  • embeddings.search("query text")
    return [(id, score)] sort score desc

  • embeddings.batchsearch(["query text1", "query text2"])
    return [[(id, score)] sort score desc]

  • embeddings.add(documents)
    embeddings.index()

  • embeddings.similarity("query text", texts)
    return [(id, score)] sort score desc

  • embeddings.batchsimilarity(["query text1", "query text2"], texts)
    return [[(id, score)] sort score desc]

  • embeddings.transform("text")
    return [float]

  • embeddings.batchtransform(["text1", "text2"])
    return [[float]]

  • extractor(queue, texts)
    return [(name, answer)]

  • labels("text", ["label1"])
    return [(id, score)] sort score desc

  • labels(["text1", "text2"], ["label1"])
    return [[(id, score)] sort score desc]

  • similarity("query text", texts)
    return [(id, score)] sort score desc

  • batchsimilarity(["query text1", "query text2"], texts)
    return [[(id, score)] sort score desc]

External v2 API Calls

The API methods also need to have corresponding changes.

Given that JSON doesn't support tuples and some languages can't easily map arrays/tuples to objects, the return types are mapped from tuples to JSON objects. For example, instead of (id, score) the API will return {"id": value, "score": value}.
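
For illustration, the mapping described above looks like this:

# Python API returns tuples
results = [(0, 0.30), (4, 0.22)]

# Service API returns JSON objects instead: [{"id": 0, "score": 0.30}, ...]
mapped = [{"id": uid, "score": score} for uid, score in results]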

The API also has the following differences with the native Python API.

  • extract uses the Extractor pipeline which is a callable object in Python.
  • label/batchlabel uses the Labels pipeline which is a callable object in Python that supports both string and list input.
  • similarity/batchsimilarity uses the Similarity pipeline which is a callable object in Python that supports both string and list input.

The following list shows how the API methods will look through language binding libraries.

  • embeddings.search("query text")
    embeddings.batchsearch(["query text1", "query text2"])

  • embeddings.add(documents)
    embeddings.index()

  • embeddings.similarity("query text", texts)
    embeddings.batchsimilarity(["query text1", "query text2"], texts)

  • embeddings.transform("text")
    embeddings.batchtransform(["text1", "text2"])

  • extractor.extract(questions, texts)

  • labels.label("text", ["label1"])
    labels.batchlabel(["text1", "text2"], ["label1"])

  • similarity.similarity("query text", texts)
    similarity.batchsimilarity(["query text1", "query text2"], texts)

All methods should operate on batches

For search, similarity, extractive qa and labels, all methods should operate on batches for the best performance.

  • Extractive QA already supports this.
  • Search, similarity and labels should work with batches. Separate methods (if necessary) can be retained to provide existing functionality for a single record.

Can we have a CONTRIBUTING.md for a quick guide?

It can have sections such as:

  • To contribute a feature/fix
  • How can you help
  • Getting Started
  • Formatting and Linting rules
  • Connect on Slack etc. to get help or for issues

P.S. I am new to NeuML and found this an interesting initiative. Would love to contribute!

Remove build script workaround

The combination of pip 20.3.x, transformers 4.x and sentence-transformers 0.3.9 has caused build errors not related to txtai.

Remove the following lines from build.yml once they are resolved upstream.

# Remove hardcoding to 20.2.4
pip install -U pip==20.2.4 wheel coverage coverall

Integrate FastAPI for model serving

Add pattern for serving models. API should be driven by a configuration yaml file listing the model name and path.

API endpoints:

  • /$model/search?q=value
    • Runs a search against model for query q
  • /$model/similarity?t1=text&t2=text
    • Compares t1 and t2 for similarity using model
  • /$model/embedding?t=text
    • Builds a sentence embeddings vector for text stored in t
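
A minimal sketch of the proposed pattern with FastAPI is shown below; it is illustrative only (configuration loading is omitted and the shipped txtai API may differ):

from fastapi import FastAPI
import txtai

app = FastAPI()

# Sketch: one embeddings instance per configured model name (configuration loading omitted)
models = {"default": txtai.Embeddings()}

@app.get("/{model}/search")
def search(model: str, q: str):
    # Runs a search against the named model for query q
    return [{"id": uid, "score": score} for uid, score in models[model].search(q, 10)]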

Migrate from Travis CI to GitHub Actions

travis-ci.org builds are frequently backlogged more than an hour, which doesn't work for continuous development. Migrate to GitHub actions.

Once successful, revoke all third-party app access.

Fix build warnings with hnswlib

hnswlib requires numpy and is failing on the build of the wheel. The build process is falling back to the legacy build which will be removed in pip 21.x.


ValueError: Wrong shape for input_ids (shape torch.Size([6])) or attention_mask (shape torch.Size([6]))

This is the original code from Introducing txtai.py:

import numpy as np

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of best section that best matches query
    uid = np.argmax(embeddings.similarity(query, sections))

    print("%-20s %s" % (query, sections[uid]))

A problem occurs in Colab when executing the following line of code:
uid = np.argmax(embeddings.similarity(query, sections))

It shows: "ValueError: Wrong shape for input_ids (shape torch.Size([6])) or attention_mask (shape torch.Size([6]))"

The problem didn't occur a few days ago.

Using huggingface's datasets library as key part of the pipeline

I implemented a similar customizable indexing + retrieval pipeline. Hugging Face's datasets library (previously named nlp) allows one to vectorize and index huge datasets without having to worry about RAM. It uses Apache Arrow for memory-mapped, zero deserialization cost dataframes to do this. It also supports easy integration with FAISS and Elasticsearch.

Key advantages of making this the key part of the pipeline are as follows.

  1. An interface to a memory-mapped dataframe which is fast. This makes running a neural model on the data, saving it and caching it very easy.
  2. The datasets library already provides access to tons of datasets. Refer to https://huggingface.co/datasets/viewer/. It allows adding new datasets, making it a good choice for distributing datasets which users of txtai would rely upon.
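
A hedged sketch of the integration described above, using the datasets library's built-in FAISS support:

from datasets import Dataset
from sentence_transformers import SentenceTransformer

# Sketch: memory-mapped dataset with a FAISS index over sentence embeddings
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
data = Dataset.from_dict({"text": ["US tops 5 million confirmed virus cases",
                                   "Maine man wins $1M from $25 lottery ticket"]})

# Compute and store an embedding per row, then index the column with FAISS
data = data.map(lambda row: {"embeddings": model.encode(row["text"])})
data.add_faiss_index(column="embeddings")

# Nearest neighbor lookup through the FAISS index
scores, samples = data.get_nearest_examples("embeddings", model.encode("feel good story"), k=1)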

Refactor pipeline component

Currently, the pipeline component has logic to work around a performance issue in Transformers < 4.0. This performance issue has been resolved. Refactor this component to directly use the pipeline component.

Also consolidate labels methods into the pipeline module.

Update transformers requirement to latest

Currently, transformers is fixed to 3.0.2 due to an issue with sentence-transformers.

Once sentence-transformers v0.3.6 is released, which will support 3.1.x, update setup.py accordingly.

Word Embedding Question

I am trying to run Example 1 with word embeddings using the following code:

from txtai.embeddings import Embeddings
import numpy as np

embeddings = Embeddings({"path": "word-vectors/GoogleNews-vectors-negative300.magnitude",
                         "storevectors": True,
                         "scoring": "bm25",
                         "pca": 3,
                         "quantize": True})

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of best section that best matches query
    candidates = embeddings.similarity(query, sections)
    uid = np.argmax(candidates)

    print("%-20s %s" % (query, sections[uid]))

But I am getting the following error:

Traceback (most recent call last):
File "2.py", line 24, in
candidates = embeddings.similarity(query, sections)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 228, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 179, in transform
embedding = self.model.transform(document)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/vectors.py", line 155, in transform
weights = self.scoring.weights(document) if self.scoring else None
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 133, in weights
weights.append(self.score(freq, idf, length))
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 217, in score
k = self.k1 * ((1 - self.b) + self.b * length / self.avgdl)
ZeroDivisionError: float division by zero

Do I need to do something with the word embeddings before I can use them for similarity search?

Unable to install txtai, below is the error. I have installed C++ build tools

File "c:\users\gaussfer\anaconda3\lib\distutils\command\build_ext.py", line 340, in run
    self.build_extensions()
  File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 50, in build_extensions
    self._remove_flag('-Wstrict-prototypes')
  File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 58, in _remove_flag
    compiler = self.compiler.compiler
AttributeError: 'MSVCCompiler' object has no attribute 'compiler'
----------------------------------------

ERROR: Command errored out with exit status 1: 'c:\users\gaussfer\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"'; file='"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\Gaussfer\AppData\Local\Temp\pip-record-wnt_ovu1\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\gaussfer\anaconda3\Include\faiss-gpu' Check the logs for full command output.

Language and Locale

Dear committers,

I would like to use txtai for a search query purpose, but currently my content is not in English. Are there parameters that can be provided to improve the results based on language and locale?

Thanks,

Add batch indexing for transformer indices

Currently, sentence-transformer based indices are indexing documents one at a time. Calls to sentence-transformers should be batched together to decrease indexing time.
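
For reference, sentence-transformers already supports batched encoding, which is the behavior requested here (sketch):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode many documents in a single call; batch_size controls the encoding batch
vectors = model.encode(["document 1", "document 2", "document 3"], batch_size=32)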
