deepset-ai / haystack

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Home Page: https://haystack.deepset.ai

License: Apache License 2.0

Python 95.06% HTML 4.65% HCL 0.03% Jinja 0.26%
nlp question-answering bert language-model pytorch semantic-search squad information-retrieval summarization transformers machine-learning ai python chatgpt gpt-3 large-language-models generative-ai

haystack's Introduction

[Banner: green Haystack logo with the text "Haystack, by deepset. Haystack 2.0 is live 🎉"]
[Badges: CI/CD (Tests, code style Black, types Mypy, Coverage Status), Docs (Website), Package (PyPI, Downloads, Python Version, Conda Version, GitHub License Compliance), Meta (Discord, Twitter Follow)]

Haystack is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform retrieval-augmented generation (RAG), document search, question answering or answer generation, Haystack can orchestrate state-of-the-art embedding models and LLMs into pipelines to build end-to-end NLP applications and solve your use case.

Installation

The simplest way to get Haystack is via pip:

pip install haystack-ai

Haystack supports multiple installation methods including Docker images. For a comprehensive guide please refer to the documentation.

Documentation

If you're new to the project, check out "What is Haystack?" then go through the "Get Started Guide" and build your first LLM application in a matter of minutes. Keep learning with the tutorials. For more advanced use cases, or just to get some inspiration, you can browse our Haystack recipes in the Cookbook.

At any point, refer to the documentation to learn more about Haystack, what it can do for you, and the technology behind it.

Features

Important

You are currently looking at the readme of Haystack 2.0. We are still maintaining Haystack 1.x to give everyone enough time to migrate to 2.0. Switch to Haystack 1.x here.

  • Technology agnostic: Allow users the flexibility to decide what vendor or technology they want and make it easy to switch out any component for another. Haystack allows you to use and compare models available from OpenAI, Cohere and Hugging Face, as well as your own local models or models hosted on Azure, Bedrock and SageMaker.
  • Explicit: Make it transparent how different moving parts can “talk” to each other so it's easier to fit your tech stack and use case.
  • Flexible: Haystack provides all tooling in one place: database access, file conversion, cleaning, splitting, training, eval, inference, and more. And whenever custom behavior is desirable, it's easy to create custom components.
  • Extensible: Provide a uniform and easy way for the community and third parties to build their own components and foster an open ecosystem around Haystack.

Some examples of what you can do with Haystack:

  • Build retrieval-augmented generation (RAG) by making use of one of the available vector databases and customizing your LLM interaction; the sky is the limit 🚀 (see the sketch after this list).
  • Perform Question Answering in natural language to find granular answers in your documents.
  • Perform semantic search and retrieve documents according to meaning.
  • Build applications that can make complex decisions to answer complex queries: for example, systems that resolve complex customer issues or search knowledge across many disconnected resources.
  • Scale to millions of docs using retrievers and production-scale components.
  • Use off-the-shelf models or fine-tune them to your data.
  • Use user feedback to evaluate, benchmark, and continuously improve your models.
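
For illustration, here is a minimal RAG sketch using Haystack 2.x with the in-memory document store. It assumes an OpenAI API key in the environment and that the component names below match your installed version (check the Get Started guide for the current API); treat it as a sketch, not the canonical example.

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Index a couple of toy documents in the in-memory store
document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="Haystack is an LLM orchestration framework by deepset."),
    Document(content="Pipelines connect components such as retrievers and generators."),
])

# Jinja template: put the retrieved documents and the question into the prompt
template = """Answer the question using the documents.
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:"""

# Wire retriever -> prompt builder -> generator
pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("generator", OpenAIGenerator())  # reads OPENAI_API_KEY from the env
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

question = "What is Haystack?"
result = pipeline.run({"retriever": {"query": question},
                       "prompt_builder": {"question": question}})
print(result["generator"]["replies"][0])

In principle, swapping InMemoryBM25Retriever for one of the vector-database retriever integrations leaves the rest of the pipeline unchanged.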

Tip

Are you looking for a managed solution that benefits from Haystack? deepset Cloud is our fully managed, end-to-end platform to integrate LLMs with your data, and it uses Haystack for the LLM pipeline architecture.

Telemetry

Haystack collects anonymous usage statistics of pipeline components. We receive an event every time these components are initialized. This way, we know which components are most relevant to our community.

Read more about telemetry in Haystack or how you can opt out in Haystack docs.
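
As a pointer, the opt-out is controlled through an environment variable; below is a minimal sketch, assuming the HAYSTACK_TELEMETRY_ENABLED variable described in the telemetry docs (set it before importing Haystack):

import os

# Assumption: this is the documented opt-out switch; check the telemetry docs.
os.environ["HAYSTACK_TELEMETRY_ENABLED"] = "False"

from haystack import Pipeline  # import Haystack only after the opt-out is in place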

🖖 Community

If you have a feature request or a bug report, feel free to open an issue on GitHub. We check these regularly, and you can expect a quick response. If you'd like to discuss a topic, or get more general advice on how to make Haystack work for your project, you can start a thread in GitHub Discussions or our Discord channel. We also check 𝕏 (Twitter) and Stack Overflow.

Contributing to Haystack

We are very open to the community's contributions - be it a quick fix of a typo, or a completely new feature! You don't need to be a Haystack expert to provide meaningful improvements. To learn how to get started, check out our Contributor Guidelines first.

There are several ways you can contribute to Haystack.

Who Uses Haystack

Here's a list of projects and companies using Haystack. Want to add yours? Open a PR, add it to the list and let the world know that you use Haystack!

haystack's People

Contributors

agnieszka-m, akkefa, anakin87, awinml, bilgeyucel, bogdankostic, brandenchan, danielbichuetti, davidsbatista, dependabot[bot], dfokina, julian-risch, lalitpagaria, masci, mayankjobanputra, michelbartels, oryx1729, piffpaffm, shademe, silvanocerza, sjrl, tanaysoni, tholor, timoeller, tstadel, tuanacelik, vblagoje, wochinge, zansara, zoltan-fedor


haystack's Issues

Can we use multiple indices on Elasticsearch?

To the author:
Now my problem is: can we run the inference engine without setting index=document? We have huge data, but the default size of the Elasticsearch index is 100MB. We have already checked the Elasticsearch part; there we can use the _search command outside the index. But can Haystack work if we don't set index=document? Thanks.
Jonathan

Feature Request - PDF Parser

I am wondering if it is possible to use PDF files instead of text files when writing to the DB? As far as I checked, there is no built-in capability in write_documents_to_db to handle it. Is it on the roadmap, or is it out of scope? Are there any suggestions for adding this feature to the pipeline (a robust Python library)?

dev_split in FARMReader.train() for very small datasets

Hello there,
Trying to train the model gave me the following error:
Traceback (most recent call last):
File "C:/Users/andre/Desktop/Haystack_QA/haystack/questionAnswer.py", line 18, in
reader.train(data_dir='training_data', train_filename="answers.json", use_gpu=True, n_epochs=1, save_dir='models/trained_model')
File "C:\Users\andre\Desktop\Haystack_QA\haystack\haystack\reader\farm.py", line 160, in train
data_silo = DataSilo(processor=processor, batch_size=batch_size, distributed=False)
File "C:\Users\andre\Desktop\Haystack_QA\haystack\venv\lib\site-packages\farm\data_handler\data_silo.py", line 105, in init
self._load_data()
File "C:\Users\andre\Desktop\Haystack_QA\haystack\venv\lib\site-packages\farm\data_handler\data_silo.py", line 223, in _load_data
self._create_dev_from_train()
File "C:\Users\andre\Desktop\Haystack_QA\haystack\venv\lib\site-packages\farm\data_handler\data_silo.py", line 365, in _create_dev_from_train
train_dataset, dev_dataset = self.random_split_ConcatDataset(self.data["train"], lengths=[n_train, n_dev])
File "C:\Users\andre\Desktop\Haystack_QA\haystack\venv\lib\site-packages\farm\data_handler\data_silo.py", line 397, in random_split_ConcatDataset
assert idx_dataset >= 1, "Dev_split ratio is too large, there is no data in train set. "
AssertionError: Dev_split ratio is too large, there is no data in train set. Please lower dev_split = 0.1

Setting dev_split to 0.1 did not help.
Setting the arguments dev_filename=None, dev_split=0.0 made it work.

Used dataset: 50 q&a pairs in total for 3 documents

Originally posted by @Krak91 in #95
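
A sketch of the reported workaround, mirroring the paths from the traceback above: passing dev_filename=None and dev_split=0.0 stops a dev set from being carved out of the tiny (50 Q&A pairs) training set.

from haystack.reader.farm import FARMReader

reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
reader.train(
    data_dir="training_data",
    train_filename="answers.json",
    dev_filename=None,   # reported fix: no separate dev file
    dev_split=0.0,       # reported fix: don't split a dev set off the small train set
    use_gpu=True,
    n_epochs=1,
    save_dir="models/trained_model",
)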

Wrong offsets returned by FARMReader

The offsets indicating the start and end of the answer are currently wrong. Seems that it was introduced by cab0932

{
    "question": "Who is the father of Arya?",
    "answers": [
        {
            "answer": "Eddard",
            "score": 17.154027938842773,
            "probability": 0.8951305652110051,
            "context": "ry warrior queen. She travels with her father, Eddard, to King's Landing when he is made Hand of the",
            "offset_start": 47,
            "offset_end": 47,
            "document_id": null
        },
....

Farm top_k reader error

When using finder.get_answers() with the TF-IDF retriever and FARM reader, the top_k_reader parameter has no effect on retrieving the top K answers: it always returns the top 3 answers, because FARMReader's top_k_per_candidate is set to 3 by default. So in order to get n answers, we have to initialize FARMReader's top_k_per_candidate with n (a hedged sketch follows).
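
A hedged sketch of that workaround; the parameter name follows the issue text, while older constructors expose it as n_candidates_per_passage instead, so check your installed FARMReader signature:

from haystack.reader.farm import FARMReader

n = 10  # number of answers you actually want back
reader = FARMReader(
    model_name_or_path="distilbert-base-uncased-distilled-squad",
    top_k_per_candidate=n,  # assumption: parameter name as cited in this issue
)
# then e.g.: finder.get_answers(question=..., top_k_retriever=10, top_k_reader=n)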

Utilize existing FAQs for QA

In many cases, users already have a list of frequently asked questions (+ answers). Using this in combination with an extractive QA model can be very powerful as:

  • FAQs are highly curated content (= very meaningful answers)
  • Matching query with existing questions is computationally cheap and fast
  • FAQs (should) capture the most common queries and can therefore significantly reduce the workload on the GPU

Let's implement a retriever that supports this via a similarity measure of embeddings. While the query embedding needs to be computed at inference time, we can precompute and index the ones from the FAQs. The model could e.g. be Sentence-BERT or a model trained on the Quora duplicate questions dataset.
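
A toy sketch of this idea, using sentence-transformers as a stand-in for the Sentence-BERT model mentioned above (the model name and FAQ entries are illustrative): embed the curated FAQ questions once, embed the incoming query at inference time, and return the answer of the closest question.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model choice

faq = [
    ("How is the virus spreading?", "Mainly through respiratory droplets."),
    ("What are the symptoms?", "Fever, cough and fatigue are common."),
]
# Precompute and index the FAQ question embeddings (cheap to match against later)
faq_embs = model.encode([q for q, _ in faq], normalize_embeddings=True)

def answer(query: str) -> str:
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = faq_embs @ query_emb  # cosine similarity on normalized vectors
    return faq[int(np.argmax(scores))][1]

print(answer("What symptoms should I look out for?"))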

Question: Can I see any evaluation metric?

I'm doing this:

# python:
reader.train(data_dir=train_data, train_filename="2020-02-23_answers.json", test_file_name='TEST_answers.json', use_gpu=False, n_epochs=1, dev_split=0.1)
# result:
Preprocessing Dataset: 12 Dicts [00:08,  1.46 Dicts/s]
Preprocessing Dataset: 4 Dicts [00:06,  1.56s/ Dicts]
Train epoch 1/1 (Cur. train loss: 0.0712): 100%|██████████| 87/87 [09:25<00:00,  6.50s/it]
Evaluating: 100%|██████████| 189/189 [06:49<00:00,  2.17s/it]
02/23/2020 16:40:48 - INFO - haystack.reader.farm -   Saving reader model to ../../saved_models/distilbert-base-uncased-distilled-squad

I want to see if my model improves. How can I do that?
Or should I switch to FARM or use FARM directly now?

Add additional meta fields for FAQ-QA APIs

I couldn't figure out how to get additional meta fields also returned in the results response.
Currently, the results object contains

question, 
answers.answer
answers.question
answer.probability
...

How do I get other fields from the same document? I think that should come under "meta" but I am not aware of how to specify that? Is there any environment variable?

Originally posted by @sonnylaskar in #101 (comment)

Passing EXCLUDE_META_DATA_FIELDS to api

Hi,

The FAQ tutorial works but I am facing issues while running the api.
I tried this:

$ export EMBEDDING_MODEL_PATH=deepset/sentence_bert
$ export DB_INDEX=document
$ export TEXT_FIELD_NAME=answer
$ export EMBEDDING_FIELD_NAME=question_emb
$ export EMBEDDING_DIM=768
$ gunicorn haystack.api.application:app -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker

On making a curl call, it throws the below error:

$ curl -X POST "http://localhost:8000/models/1/faq-qa" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"questions\":[\"How is the virus spreading?\"]}"
Inferencing Samples: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.07 Batches/s]
[2020-05-08 10:39:42 +0000] [8828] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/uvicorn/protocols/http/httptools_impl.py", line 385, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastapi/applications.py", line 151, in __call__
    await super().__call__(scope, receive, send)  # pragma: no cover
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/applications.py", line 102, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc from None
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/middleware/cors.py", line 84, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/middleware/cors.py", line 140, in simple_response
    await self.app(scope, receive, send)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/routing.py", line 550, in __call__
    await route.handle(scope, receive, send)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/routing.py", line 41, in app
    response = await func(request)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastapi/routing.py", line 197, in app
    dependant=dependant, values=values, is_coroutine=is_coroutine
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastapi/routing.py", line 150, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/starlette/concurrency.py", line 34, in run_in_threadpool
    return await loop.run_in_executor(None, func, *args)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/haystack/haystack/api/controller/search.py", line 147, in faq_qa
    question=question, top_k_retriever=request.top_k_retriever, filters=request.filters,
  File "/home/ubuntu/haystack/haystack/finder.py", line 83, in get_answers_via_similar_questions
    documents = self.retriever.retrieve(question, top_k=top_k_retriever, candidate_doc_ids=candidate_doc_ids)
  File "/home/ubuntu/haystack/haystack/retriever/elasticsearch.py", line 92, in retrieve
    documents = self.document_store.query_by_embedding(query_emb[0], top_k, candidate_doc_ids)
  File "/home/ubuntu/haystack/haystack/database/elasticsearch.py", line 186, in query_by_embedding
    documents = [self._convert_es_hit_to_document(hit, score_adjustment=-1) for hit in result]
  File "/home/ubuntu/haystack/haystack/database/elasticsearch.py", line 186, in <listcomp>
    documents = [self._convert_es_hit_to_document(hit, score_adjustment=-1) for hit in result]
  File "/home/ubuntu/haystack/haystack/database/elasticsearch.py", line 199, in _convert_es_hit_to_document
    query_score=hit["_score"] + score_adjustment if hit["_score"] else None,
  File "pydantic/main.py", line 338, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Document
meta -> question_emb
  str type expected (type=type_error.str)

I think it is expecting EXCLUDE_META_DATA_FIELDS but I am not sure how to set that.
I tried export EXCLUDE_META_DATA_FIELDS=["question_emb"] and export EXCLUDE_META_DATA_FIELDS=("question_emb") but no luck.

Finder.fit() missing one required positional argument: 'self'

Going through the tutorial I ran into another issue with Finder.fit()

TypeError                                 Traceback (most recent call last)
<ipython-input-90-3a22b26b8c93> in <module>
----> 1 finder = Finder(reader, retriever)

~/signal-graph/datasci/scripts/datasci_env/lib/python3.6/site-packages/haystack/__init__.py in __init__(self, reader, retriever)
     23     def __init__(self, reader, retriever):
     24         self.retriever = retriever
---> 25         self.retriever.fit()
     26 
     27         self.reader = reader

TypeError: fit() missing 1 required positional argument: 'self'

Installed using pip install farm-haystack

Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

Running into this error when trying to use an ml.p2.xlarge EC2 instance on AWS for the Finetuning tutorial. It seems the tensors aren't automatically moved to the GPU.

RuntimeError                              Traceback (most recent call last)
<ipython-input-32-0bf1915c8e43> in <module>
----> 1 prediction = finder.get_answers(question=question, top_k_retriever=2, top_k_reader=5)

~/SageMaker/haystack/haystack/__init__.py in get_answers(self, question, top_k_reader, top_k_retriever, filters)
     49                                       paragrahps=paragraphs,
     50                                       meta_data_paragraphs=meta_data,
---> 51                                       top_k=top_k_reader)
     52 
     53         return results

~/SageMaker/haystack/haystack/reader/farm.py in predict(self, question, paragrahps, meta_data_paragraphs, top_k, max_processes)
    186         # get answers from QA model (Top 5 per input paragraph)
    187         predictions = self.inferencer.inference_from_dicts(
--> 188             dicts=input_dicts, rest_api_schema=True, max_processes=max_processes
    189         )
    190 

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/farm/infer.py in inference_from_dicts(self, dicts, rest_api_schema, max_processes)

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/farm/infer.py in _get_predictions(self, dataset, tensor_names, baskets, rest_api_schema)

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/farm/modeling/adaptive_model.py in forward(self, **kwargs)

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/farm/modeling/language_model.py in forward(self, input_ids, padding_mask, **kwargs)

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds)
    481 
    482         if inputs_embeds is None:
--> 483             inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)
    484         tfmr_output = self.transformer(x=inputs_embeds, attn_mask=attention_mask, head_mask=head_mask)
    485         hidden_state = tfmr_output[0]

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids)
     87         position_ids = position_ids.unsqueeze(0).expand_as(input_ids)  # (bs, max_seq_length)
     88 
---> 89         word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, dim)
     90         position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)
     91 

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
--> 114             self.norm_type, self.scale_grad_by_freq, self.sparse)
    115 
    116     def extra_repr(self):

~/anaconda3/envs/JupyterSystemEnv/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1482         # remove once script supports set_grad_enabled
   1483         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1484     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1485 
   1486 

RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

ElasticsearchDocumentStore.get_all_documents() return only first 10

Seems like we only get the first 10 docs back here due to the default in ES.

def get_all_documents(self):
    search = Search(using=self.client, index=self.index)
    documents = []
    for hit in search:
        documents.append(
            {
                "id": hit.meta["id"],
                "name": hit["name"],
                "text": hit["text"],
            }
        )

This is counter-intuitive and causes trouble if people want to use the TfidfRetriever together with it (e.g. for debugging / eval)
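
One possible fix, assuming the elasticsearch-dsl Search API already used in that method: scan() pages through every hit via the scroll API instead of stopping at the default 10-result page.

from elasticsearch_dsl import Search

def get_all_documents(self):
    search = Search(using=self.client, index=self.index)
    documents = []
    for hit in search.scan():  # iterate past the first page of hits
        documents.append(
            {
                "id": hit.meta["id"],
                "name": hit["name"],
                "text": hit["text"],
            }
        )
    return documents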

Elasticsearch connection error

To whom it may concern:
I have tried to convert to Elasticsearch, but some problems occurred, like this:
02/10/2020 13:58:04 - WARNING - elasticsearch - HEAD http://172.21.37.55:9200:9200/document [status:N/A request:0.027s]
Traceback (most recent call last):
File "/home/sung/anaconda3/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/home/sung/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 57, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/home/sung/anaconda3/lib/python3.7/socket.py", line 748, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

Do you have any idea? Thank you.
                                                            Jonathan

Implement and benchmark ONNX Runtime for Inference

Went with onnx-ecosystem, which is a recent release (a couple of weeks old). Found nvidia-cuda-docker was not initializing, so I ditched Docker for now and ran this notebook from an environment with PyTorch v1.4.0, Transformers v2.5.1, and ONNX Runtime v1.2.1 (CPU & GPU).

With the variables (max_seq_length=128, etc.) as originally specified, here is the result on GPU:

ONNX Runtime inference time:  0.00811

PyTorch Inference time =  0.02096
***** Verifying correctness *****
PyTorch and ORT matching numbers: True
PyTorch and ORT matching numbers: True

With max_seq_length=384, everything else the same, here is the result:

ONNX Runtime inference time:  0.0193

PyTorch Inference time =  0.0273
***** Verifying correctness *****
PyTorch and ORT matching numbers: True
PyTorch and ORT matching numbers: True

Should have more time tomorrow to examine these preliminary results and to further iterate & characterize the differences, including the notebook's variables per_gpu_eval_batch_size and eval_batch_size, both originally set to 1.

At this point I am more familiar with ALBERT_xxlarge inference performance, so eventually I may try to implement it in ONNX for an inference comparison on a larger model.

Here's another max_seq_length=384 run:
Inference-PyTorch-Bert-Model-for-High-Performance-in-ONNX-Runtime_WIP - Jupyter Notebook.pdf

Originally posted by @ahotrod in #23 (comment)

Integrate Retriever from "Dense Passage Retrieval for Open-Domain Question Answering"

Interesting new paper proposing a dense retriever with two BERT encoders trained on question-passage-pairs from the most common QA datasets. I find their negative sampling approach (in-batch negatives) quite appealing.

In haystack, we already have basic support for "dense retrievers" using embeddings stored in Elasticsearch. However, we would need to make sure to use two different encoder models here (one for questions, one for passages) and provide an easy way of loading the models.
Furthermore, they use dot product instead of cosine similarity (a toy scoring sketch follows the links below).

Abstract: Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks.

Paper: https://arxiv.org/abs/2004.04906
Code: https://fburl.com/qa-dpr (not yet online)
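
A toy sketch of the scoring change described above: passages are embedded offline by the passage encoder, the question is embedded at query time, and passages are ranked by dot product rather than cosine similarity. Random vectors stand in for real encoder outputs here.

import numpy as np

rng = np.random.default_rng(0)
passage_embs = rng.normal(size=(1000, 768))   # offline: passage-encoder embeddings
question_emb = rng.normal(size=768)           # online: question-encoder embedding

scores = passage_embs @ question_emb          # dot-product similarity
top_k = np.argsort(-scores)[:20]              # indices of the 20 best passages
print(top_k[:5], scores[top_k[:5]])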

TypeError: load() missing 1 required positional argument: 'data_dir'

When trying to load a finetuned model I received the following error:

In [7]: reader = FARMReader(model_name_or_path="model_dir", use_gpu=True)                                                                                                      
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-44c12407fd4e> in <module>
----> 1 reader = FARMReader(model_name_or_path="model_dir", use_gpu=True)

~/SageMaker/haystack/haystack/reader/farm.py in __init__(self, model_name_or_path, context_window_size, no_ans_threshold, batch_size, use_gpu, n_candidates_per_passage)
     51 
     52 
---> 53         self.inferencer = Inferencer.load(model_name_or_path, batch_size=batch_size, gpu=use_gpu, task_type="question_answering")
     54         self.inferencer.model.prediction_heads[0].context_window_size = context_window_size
     55         self.inferencer.model.prediction_heads[0].no_ans_threshold = no_ans_threshold

~/SageMaker/FARM/farm/infer.py in load(cls, model_name_or_path, batch_size, gpu, task_type, return_class_probs, strict, max_seq_len)
    135                 processor = InferenceProcessor.load_from_dir(model_name_or_path)
    136             else:
--> 137                 processor = Processor.load_from_dir(model_name_or_path)
    138 
    139         # b) or from remote transformers model hub

~/SageMaker/FARM/farm/data_handler/processor.py in load_from_dir(cls, load_dir)
    188         del config["tokenizer"]
    189 
--> 190         processor = cls.load(tokenizer=tokenizer, processor_name=config["processor"], **config)
    191 
    192         for task_name, task in config["tasks"].items():

TypeError: load() missing 1 required positional argument: 'data_dir'

I was able to fix this by adding "data_dir": "" to the processor_config.json file in the finetuned model directory. The file now looks like this:

{"baskets": [], "data_dir": "", "dev_filename": "dev-v2.0.json", "dev_split": 0, "doc_stride": 128, "max_query_length": 64, "max_seq_len": 256, "ph_output_type": "per_token_squad", "proxies": null, "target": "classification", "tasks": {"question_answering": {"label_list": ["start_token", "end_token"], "metric": "squad", "label_tensor_name": "question_answering_label_ids", "label_name": "question_answering_label", "label_column_name": null, "task_type": null}}, "test_filename": null, "train_filename": "train-v2.0.json", "tokenizer": "DistilBertTokenizer", "processor": "SquadProcessor"}

Data from Solr

Hello,

First, thank you a lot for this repo. I have a quick question:
My data is indexed in Solr. Is it possible to use it instead of Elasticsearch?

I can't find documentation in this repo regarding data preparation. Any guidance ?

thank you a lot in advance.

Best

Can we search the answer if we did not select index in Elasticsearch?

To: Writer
In a previous version of inference.py, we could leave the index blank, like this:
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="")
But in the latest version, if we leave it blank, it shows:

Traceback (most recent call last):

File "inference2.py", line 42, in
document_store = ElasticsearchDocumentStore(host="172.18.0.3", username="", password="", index="")
File "/workspace/haystack/haystack/database/elasticsearch.py", line 49, in init
self.client.indices.create(index=index, ignore=400, body=custom_mapping)
File "/root/.local/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 92, in _wrapped
return func(*args, params=params, headers=headers, **kwargs)
File "/root/.local/lib/python3.6/site-packages/elasticsearch/client/indices.py", line 101, in create
raise ValueError("Empty value passed for a required argument 'index'.")
ValueError: Empty value passed for a required argument 'index'.

Can you please help, thanks.

Support Camembert Model in FARMReader

I'm trying to plug Camembert (french bert) from HuggingFace into Haystack.
https://huggingface.co/transformers/model_doc/camembert.html

Notebook :

from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.io import write_documents_to_db, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers
from transformers import (CamembertConfig, CamembertModel, CamembertTokenizer) 
from haystack.database.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
doc_dir = "./test_q_a/fr"
# doc_dir = "./test_q_a/en"
write_documents_to_db(document_store=document_store, document_dir=doc_dir, clean_func=clean_wiki_text, only_empty_db=True, split_paragraphs=True)
from haystack.retriever.elasticsearch import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

This is where I made changes to plug in my model.

bert = CamembertModel.from_pretrained("camembert-base")
bert_tok = CamembertTokenizer.from_pretrained("camembert-base")
reader = TransformersReader(model=bert, tokenizer=bert_tok, use_gpu=-1)
finder = Finder(reader, retriever)
prediction = finder.get_answers(question="Quels sont les symptômes ? ", top_k_retriever=10, top_k_reader=5)
print_answers(prediction, details="all")

It leads to :

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-d63ddbd66c4b> in <module>
----> 1 prediction = finder.get_answers(question="Quels sont les symptômes ? ", top_k_retriever=10, top_k_reader=5)
      2 print_answers(prediction, details="minimal")

~/git_clones/haystack/haystack/finder.py in get_answers(self, question, top_k_reader, top_k_retriever, filters)
     41         len_chars = sum([len(d.text) for d in documents])
     42         logger.info(f"Reader is looking for detailed answer in {len_chars} chars ...")
---> 43         results = self.reader.predict(question=question,
     44                                       documents=documents,
     45                                       top_k=top_k_reader)

~/git_clones/haystack/haystack/reader/transformers.py in predict(self, question, documents, top_k)
     75         for doc in documents:
     76             query = {"context": doc.text, "question": question}
---> 77             predictions = self.model(query, topk=self.n_best_per_passage)
     78             # assemble and format all answers
     79             for pred in predictions:

~/.local/lib/python3.8/site-packages/transformers/pipelines.py in __call__(self, *texts, **kwargs)
   1019                 # Mask padding and question
   1020                 start_, end_ = (
-> 1021                     start_ * np.abs(np.array(feature.p_mask) - 1),
   1022                     end_ * np.abs(np.array(feature.p_mask) - 1),
   1023                 )

ValueError: operands could not be broadcast together with shapes (384,768) (384,)

AttributeError: 'NoneType' object has no attribute 'predict'

To author:
When we do inference, the system shows the following error:

File "inference.py", line 149, in ask
top_k_reader=request.top_k_reader, filters=request.filters,
File "/workspace/haystack/haystack/finder.py", line 57, in get_answers
results = self.reader.predict(question=question,
AttributeError: 'NoneType' object has no attribute 'predict'

Do you have any idea? Thank you

Adding evaluation to Reader, Retriever & Finder

In order to compare different QA pipelines, evaluation is a core requirement.
While this is quite simple on the isolated Reader, we are more interested in the big picture of the whole pipeline & the interaction with the retriever.

Requirements:

  • Eval metrics of retriever should reflect production usage
  • It would be nice to use Squad style eval datasets (e.g. from the annotation tool or existing datasets)

Challenges:

  • We need indexing of docs at some point because we want to eval also objects like ElasticsearchRetriever
  • Comparison of metrics between runs (e.g. new data was added)
  • We don't want to impact the search experience in production (e.g. showing pure eval docs in search results)

Sketched Approach:

  • Indexing of eval data in DocumentStore in two indices (one for docs, one for annotations)
  • Subsequent eval on those

DocumentStore.add_eval_data(data=Squad-like-file, index=”eval_document”, embedding_retriever=None)

Required before any eval

  1. Adds documents from squad to index (default: new “eval_document”, also possible: existing “document”)
  • Reason for separate eval index: No duplicates, Stable metrics, Not adding pure eval documents for “production search”
  • Use additional "meta-data" fields if available in squad file, e.g.
    "paragraphs": [
    {
    "context": "text",
    "id": "abc"
    "company": "Texas"
    "year": 2017
    "filename": "some.pdf"

    "qas": [...]
  1. Adds annotations belonging to these docs to our existing “feedback” index
  • Add field “index_name” to each entry (“foreign key”)
  • Add field “origin” to differentiate between these eval labels and live user feedback (e.g. “gold_label” vs.”user_feedback”)

(3. Optional): copy some docs from production index “documents” as negative samples

Retriever.eval(label_index=”feedback”, doc_index=”eval_document”, label_origin=”gold_label” ... )

  • Retriever.eval works on the passed index (i.e. index = “feedback”)
  • Iterates over all questions in that index
  • Calls retrieve() and checks whether we retrieve the right passages via comparison of doc_ids
  • Metrics:
    -- Recall ("top-n-retrieval accuracy")
    -- Print non-ML info: For X out of X questions (x %), the answer was in the candidate passages selected by the retriever"
    -- (MAP)

Reader.eval_on_file()

  • Top_n recall, F1, EM
    => Re-use FARM functions for FARMReader?
    => Input: kind of eval dataset. SQuAD format?
    (=> Transformer functions for TransformerReader?)
    (Later extra: Eval with Cross validation of reader)

Reader.eval(index = “feedback”)

  • Iterates over questions in index
  • Get document text for question via document_id
  • Calls reader with question + document text
  • Evaluates metrics

Finder.eval()

  • Combination of Retriever+ Reader eval
  • Show metrics on whole pipeline

Tutorial 3: AttributeError bug

I am trying to run Tutorial3 and I got this error mensage:

05/04/2020 09:21:52 - INFO - haystack.finder - Reader is looking for detailed answer in 18509 chars ...
Traceback (most recent call last):
master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.py", line 103, in
prediction = finder.get_answers(question=str("what is this project?"), top_k_reader=5)
File "C:\ProgramData\Anaconda3\lib\site-packages\haystack\finder.py", line 43, in get_answers
results = self.reader.predict(question=question,
AttributeError: 'int' object has no attribute 'predict'

Using FAQ Style retrieval for Document Similarity Use Case

Hi,

I am wondering if the FAQ Style retrieval can also be used as a document similarity use case for Information retrieval.

Use Case:
Say one has lots of articles stored in elasticsearch and given an input, find the closest matching article.
This is a common doc similarity use case. So the user could follow the FAQ Style retrieval and create embedding from the article text field (lets consider that as question field used in FAQ tutorial) and an incoming input can be matched with this embedding to retrieve the best matches. That match can be considered as the most similar document given the input.
The embedding creation process might be slow because the articles can be very long.

Please comment on what you think about this use case and this implementation using Haystack, or suggest another approach.

Thanks

Making "question" field configurable via environment variables in FAQ retrieval

Hi,

The current implementation for FAQ style retrieval expects a "question" field to be present in the document. However, I think it makes sense to make that field configurable (via env variables).
There can be cases where data has already been indexed and using a separate process one can generate the embeddings and index them to the same elasticsearch document. Currently, the user will be forced to make a copy of the query field as "question" field as a workaround.

Please advice.

finder.get_answers() got an unexpected keyword argument 'top_key_reader'

Going through the tutorial after installing from source, I came across this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-96-0758a98d0898> in <module>
----> 1 prediction = finder.get_answers(question="Who is the buyer?", top_k_retriever=10, top_key_reader=3)

TypeError: get_answers() got an unexpected keyword argument 'top_key_reader'

BUG: Retriever format not matching what reader is expecting

Hello, first of all, thank you for this wonderful project.
So, I'm following this tutorial here, but I'm having the following error:

Traceback (most recent call last):
  File "haystack_test.py", line 46, in <module>
    prediction = finder.get_answers(question="What is the mass of the electron?", top_k_retriever=10, top_k_reader=5)
  File "/Users/admin/miniconda3/envs/legy/lib/python3.6/site-packages/haystack/finder.py", line 55, in get_answers
    len_chars = sum([len(d.text) for d in documents])
  File "/Users/admin/miniconda3/envs/legy/lib/python3.6/site-packages/haystack/finder.py", line 55, in <listcomp>
    len_chars = sum([len(d.text) for d in documents])
AttributeError: 'list' object has no attribute 'text'

That line is actually only for printing info. But if I skip that, below in the same file haystack/finder.py, in line 61 the reader is invoked with the same documents objects and throws an error:

Traceback (most recent call last):
  File "haystack_test.py", line 46, in <module>
    prediction = finder.get_answers(question="What is the mass of the electron?", top_k_retriever=10, top_k_reader=5)
  File "/Users/admin/miniconda3/envs/legy/lib/python3.6/site-packages/haystack/finder.py", line 61, in get_answers
    top_k=top_k_reader)
  File "/Users/admin/miniconda3/envs/legy/lib/python3.6/site-packages/haystack/reader/farm.py", line 221, in predict
    "text": doc.text,
AttributeError: 'list' object has no attribute 'text'

So my conclusion is the format the reader is expecting is different from what the retriever is returning.

NOTE: I installed the library from GitHub as recommended in the Readme file.

DeepAnnotation - upload documents

Thanks for sharing the annotation tool! It looks really great and I find it very useful!

Unfortunately the upload of the files doesn't seem to work properly anymore. It's not possible to upload more than one document (or drag and drop). No matter how many you try to enter, only one uploaded document will be displayed on the page.

self.inferencer = Inferencer.load(model_name_or_path, batch_size=batch_size, gpu=use_gpu, task_type="question_answering") got an unexpected keyword argument 'task_type'

While attempting to work through the Finetuning tutorial, I came across this issue. I'm using git clone to install. I also uninstalled farm, and reinstalled farm-haystack using git clone.

In [3]: reader = FARMReader(model_name_or_path=model_dir, use_gpu=False)                                                                       
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-340e37bbe9ff> in <module>
----> 1 reader = FARMReader(model_name_or_path=model_dir, use_gpu=False)

~/Desktop/haystack/haystack/reader/farm.py in __init__(self, model_name_or_path, context_window_size, no_ans_threshold, batch_size, use_gpu, n_candidates_per_passage)
     51 
     52 
---> 53         self.inferencer = Inferencer.load(model_name_or_path, batch_size=batch_size, gpu=use_gpu, task_type="question_answering")
     54         self.inferencer.model.prediction_heads[0].context_window_size = context_window_size
     55         self.inferencer.model.prediction_heads[0].no_ans_threshold = no_ans_threshold

TypeError: load() got an unexpected keyword argument 'task_type'

Update tutorial 1 colab

The parameter names passed to write_documents_to_db and TfidfRetriever in the Colab notebook of Tutorial 1 need to be updated from datastore=datastore to document_store=document_store, because they currently produce a TypeError.

The variable datastore = SQLDocumentStore(url="sqlite:///qa.db") should also be updated to document_store = SQLDocumentStore(url="sqlite:///qa.db") for naming consistency.
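
A hedged sketch of the corrected calls: only the datastore= to document_store= rename comes from this report, while the import paths and the document directory are assumptions based on the 0.x module layout seen elsewhere in these issues.

from haystack.database.sql import SQLDocumentStore        # assumed path
from haystack.indexing.io import write_documents_to_db    # as used in other issues here
from haystack.retriever.tfidf import TfidfRetriever       # assumed path

document_store = SQLDocumentStore(url="sqlite:///qa.db")
write_documents_to_db(document_store=document_store, document_dir="data/docs")  # example dir
retriever = TfidfRetriever(document_store=document_store)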

Loading of SentenceTransfomer models into EmbeddingRetriever

Not sure how to load a custom pre-trained embedding model when instantiating the EmbeddingRetriever. In particular, I am trying to load "bert-large-nli-stsb-mean-tokens" from SentenceTransformer, which is NOT available from the model hub.

So after downloading it to a local server and passing the path of the zipped model files to EmbeddingRetriever, it complains that "...zip is the correct path to a directory containing a 'config.json' file".

For reference, in the model folder there is a modules.json but not a config.json. Renaming the file to config.json won't work, so what is the expected content of config.json? In any case, what is the right way to load a remote pre-trained model like this one?

Thanks.

Built-in API returns AttributeError when Reader not loaded

Thanks a lot for this nice framework.

It runs perfectly in Jupyter or plain Python, but I encountered this bug when trying to use the native API.

Launching
gunicorn haystack.api.application:app -b 0.0.0.0:8080 -k uvicorn.workers.UvicornWorker
Port 8080 because 80 was already taken

Then
curl --request POST --url 'http://127.0.0.1:8080/models/1/doc-qa' --data '{"questions": ["Who is the father of George Orwell?"]}'

And it returns

[2020-05-04 14:03:54 -0400] [8037] [INFO] Starting gunicorn 20.0.4
[2020-05-04 14:03:54 -0400] [8037] [INFO] Listening at: http://0.0.0.0:8080 (8037)
[2020-05-04 14:03:54 -0400] [8037] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2020-05-04 14:03:54 -0400] [8039] [INFO] Booting worker with pid: 8039
05/04/2020 14:03:56 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:400 request:0.004s]
05/04/2020 14:03:56 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:400 request:0.004s]
05/04/2020 14:03:56 - INFO - haystack.api.application -   Open http://127.0.0.1:8000/docs to see Swagger API Documentation.
05/04/2020 14:03:56 - INFO - haystack.api.application -   
Or just try it out directly: curl --request POST --url 'http://127.0.0.1:8000/models/1/doc-qa' --data '{"questions": ["What is the capital of Germany?"]}'

[2020-05-04 14:03:56 -0400] [8039] [INFO] Started server process [8039]
[2020-05-04 14:03:56 -0400] [8039] [INFO] Waiting for application startup.
[2020-05-04 14:03:56 -0400] [8039] [INFO] Application startup complete.
05/04/2020 14:04:01 - INFO - haystack.retriever.elasticsearch -   Got 3 candidates from retriever
05/04/2020 14:04:01 - INFO - haystack.finder -   Reader is looking for detailed answer in 167899 chars ...
[2020-05-04 14:04:01 -0400] [8039] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/home/pedro/.local/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 385, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/home/pedro/.local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/home/pedro/.local/lib/python3.8/site-packages/fastapi/applications.py", line 149, in __call__
    await super().__call__(scope, receive, send)
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/applications.py", line 102, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc from None
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/middleware/cors.py", line 76, in __call__
    await self.app(scope, receive, send)
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/routing.py", line 550, in __call__
    await route.handle(scope, receive, send)
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/routing.py", line 41, in app
    response = await func(request)
  File "/home/pedro/.local/lib/python3.8/site-packages/fastapi/routing.py", line 196, in app
    raw_response = await run_endpoint_function(
  File "/home/pedro/.local/lib/python3.8/site-packages/fastapi/routing.py", line 150, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/pedro/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 34, in run_in_threadpool
    return await loop.run_in_executor(None, func, *args)
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/pedro/git_clones/haystack/haystack/api/controller/search.py", line 117, in doc_qa
    result = finder.get_answers(
  File "/home/pedro/git_clones/haystack/haystack/finder.py", line 43, in get_answers
    results = self.reader.predict(question=question,
AttributeError: 'NoneType' object has no attribute 'predict'

Purpose of name field

Hi,

I just got started with haystack.
I was wondering what is the purpose of the "name" field.

meta_data["name"] = meta_data.pop(self.name_field)

Please suggest how to proceed.
If needed, can we also add that field as a variable in config.py?

I have some data already indexed in Elasticsearch and I can run the Docker container for the API as below:

docker run --name haystack-api --rm --net=host -e READER_MODEL_PATH=distilbert-base-uncased-distilled-squad -e DB_INDEX=myindex -e TEXT_FIELD_NAME=description -e SEARCH_FIELD_NAME=description -e DB_HOST=localhost -d deepset/haystack-cpu:0.2.0
But since my data doesn't have a "name" field, it throws the below error:

$ curl --request POST --url 'http://127.0.0.1:8000/models/1/doc-qa' --data '{"questions": ["some random question"]}'

File "/usr/local/lib/python3.7/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.7/site-packages/starlette/middleware/cors.py", line 76, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 550, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/site-packages/starlette/routing.py", line 41, in app
    response = await func(request)
  File "/usr/local/lib/python3.7/site-packages/fastapi/routing.py", line 197, in app
    dependant=dependant, values=values, is_coroutine=is_coroutine
  File "/usr/local/lib/python3.7/site-packages/fastapi/routing.py", line 150, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/usr/local/lib/python3.7/site-packages/starlette/concurrency.py", line 34, in run_in_threadpool
    return await loop.run_in_executor(None, func, *args)
  File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/user/haystack/api/controller/search.py", line 121, in doc_qa
    filters=request.filters,
  File "/home/user/haystack/finder.py", line 33, in get_answers
    documents = self.retriever.retrieve(question, filters=filters, top_k=top_k_retriever)
  File "/home/user/haystack/retriever/elasticsearch.py", line 45, in retrieve
    documents = self.document_store.query(query, filters, top_k, self.custom_query)
  File "/home/user/haystack/database/elasticsearch.py", line 150, in query
    documents = [self._convert_es_hit_to_document(hit) for hit in result]
  File "/home/user/haystack/database/elasticsearch.py", line 150, in <listcomp>
    documents = [self._convert_es_hit_to_document(hit) for hit in result]
  File "/home/user/haystack/database/elasticsearch.py", line 192, in _convert_es_hit_to_document
    meta_data["name"] = meta_data.pop(self.name_field)

Adopting Tutorial 1 for FAQ-style QA

My current understanding for experimenting with FAQ-style QA (in Colab) with Haystack is as follows:

  1. When instantiating an ElasticsearchDocumentStore, text_field should be set to "question", and an embedding_field should be passed as param.
  2. Create an EmbeddingRetriever using stock BERT model (e.g. "sentence_bert")
  3. reader = None
  4. Make predictions by calling get_answers_via_similar_questions()

Other than the above, no other changes were made to Tutorial 1. When running the prediction, the following error came up (screenshot omitted).

Is the above approach correct at all in handling FAQ-style QA? If so what has caused the error?

Thanks for the help!

Cannot load FarmReader model from local path

When loading a saved model the "data_dir" param is missing.

Error

TypeError: load() missing 1 required positional argument: 'data_dir'

Minimal code to reproduce

reader = FARMReader(model_name_or_path="twmkn9/albert-base-v2-squad2", use_gpu=False)
reader.save("data/albert-temp") 
reader2 = FARMReader(model_name_or_path="data/albert-temp", use_gpu=False)

ImportError: cannot import name 'pipeline' from 'transformers'

When trying to import TransformersReader, I am getting the following error

from haystack.reader.transformers import TransformersReader

Error :

>>> from haystack.reader.transformers import TransformersReader
ImportError: cannot import name 'pipeline' from 'transformers' (C:\ANACON~1\lib\site-packages\transformers\__init__.py)

Please note I am in the directory haystack-master. I am using windows 10.

Please help me in solving this.

Colab demo: finder.get_answers throws error

I was trying out the Basic QA Colab tutorial and the call to finder.get_answers throws the following error:

----> 1 prediction = finder.get_answers(question="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)

1 frames
/usr/local/lib/python3.6/dist-packages/haystack/finder.py in get_answers(self, question, top_k_reader, top_k_retriever, filters)
     53 
     54         # 3) Apply reader to get granular answer(s)
---> 55         len_chars = sum([len(d.text) for d in documents])
     56         logger.info(f"Reader is looking for detailed answer in {len_chars} chars ...")
     57         results = self.reader.predict(question=question,

/usr/local/lib/python3.6/dist-packages/haystack/finder.py in <listcomp>(.0)
     53 
     54         # 3) Apply reader to get granular answer(s)
---> 55         len_chars = sum([len(d.text) for d in documents])
     56         logger.info(f"Reader is looking for detailed answer in {len_chars} chars ...")
     57         results = self.reader.predict(question=question,

AttributeError: 'list' object has no attribute 'text'

I looked through the codebase and it seems that finder.get_answers expects a list of Document objects from retriever.retrieve, but retriever.retrieve returns a tuple of two lists: a list of paragraph texts and a list of meta-data dicts.

500 Internal Server Error when uvicorn FastAPI

To author:
I am going to use the uvicorn FastAPI server for inference, but every time I use it, it shows a "500 Internal Server Error" like in the following figure. Do you have any idea?

Jonathan Sung
(screenshots omitted)

inference_from_dicts() got an unexpected keyword argument "max_processes"

I received this error with finder.get_answers() from the tutorial.

TypeError                                 Traceback (most recent call last)
<ipython-input-10-d4b62260fd95> in <module>
----> 1 finder.get_answers(question="Who is the buyer?", top_k_retriever=10, top_k_reader=3)

~/Desktop/haystack/haystack/__init__.py in get_answers(self, question, top_k_reader, top_k_retriever, filters)
     71                                       paragrahps=paragraphs,
     72                                       meta_data_paragraphs=meta_data,
---> 73                                       top_k=top_k_reader)
     74 
     75         return results

~/Desktop/haystack/haystack/reader/farm.py in predict(self, question, paragrahps, meta_data_paragraphs, top_k, max_processes)
     76         # get answers from QA model (Top 5 per input paragraph)
     77         predictions = self.model.inference_from_dicts(
---> 78             dicts=input_dicts, rest_api_schema=True, max_processes=max_processes
     79         )
     80 

TypeError: inference_from_dicts() got an unexpected keyword argument 'max_processes'

This was from the version installed with git clone.

Retrievals using TfidfRetriever are missing metadata, printing non-helpful messages

When using a Finder with a TfidfRetriever (InMemoryDocumentStore) and default TransformersReader all indices and scores are printed (see line 75 in tfidf.py), and there is no meta-data being inserted into the documents which are returned (line 96). I commented out the print call and added the following line to the Document constructor:

meta={'name':self.document_store.get_document_by_id(meta['document_id'])['name']},

Now when answers are returned from my finder, I get the document name there as well. If the information being printed by that print call is deemed useful enough to keep around, maybe it could be emitted through a logger from the logging module so that it can be filtered out?

Haystack with Albert is awesome! XLNet question

I am in the midst of evaluating Haystack with Albert and so far it looks awesome. Loving it, thanks for sharing.

I missed the whole Game of Thrones fantasy/drama phenomenon, so for a tutorial I could understand and relate to, I went looking for other content to use with your Tutorial1_Basic_QA_Pipeline.ipynb notebook. Being a Porschephile, I settled on:

import wikipedia

porsche_wikis = wikipedia.search("Porsche", results=25)
doc_dir = "data/porsche/"

for wiki in porsche_wikis:
    html_page = wikipedia.page(title = wiki, auto_suggest = False)
    text_file = open(doc_dir + wiki.replace('/', ' ') + ".txt", "w+")
    text_file.write(html_page.content)
    text_file.close()
    print(wiki)

I can relate to the above content and ask relevant questions of it "all day long". All other code in your notebook remains the same, except I use my Albert model for QA and it works well:

reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512", 
use_gpu=True)

For my application/project, I would like to also evaluate XLNet performance with Haystack but I am having trouble loading my XLNet model:

reader = FARMReader(model_name_or_path="ahotrod/xlnet_large_squad2_512",
use_gpu=True)

Attached is the complete terminal output text, but bottom-line the error I get is:

AttributeError: 'XLNetForQuestionAnswering' object has no attribute 'qa_outputs'

output_term.txt

This XLNet model was fine-tuned on Transformers v2.1.1 and is the best I have because I and others are having problems fine-tuning XLNet_large under Transformers v2.4.1, huggingface/transformers#2651

Perhaps this fine-tuned XLNet model & Transformers v2.1.1 is not compatible/missing the attribute mentioned in the error message?

Looking forward to additional FARM/Haystack QA capabilities you have in the works, thanks for your efforts!
