castorini / bertserini
Home Page: https://github.com/castorini/bertserini
License: Apache License 2.0
BERTserini
Hi, I tried cloning the repo in Google Colab and running the example given in the README. However, I hit a persistent ModuleNotFoundError on the following line:
from transformers.tokenization_bert import BasicTokenizer
in this file: https://github.com/rsvp-ai/bertserini/blob/development/bertserini/utils/utils_squad.py
Could anyone help?
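For what it's worth, transformers v4 reorganized its modules, and transformers.tokenization_bert no longer exists on recent versions, which would explain the error. A version-tolerant sketch of the import:

```python
# transformers v4 moved transformers.tokenization_bert to
# transformers.models.bert.tokenization_bert. This shim tries the old
# path first and falls back to the new one.
try:
    from transformers.tokenization_bert import BasicTokenizer  # transformers < 4
except (ImportError, ModuleNotFoundError):
    from transformers.models.bert.tokenization_bert import BasicTokenizer  # transformers >= 4
```

Alternatively, pinning transformers to a pre-4.x release (the repo is reported elsewhere in these issues to work with transformers==3.4.0) avoids the problem entirely.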
!python -m bertserini.experiments.inference \
  --dataset_path data/dev-v2.0.json \
  --index_path indexes/lucene-index.enwiki-20180701-paragraphs \
  --model_name_or_path rsvp-ai/bertserini-bert-base-squad \
  --output squad_bert_base_pred.json \
  --topk 10 --tokenizer_name rsvp-ai/bertserini-bert-base-squad
The code raises TypeError: TextInputSequence must be str.
The development branch contains the latest runnable SQuAD inference and evaluation code.
@MXueguang has updated the training code for the master branch, but it hasn't been merged yet.
Available prebuilt index in Pyserini:
https://github.com/castorini/pyserini/blob/7ffa5488885218a49e50f8a1ce95235176a3d72b/pyserini/prebuilt_index_info.py#L504
TODO:
Hi,
First of all, thank you for your research efforts and for the repo. Could you fix the index download? The provided wget command is not working; here is its output:
--2021-12-27 18:54:22-- http://72.143.107.253/BERTserini/english_wiki_2018_index.zip
Connecting to 72.143.107.253:80... failed: Connection timed out.
Retrying.
--2021-12-27 18:56:32-- (try: 2) http://72.143.107.253/BERTserini/english_wiki_2018_index.zip
Connecting to 72.143.107.253:80... failed: Connection timed out.
Retrying.
--2021-12-27 18:58:44-- (try: 3) http://72.143.107.253/BERTserini/english_wiki_2018_index.zip
Connecting to 72.143.107.253:80... failed: Connection timed out.
Retrying.
Hi! I ran the sample in the README using Option 2 (the local context), with Transformers 4.16.2.
I encountered an error: TypeError: TextInputSequence must be str
It happened at the line candidates = bert_reader.predict(question, contexts). Could someone help with it? Thanks a lot in advance.
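For what it's worth, this TypeError typically means a None or non-string context reached the HuggingFace fast tokenizer, which only accepts plain str inputs. A defensive sketch (sanitize_contexts is a hypothetical helper, not part of bertserini):

```python
def sanitize_contexts(contexts):
    """Drop None entries and coerce everything to str, since HuggingFace
    fast tokenizers raise 'TypeError: TextInputSequence must be str'
    on any non-str input."""
    return [str(c) for c in contexts if c is not None]

# Example: mixed inputs are reduced to a clean list of strings.
print(sanitize_contexts(["some passage", None, 42]))  # ['some passage', '42']
```

Passing use_fast=False to AutoTokenizer.from_pretrained is another common workaround, since the slow Python tokenizers are more permissive about input types.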
We can now do this in pyserini
>>> from pyserini.search import SimpleSearcher
>>> searcher = SimpleSearcher.from_prebuilt_index('trec45')
Downloading index at https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz...
index-robust04-20191213.tar.gz: 1.70GB [00:50, 36.4MB/s]
Extracting /Users/jimmylin/.cache/pyserini/indexes/index-robust04-20191213.tar.gz into /Users/jimmylin/.cache/pyserini/indexes/index-robust04-2019121315f3d001489c97849a010b0a4734d018...
>>> searcher
<pyserini.search._searcher.SimpleSearcher object at 0x7fee58547ac8>
>>> hits = searcher.search('hubble space telescope')
>>>
>>> # Print the first 10 hits:
... for i in range(0, 10):
... print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
...
1 LA071090-0047 16.85690
2 FT934-5418 16.75630
3 FT921-7107 16.68290
4 LA052890-0021 16.37390
5 LA070990-0052 16.36460
6 LA062990-0180 16.19260
7 LA070890-0154 16.15610
8 FT934-2516 16.08950
9 LA041090-0148 16.08810
10 FT944-128 16.01920
Instead of downloading the indexes by hand, should we take advantage of this feature?
cc/ @MXueguang @qguo96
A quick question about the retriever part:
I was trying to index the Wikipedia dump at the paragraph level, as you did. In the paper you mention getting 29.5M paragraphs, but I got 33.3M. Did you apply any special filtering when splitting the articles into paragraphs, or did you simply split them with article.split("\n")?
Thanks
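The naive splitting described above can be sketched as follows (a hypothetical helper; the paper's exact filtering rules, if any, are not documented here, and a minimum-length filter is one plausible source of the paragraph-count gap):

```python
def split_paragraphs(article: str, min_chars: int = 1):
    """Split an article on newlines and drop blank or too-short fragments.
    Raising min_chars filters out headings and stub lines, which shrinks
    the total paragraph count."""
    return [p.strip() for p in article.split("\n") if len(p.strip()) >= min_chars]

article = "First paragraph.\n\nSecond paragraph.\n  \nThird."
print(split_paragraphs(article))  # ['First paragraph.', 'Second paragraph.', 'Third.']
```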
I want to know what the implications would be of using these models in a commercial application. What about data security, licensing, etc.? And will my data be sent back to the organization maintaining it?
Now I want to build an index for a Chinese corpus other than your pre-built Chinese Wikipedia index. According to your documentation, I should use Anserini's IndexCollection, but I think that tool only builds indexes for English corpora by default. Is there any way to build an index for a Chinese corpus? Thanks a lot!
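For what it's worth, Pyserini's JsonCollection format is language-agnostic, so one possible route (a sketch; whether the --language flag is available depends on your pyserini version) is to write the Chinese corpus as JSONL and index it with a Chinese analyzer:

```python
import json

# Each line of a JsonCollection .jsonl file is one document with
# "id" and "contents" fields; the contents can be any language.
docs = [
    {"id": "zh-doc-1", "contents": "这是第一段中文文本。"},
    {"id": "zh-doc-2", "contents": "这是第二段中文文本。"},
]
with open("docs.jsonl", "w", encoding="utf-8") as f:
    for d in docs:
        f.write(json.dumps(d, ensure_ascii=False) + "\n")

# Then index with a Chinese analyzer (assumption: your pyserini version
# supports --language):
#   python -m pyserini.index.lucene --collection JsonCollection \
#     --input . --index indexes/zh --generator DefaultLuceneDocumentGenerator \
#     --threads 1 --language zh --storeRaw
```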
Hi,
When I run the inference code, it freezes almost every time. No error is raised; the code just hangs at random points.
Is this a common issue? Any suggestions on how to fix it?
Thanks!
I was installing and running the code on Google Colab and ran into some dependency issues.
Please consider updating the requirements.txt file.
Hello all.
I tried to use BERTserini for question answering on a self-created corpus. The base example works perfectly (with transformers == 3.4.0), but I cannot find a solution for the Lucene problem. I know BERTserini depends on Lucene 8 while Pyserini switched to Lucene 9 in its latest version, so I installed https://pypi.org/project/pyserini/0.16.0/ in a separate conda environment and created a new index with it, but the problem remains.
When I tried to build an index with the pyserini version I got from installing bertserini, I was stopped by "/home/user/anaconda3/envs/bertserini/bin/python: No module named pyserini.index.lucene". The only solution I found for that was upgrading pyserini, which isn't an option because of the base bertserini problem.
Is there an easy way around this? Sorry if this is a naive question; as a psychologist I have a rather weak background in computing.
Edit 1: I forgot to mention the command I used to create the index:
python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input tests/resources/sample_collection_jsonl \
  --index indexes/sample_collection_jsonl \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
Hi @MXueguang, I see that we've already published on PyPI - great!
https://pypi.org/project/bertserini/
Can you add instructions for replicating the SQuAD results using just pip install? Just like Pyserini, there should be "Package Installation" instructions for people who want to use it and "Development Installation" instructions for people who want to develop... https://github.com/castorini/pyserini/
Hi, I am trying to replicate the results by following the README.
Do ## BERT-large-wwm-uncased and ## BERT-base-uncased in the evaluation results correspond to rsvp-ai/bertserini-bert-large-squad and rsvp-ai/bertserini-bert-base-squad, respectively?
The evaluation in my run gives, for rsvp-ai/bertserini-bert-large-squad:
(0.4, {'exact_match': 41.54210028382214, 'f1': 49.45378799697662, 'recall': 51.119838584003105, 'precision': 49.8395951713666, 'cover': 47.228003784295176, 'overlap': 57.6631977294229})
which differs from the scores under ## BERT-large-wwm-uncased.
The results for rsvp-ai/bertserini-bert-base-squad matched the scores under ## BERT-base-uncased.