castorini / bertserini
Home Page: https://github.com/castorini/bertserini
License: Apache License 2.0
BERTserini
Hi, I tried cloning the repo in Google Colab and running the example given in the README. However, I hit a persistent ModuleNotFoundError on the following line:
from transformers.tokenization_bert import BasicTokenizer
in this file: https://github.com/rsvp-ai/bertserini/blob/development/bertserini/utils/utils_squad.py
Could anyone help?
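For what it's worth, transformers v4 reorganized its modules, and transformers.tokenization_bert no longer exists on recent versions, which would explain the error. A version-tolerant sketch of the import:

```python
# transformers v4 moved transformers.tokenization_bert to
# transformers.models.bert.tokenization_bert. This shim tries the old
# path first and falls back to the new one.
try:
    from transformers.tokenization_bert import BasicTokenizer  # transformers < 4
except (ImportError, ModuleNotFoundError):
    from transformers.models.bert.tokenization_bert import BasicTokenizer  # transformers >= 4
```

Alternatively, pinning transformers to a pre-4.x release (the repo is reported elsewhere in these issues to work with transformers==3.4.0) avoids the problem entirely.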
!python -m bertserini.experiments.inference \
  --dataset_path data/dev-v2.0.json \
  --index_path indexes/lucene-index.enwiki-20180701-paragraphs \
  --model_name_or_path rsvp-ai/bertserini-bert-base-squad \
  --output squad_bert_base_pred.json \
  --topk 10 --tokenizer_name rsvp-ai/bertserini-bert-base-squad
The code raises TypeError: TextInputSequence must be str.
The development branch contains the latest runnable SQuAD inference and evaluation code.
@MXueguang has updated the training code for the master branch, but it hasn't been merged yet.
Available prebuilt index in Pyserini:
https://github.com/castorini/pyserini/blob/7ffa5488885218a49e50f8a1ce95235176a3d72b/pyserini/prebuilt_index_info.py#L504
TODO:
Hi,
First of all, thank you for your research efforts and for the repo. Could you fix the index download? The provided wget command is not working; here is its output:
--2021-12-27 18:54:22-- http://72.143.107.253/BERTserini/english_wiki_2018_index.zip
Connecting to 72.143.107.253:80... failed: Connection timed out.
Retrying.
--2021-12-27 18:56:32-- (try: 2) http://72.143.107.253/BERTserini/english_wiki_2018_index.zip
Connecting to 72.143.107.253:80... failed: Connection timed out.
Retrying.
--2021-12-27 18:58:44-- (try: 3) http://72.143.107.253/BERTserini/english_wiki_2018_index.zip
Connecting to 72.143.107.253:80... failed: Connection timed out.
Retrying.
Hi! I ran the sample in the README using Option 2 (the local context), with Transformers 4.16.2.
I encountered an error: TypeError: TextInputSequence must be str
It happened at the line candidates = bert_reader.predict(question, contexts). Could someone help with it? Thanks a lot in advance.
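For what it's worth, this TypeError typically means a None or non-string context reached the HuggingFace fast tokenizer, which only accepts plain str inputs. A defensive sketch (sanitize_contexts is a hypothetical helper, not part of bertserini):

```python
def sanitize_contexts(contexts):
    """Drop None entries and coerce everything to str, since HuggingFace
    fast tokenizers raise 'TypeError: TextInputSequence must be str'
    on any non-str input."""
    return [str(c) for c in contexts if c is not None]

# Example: mixed inputs are reduced to a clean list of strings.
print(sanitize_contexts(["some passage", None, 42]))  # ['some passage', '42']
```

Passing use_fast=False to AutoTokenizer.from_pretrained is another common workaround, since the slow Python tokenizers are more permissive about input types.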
We can now do this in pyserini
>>> from pyserini.search import SimpleSearcher
>>> searcher = SimpleSearcher.from_prebuilt_index('trec45')
Downloading index at https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz...
index-robust04-20191213.tar.gz: 1.70GB [00:50, 36.4MB/s]
Extracting /Users/jimmylin/.cache/pyserini/indexes/index-robust04-20191213.tar.gz into /Users/jimmylin/.cache/pyserini/indexes/index-robust04-2019121315f3d001489c97849a010b0a4734d018...
>>> searcher
<pyserini.search._searcher.SimpleSearcher object at 0x7fee58547ac8>
>>> hits = searcher.search('hubble space telescope')
>>>
>>> # Print the first 10 hits:
... for i in range(0, 10):
... print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
...
1 LA071090-0047 16.85690
2 FT934-5418 16.75630
3 FT921-7107 16.68290
4 LA052890-0021 16.37390
5 LA070990-0052 16.36460
6 LA062990-0180 16.19260
7 LA070890-0154 16.15610
8 FT934-2516 16.08950
9 LA041090-0148 16.08810
10 FT944-128 16.01920
Instead of downloading the indexes by hand, should we take advantage of this feature?
cc/ @MXueguang @qguo96
A quick question about the retriever part:
I was trying to index the Wikipedia dump at the paragraph level, as you did. In the paper you mention getting 29.5M paragraphs, but I got 33.3M. Did you apply any special filtering when splitting the articles into paragraphs, or did you simply split them with article.split("\n")?
Thanks
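The naive splitting described above can be sketched as follows (a hypothetical helper; the paper's exact filtering rules, if any, are not documented here, and a minimum-length filter is one plausible source of the paragraph-count gap):

```python
def split_paragraphs(article: str, min_chars: int = 1):
    """Split an article on newlines and drop blank or too-short fragments.
    Raising min_chars filters out headings and stub lines, which shrinks
    the total paragraph count."""
    return [p.strip() for p in article.split("\n") if len(p.strip()) >= min_chars]

article = "First paragraph.\n\nSecond paragraph.\n  \nThird."
print(split_paragraphs(article))  # ['First paragraph.', 'Second paragraph.', 'Third.']
```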
I want to know what the implications would be of using these models in a commercial application. What about data security, licensing, etc.? And will my data be sent back to the organization maintaining it?
Now I want to build an index for a Chinese corpus other than your pre-built Chinese Wikipedia index. According to your documentation, I should use Anserini's IndexCollection, but I think that tool only builds indexes for English corpora by default. Is there any way to build an index for a Chinese corpus? Thanks a lot!
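For what it's worth, Pyserini's JsonCollection format is language-agnostic, so one possible route (a sketch; whether the --language flag is available depends on your pyserini version) is to write the Chinese corpus as JSONL and index it with a Chinese analyzer:

```python
import json

# Each line of a JsonCollection .jsonl file is one document with
# "id" and "contents" fields; the contents can be any language.
docs = [
    {"id": "zh-doc-1", "contents": "这是第一段中文文本。"},
    {"id": "zh-doc-2", "contents": "这是第二段中文文本。"},
]
with open("docs.jsonl", "w", encoding="utf-8") as f:
    for d in docs:
        f.write(json.dumps(d, ensure_ascii=False) + "\n")

# Then index with a Chinese analyzer (assumption: your pyserini version
# supports --language):
#   python -m pyserini.index.lucene --collection JsonCollection \
#     --input . --index indexes/zh --generator DefaultLuceneDocumentGenerator \
#     --threads 1 --language zh --storeRaw
```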
Hi,
When I run the inference code, it freezes almost every time. No error is raised; the code just hangs at random points.
Is this a common issue? Any suggestions on how to fix it?
Thanks!
I was installing and running the code on Google Colab and ran into some dependency issues.
Please consider updating the requirements.txt file.
Hello all.
I tried to use BERTserini for question answering on a self-created corpus. The base example works perfectly (with transformers == 3.4.0), but I cannot find a solution for the Lucene problem. I know BERTserini depends on Lucene 8 while Pyserini switched to Lucene 9 in its latest version, so I installed https://pypi.org/project/pyserini/0.16.0/ in a separate conda environment and created a new index with it, but the problem remains.
When I tried to build an index with the pyserini version I got from installing bertserini, I was stopped by "/home/user/anaconda3/envs/bertserini/bin/python: No module named pyserini.index.lucene". The only solution I found for that was upgrading pyserini, which isn't an option because of the base bertserini problem.
Is there an easy way around this? Sorry if this is a naive question; as a psychologist I have a rather weak background in computing.
Edit 1: I forgot to mention the command I used to create the index:
python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input tests/resources/sample_collection_jsonl \
  --index indexes/sample_collection_jsonl \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
Hi @MXueguang, I see that we've already published on PyPI - great!
https://pypi.org/project/bertserini/
Can you add instructions for replicating the SQuAD results using just pip install? Just like Pyserini, there should be "Package Installation" instructions for people who want to use it and "Development Installation" instructions for people who want to develop... https://github.com/castorini/pyserini/
Hi, I am trying to replicate the results by following the README.
Do ## BERT-large-wwm-uncased and ## BERT-base-uncased in the evaluation results correspond to rsvp-ai/bertserini-bert-large-squad and rsvp-ai/bertserini-bert-base-squad, respectively?
The evaluation in my run gives, for rsvp-ai/bertserini-bert-large-squad:
(0.4, {'exact_match': 41.54210028382214, 'f1': 49.45378799697662, 'recall': 51.119838584003105, 'precision': 49.8395951713666, 'cover': 47.228003784295176, 'overlap': 57.6631977294229})
which differs from the scores under ## BERT-large-wwm-uncased.
The results for rsvp-ai/bertserini-bert-base-squad matched the scores under ## BERT-base-uncased.