
bertserini's People

Contributors

amyxie361, luchentan, mxueguang


bertserini's Issues

bertserini.experiments.inference

!python -m bertserini.experiments.inference --dataset_path data/dev-v2.0.json \
    --index_path indexes/lucene-index.enwiki-20180701-paragraphs \
    --model_name_or_path rsvp-ai/bertserini-bert-base-squad \
    --output squad_bert_base_pred.json \
    --topk 10 --tokenizer_name rsvp-ai/bertserini-bert-base-squad

The code raises TypeError: TextInputSequence must be str.
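
This error usually means a Hugging Face fast tokenizer was handed something other than a plain str (often None). Two hedged workarounds, neither verified against this repo: pin the transformers version a later issue in this thread reports working (3.4.0), or sanitize the retrieved contexts before they reach the reader; the .text attribute below is an assumption about bertserini's Context objects.

# Hedged workaround sketch, not bertserini's actual code.
# Option 1: pin the version a later issue reports working:
#   pip install transformers==3.4.0
# Option 2: drop any retrieved context whose text is not a plain str;
# fast tokenizers raise "TextInputSequence must be str" otherwise.
contexts = [c for c in contexts if isinstance(getattr(c, "text", None), str)]
candidates = bert_reader.predict(question, contexts)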

Wikipedia index not available for download (wget command failed)

Hi,

First of all, I wish to thank you for your research efforts and for the repo. Could you please fix the index download? The provided wget command is not working. Below is the output of the command:

--2021-12-27 18:54:22--  http://72.143.107.253/BERTserini/english_wiki_2018_index.zip
Connecting to 72.143.107.253:80... failed: Connection timed out.
Retrying.

--2021-12-27 18:56:32--  (try: 2)  http://72.143.107.253/BERTserini/english_wiki_2018_index.zip
Connecting to 72.143.107.253:80... failed: Connection timed out.
Retrying.

--2021-12-27 18:58:44--  (try: 3)  http://72.143.107.253/BERTserini/english_wiki_2018_index.zip
Connecting to 72.143.107.253:80... failed: Connection timed out.
Retrying.
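
Until the mirror is back up, a possible workaround is to let pyserini fetch an index itself, assuming the pyserini version pinned by bertserini already ships the prebuilt-index feature and hosts the English Wikipedia paragraph index under the name 'enwiki-paragraphs' (neither of which I have verified):

# Sketch: download and cache the prebuilt index instead of using wget.
# The name 'enwiki-paragraphs' and the feature's availability in the
# pinned pyserini version are assumptions.
from pyserini.search import SimpleSearcher
searcher = SimpleSearcher.from_prebuilt_index('enwiki-paragraphs')
hits = searcher.search('what is the capital of Canada', k=10)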

Error when running the example in README

Hi! I ran the sample in the README using Option 2 (the local context), with Transformers 4.16.2.
I encountered an error: TypeError: TextInputSequence must be str
It happened on the line candidates = bert_reader.predict(question, contexts). Could someone help with it? Thanks a lot in advance.

Can't train!

TypeError: TextInputSequence must be str
I got this error while training on my own dataset; it happens with the SQuAD dataset as well. Any idea how to fix this?
Thanks!

Take advantage of pyserini's new prebuilt index features

We can now do this in pyserini:

>>> from pyserini.search import SimpleSearcher
>>> searcher = SimpleSearcher.from_prebuilt_index('trec45')
Downloading index at https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz...
index-robust04-20191213.tar.gz: 1.70GB [00:50, 36.4MB/s]                                                                                                                                                                        
Extracting /Users/jimmylin/.cache/pyserini/indexes/index-robust04-20191213.tar.gz into /Users/jimmylin/.cache/pyserini/indexes/index-robust04-2019121315f3d001489c97849a010b0a4734d018...
>>> searcher
<pyserini.search._searcher.SimpleSearcher object at 0x7fee58547ac8>
>>> hits = searcher.search('hubble space telescope')
>>> 
>>> # Print the first 10 hits:
... for i in range(0, 10):
...     print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
... 
 1 LA071090-0047   16.85690
 2 FT934-5418      16.75630
 3 FT921-7107      16.68290
 4 LA052890-0021   16.37390
 5 LA070990-0052   16.36460
 6 LA062990-0180   16.19260
 7 LA070890-0154   16.15610
 8 FT934-2516      16.08950
 9 LA041090-0148   16.08810
10 FT944-128       16.01920

Instead of downloading the indexes by hand, should we take advantage of this feature?

cc/ @MXueguang @qguo96

Question about the retriever

A silly question about the retriever part:

I was trying to index the Wikipedia dump at the paragraph level, as you did. In the paper you mention getting 29.5M paragraphs, but I got 33.3M. Did you apply any special filtering when splitting the articles into paragraphs, or did you simply split them with article.split("\n")?

Thanks
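
For what it's worth, a gap like 33.3M vs. 29.5M paragraphs is the kind of difference a length filter would produce. The sketch below is purely illustrative; the threshold is a guess, not the authors' actual setting.

# Illustrative only: drop very short "paragraphs" (headings, stubs, blank
# lines) when splitting an article. The 20-character threshold is a guess.
def split_article(article: str, min_chars: int = 20) -> list:
    paragraphs = []
    for para in article.split("\n"):
        para = para.strip()
        if len(para) >= min_chars:
            paragraphs.append(para)
    return paragraphs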

How to build an index for a Chinese corpus?

I want to build an index for a Chinese corpus other than your pre-built Chinese Wiki index. From your documentation, I should use Anserini's script bin/IndexCollection, but I think that script only builds indexes for English corpora. Is there any way to build an index for a Chinese corpus? Thanks a lot!
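
For reference, recent Anserini/pyserini indexers expose a language switch; whether the version pinned by bertserini already has it is something I have not checked. A sketch with pyserini's indexer:

# Sketch, assuming the pinned pyserini supports the --language flag;
# "zh" selects Lucene's Chinese analyzer instead of the English one.
python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input path/to/chinese_corpus_jsonl \
  --index indexes/chinese-corpus \
  --generator DefaultLuceneDocumentGenerator \
  --language zh \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw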

Code freeze on inference

Hi,
When I run the inference code, it freezes almost every time. No error is raised; the code just hangs at a random point.
Is this a common issue? Any suggestions on how to fix it?
Thanks!
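
A generic way to see where a hung run is stuck, assuming the freeze is in Python code rather than in the JVM that pyserini spawns: register faulthandler so a signal dumps every thread's stack without killing the process.

# Debugging sketch, not a fix: add near the top of the inference script.
import faulthandler, signal
faulthandler.register(signal.SIGUSR1)
# When the run hangs, from another shell:  kill -USR1 <pid>
# Tracebacks for all threads are then printed to stderr.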

dependencies update

I was installing and running the code on Google Colab and ran into some dependency issues:

  • utils.py uses zhon, which is not in the requirements.txt file yet, so I had to install it manually (stopgap below)
  • torch gets reinstalled, downgraded from 1.8.1 to 1.5.1, which takes extra time and resources

So please consider updating the requirements.txt file.
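
Until then, a Colab stopgap (version unpinned, assuming the latest zhon works):

!pip install zhon   # missing from requirements.txt per the first bullet above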

org.apache.lucene.index.IndexFormatTooNewException (lucene 10, needs to be between 7 and 9) when using a self-created corpus

Hello all.

I tried to use Bertserini for question answering with a self-created corpus. The base example works perfectly (with transformers == 3.4.0), but I am not able to find a solution for the lucene problem. I know Bertserini depends on lucene 8 while pyserini switched to lucene 9 in its latest version, so I installed https://pypi.org/project/pyserini/0.16.0/ in a separate conda environment and created a new index with it, but the problem stays the same.

When I tried to build an index with the pyserini version I got from installing bertserini, I was stopped by "/home/user/anaconda3/envs/bertserini/bin/python: No module named pyserini.index.lucene". The only solution I found for that was upgrading pyserini, which isn't an option because of the base bertserini problem.

Is there an easy way around this? Sorry if this is a stupid question, but as a psychologist I have a rather weak informatics background.

edit1:
Forgot to mention which command I used to create the index:

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input tests/resources/sample_collection_jsonl \
  --index indexes/sample_collection_jsonl \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
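
One hedged observation: the "No module named pyserini.index.lucene" error suggests the pyserini version pinned by bertserini predates that module path. In older pyserini releases the indexer lived at pyserini.index with single-dash flags, so the equivalent command there would look roughly like this (the exact flag spelling for the pinned version is an assumption):

# Sketch for an older pyserini whose indexer is pyserini.index rather
# than pyserini.index.lucene. Flag spelling assumed, not verified.
python -m pyserini.index -collection JsonCollection \
  -input tests/resources/sample_collection_jsonl \
  -index indexes/sample_collection_jsonl \
  -generator DefaultLuceneDocumentGenerator \
  -threads 1 \
  -storePositions -storeDocvectors -storeRaw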

Replicating evaluation scores

Hi, I am trying to replicate the results by following the README.

Are ## BERT-large-wwm-uncased and ## BERT-base-uncased in the evaluation results the same as rsvp-ai/bertserini-bert-large-squad and rsvp-ai/bertserini-bert-base-squad, respectively?

The evaluation results in my run give:
rsvp-ai/bertserini-bert-large-squad

(0.4, {'exact_match': 41.54210028382214, 'f1': 49.45378799697662, 'recall': 51.119838584003105, 'precision': 49.8395951713666, 'cover': 47.228003784295176, 'overlap': 57.6631977294229})

which is different from the scores under ## BERT-large-wwm-uncased.

Results for rsvp-ai/bertserini-bert-base-squad were the same as the scores under ## BERT-base-uncased.
