castorini / pygaggle

a gaggle of deep neural architectures for text ranking and question answering, designed for Pyserini

Home Page: http://pygaggle.ai/

License: Apache License 2.0


pygaggle's Introduction

PyGaggle

PyPI LICENSE

PyGaggle provides a gaggle of deep neural architectures for text ranking and question answering. It was designed for tight integration with Pyserini, but can be easily adapted for other sources as well.

Currently, this repo contains implementations of the rerankers for MS MARCO Passage Retrieval, MS MARCO Document Retrieval, TREC-COVID and CovidQA.

Installation

  1. Clone the repo with git clone --recursive https://github.com/castorini/pygaggle.git

  2. Make sure you have Python 3.8+ installed. All python commands below refer to this installation.

  3. For pip, do pip install -r requirements.txt

    • If you prefer Anaconda, use conda env create -f environment.yml && conda activate pygaggle.

A Simple Reranking Example

Here's how to initialize the T5 reranker from Document Ranking with a Pretrained Sequence-to-Sequence Model:

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

reranker = MonoT5()

Alternatively, here's the BERT reranker from Passage Re-ranking with BERT, which isn't as good as the T5 reranker:

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoBERT

reranker = MonoBERT()

Either way, continue with a complete reranking example:

# Here's our query:
query = Query('who proposed the geocentric theory')

# Option 1: fetch some passages to rerank from MS MARCO with Pyserini
from pyserini.search import LuceneSearcher
searcher = LuceneSearcher.from_prebuilt_index('msmarco-passage')
hits = searcher.search(query.text)

from pygaggle.rerank.base import hits_to_texts
texts = hits_to_texts(hits)

# Option 2: here's what Pyserini would have retrieved, hard-coded
passages = [['7744105', 'For Earth-centered it was  Geocentric Theory proposed by greeks under the guidance of Ptolemy and Sun-centered was Heliocentric theory proposed by Nicolas Copernicus in 16th century A.D. In short, Your Answers are: 1st blank - Geo-Centric Theory. 2nd blank - Heliocentric Theory.'], ['2593796', 'Copernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.he geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.'], ['6217200', 'The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.opernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.'], ['3276925', 'Copernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.Simple tools, such as the telescope â\x80\x93 which helped convince Galileo that the Earth was not the center of the universe â\x80\x93 can prove that ancient theory incorrect.ou might want to check out one article on the history of the geocentric model and one regarding the geocentric theory. Here are links to two other articles from Universe Today on what the center of the universe is and Galileo one of the advocates of the heliocentric model.'], ['6217208', 'Copernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.Simple tools, such as the telescope â\x80\x93 which helped convince Galileo that the Earth was not the center of the universe â\x80\x93 can prove that ancient theory incorrect.opernicus proposed a heliocentric model of the solar system â\x80\x93 a model where everything orbited around the Sun. Today, with advancements in science and technology, the geocentric model seems preposterous.'], ['4280557', 'The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.imple tools, such as the telescope â\x80\x93 which helped convince Galileo that the Earth was not the center of the universe â\x80\x93 can prove that ancient theory incorrect. You might want to check out one article on the history of the geocentric model and one regarding the geocentric theory.'], ['264181', 'Nicolaus Copernicus (b. 1473â\x80\x93d. 1543) was the first modern author to propose a heliocentric theory of the universe. From the time that Ptolemy of Alexandria (c. 
150 CE) constructed a mathematically competent version of geocentric astronomy to Copernicusâ\x80\x99s mature heliocentric version (1543), experts knew that the Ptolemaic system diverged from the geocentric concentric-sphere conception of Aristotle.'], ['4280558', 'A Geocentric theory is an astronomical theory which describes the universe as a Geocentric system, i.e., a system which puts the Earth in the center of the universe, and describes other objects from the point of view of the Earth. Geocentric theory is an astronomical theory which describes the universe as a Geocentric system, i.e., a system which puts the Earth in the center of the universe, and describes other objects from the point of view of the Earth.'], ['3276926', 'The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed to explain how the planets, the Sun, and even the stars orbit around the Earth.ou might want to check out one article on the history of the geocentric model and one regarding the geocentric theory. Here are links to two other articles from Universe Today on what the center of the universe is and Galileo one of the advocates of the heliocentric model.'], ['5183032', "After 1,400 years, Copernicus was the first to propose a theory which differed from Ptolemy's geocentric system, according to which the earth is at rest in the center with the rest of the planets revolving around it."]]

texts = [ Text(p[1], {'docid': p[0]}, 0) for p in passages] # Note, pyserini scores don't matter since T5 will ignore them.

# Either option, let's print out the passages prior to reranking:
for i in range(0, 10):
    print(f'{i+1:2} {texts[i].metadata["docid"]:15} {texts[i].score:.5f} {texts[i].text}')

# Finally, rerank:
reranked = reranker.rerank(query, texts)

# Print out reranked results:
for i in range(0, 10):
    print(f'{i+1:2} {reranked[i].metadata["docid"]:15} {reranked[i].score:.5f} {reranked[i].text}')
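
Note that rerank may return the passages in their original order, only with updated scores; to list them by relevance, sort by score first:

# Sort the reranked passages by descending score to list them by relevance,
# then print them as above.
reranked.sort(key=lambda x: x.score, reverse=True)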

Reranking with a different checkpoint

There are many monoBERT and monoT5 checkpoints on our Hugging Face model page: https://huggingface.co/castorini

The MonoT5() class uses castorini/monot5-base-msmarco by default. In the example below, we show how to use a different checkpoint (i.e., castorini/monot5-base-msmarco-10k):

from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco-10k')
reranker = MonoT5(model=model)
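
Depending on your PyGaggle version, you can also load the tokenizer explicitly and pass both to the reranker. A minimal sketch, mirroring the get_model/get_tokenizer usage that appears in the issues below (it assumes those helpers and a (model, tokenizer) constructor are available in your version):

from pygaggle.rerank.transformer import MonoT5

model = MonoT5.get_model('castorini/monot5-base-msmarco-10k')
tokenizer = MonoT5.get_tokenizer('t5-base')
reranker = MonoT5(model, tokenizer)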

Experiments on IR collections

The following documents describe how to use PyGaggle on various IR test collections:

Experiments on QA collections

The following documents describe how to use PyGaggle for QA:

pygaggle's People

Contributors

aivan6842, daemon, dahlia-chehata, edanerg, elfsong, hangcui0510, hugoabonizio, justinborromeo, kaisun314, kelvin-jiang, larryli1999, leobavila, lhbonifacio, lingwei-gu, lintool, manveertamber, mrkarezina, mxueguang, mzzchy, qguo96, rayyang29, rodrigonogueira4, ronakice, saileshnankani, stephaniewhoo, vjeronymo2, wiltan-uw, wongalvis14, yuxuan-ji, zhou3983


pygaggle's Issues

Machine Specifications/Requirements?

I am trying to run the examples on a colab-pro notebook and the SciBERT model is crashing due to memory.

  • What are the specifications of the machine you are using?
  • Could it be the torch version?
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/pygaggle/pygaggle/run/evaluate_kaggle_highlighter.py", line 191, in <module>
    main()
  File "/content/pygaggle/pygaggle/run/evaluate_kaggle_highlighter.py", line 184, in main
    for metric in evaluator.evaluate(examples):
  File "/content/pygaggle/pygaggle/model/evaluate.py", line 162, in evaluate
    example.documents)]
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/content/pygaggle/pygaggle/rerank/transformer.py", line 89, in rerank
    score = self.methods[self.method](matrix) if matrix.size(1) > 0 \
  File "/content/pygaggle/pygaggle/rerank/transformer.py", line 55, in <lambda>
    methods = dict(max=lambda x: x.max().item(),
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCReduceAll.cuh:327

transformers 2.11.0

MonoBERT and MonoT5 defaults

Follow up to #83, we have:

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

model_name = 'castorini/monot5-base-msmarco'
tokenizer_name = 't5-base'
reranker =  MonoT5(model_name, tokenizer_name)

I think model_name and tokenizer_name should have defaults, so we can boil this down to:

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

reranker =  MonoT5()

Also, should the regression scripts be modified to use this new abstraction?

PARADE replication in Capreolus

URAs and others interested in PyGaggle should also take a look at Capreolus, which is another neural IR toolkit that our group contributes to: https://github.com/capreolus-ir/capreolus

See https://dl.acm.org/doi/10.1145/3336191.3371868

Capreolus is a much more full-featured toolkit and has a lot more models.

It was developed before PyGaggle. PyGaggle sprang into being because we needed something quick-and-dirty for our TREC-COVID experiments.

The relationship between PyGaggle and Capreolus is evolving, but for certain experiments that we want to run, it makes much more sense to start with Capreolus.

For one, Capreolus has PARADE, which I'm a big fan of. We have some ideas on follow-up experiments, so replicating the results might be a start: https://github.com/capreolus-ir/capreolus/blob/master/docs/reproduction/PARADE.md

Error when running monoBert experiment

After running the command srun --mem=64G --cpus-per-task=2 --time=24:0:0 --gres=gpu:v100l:2 --pty bash to get GPU access and enter interactive mode, the following error occurs when trying to install pygaggle using the command pip3 install -r requirements.txt (I have tried both pip and pip3; both failed):

Failed building wheel for pyjnius
Running setup.py clean for pyjnius
Failed to build pyjnius
Installing collected packages: pydantic, scipy, threadpoolctl, scikit-learn, pytz, python-dateutil, pandas, Cython, pyjnius, pyserini, spacy, pyasn1, rsa, pyasn1-modules, cachetools, google-auth, oauthlib, requests-oauthlib, google-auth-oauthlib, zipp, importlib-metadata, markdown, werkzeug, absl-py, grpcio, tensorboard-plugin-wit, tensorboard, tokenizers, tqdm, transformers
Found existing installation: pydantic 1.7.2
Uninstalling pydantic-1.7.2:
Successfully uninstalled pydantic-1.7.2
Rolling back uninstall of pydantic
Could not install packages due to an EnvironmentError: [Errno 30] Read-only file system: '/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/site-packages/pydantic-1.5.dist-info'

I tried to install it individually using the command pip3 install --user pyjnius, which successfully installed cython-0.29.21 and pyjnius-1.3.0 for me.

I then continued with another pip3 install -r requirements.txt, getting the following error:

Installing collected packages: pytz, python-dateutil, pandas, scipy, threadpoolctl, scikit-learn, pyserini, spacy, cachetools, pyasn1, pyasn1-modules, rsa, google-auth, oauthlib, requests-oauthlib, google-auth-oauthlib, werkzeug, tensorboard-plugin-wit, absl-py, grpcio, zipp, importlib-metadata, markdown, tensorboard, tokenizers, tqdm, transformers
Could not install packages due to an EnvironmentError: [Errno 30] Read-only file system: '/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/site-packages/pytz-2020.4.dist-info'

which I fixed using the new command pip3 install --user -r requirements.txt.

The steps above resolved the problems with the pygaggle installation. Next, when actually running the monoBERT experiment, I got:

ImportError: cannot import name 'AutoModel'

which I solved by pip3 install --user torch.

No module named 'thinc', No module named 'catalogue'

which I solved by installing them individually with pip3.

I ended up getting stuck here:

Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/scratch/o3liu/pygaggle/pygaggle/run/evaluate_passage_ranker.py", line 16, in <module>
from pygaggle.rerank.transformer import (
File "/scratch/o3liu/pygaggle/pygaggle/rerank/transformer.py", line 11, in <module>
from .similarity import SimilarityMatrixProvider
File "/scratch/o3liu/pygaggle/pygaggle/rerank/similarity.py", line 5, in <module>
from pygaggle.model.encode import SingleEncoderOutput
File "/scratch/o3liu/pygaggle/pygaggle/model/__init__.py", line 6, in <module>
from .encode import *
File "/scratch/o3liu/pygaggle/pygaggle/model/encode.py", line 8, in <module>
from .tokenize import BatchTokenizer
File "/scratch/o3liu/pygaggle/pygaggle/model/tokenize.py", line 5, in <module>
from spacy.lang.en import English
File "/home/o3liu/.local/lib/python3.6/site-packages/spacy/__init__.py", line 12, in <module>
from . import pipeline
File "/home/o3liu/.local/lib/python3.6/site-packages/spacy/pipeline/__init__.py", line 4, in <module>
from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker
File "pipes.pyx", line 1, in init spacy.pipeline.pipes
File "gold.pxd", line 18, in init spacy.syntax.nn_parser
File "syntax/transition_system.pxd", line 38, in init spacy.gold
File "search.pxd", line 37, in init spacy.syntax.transition_system
ValueError: thinc.extra.search.Beam size changed, may indicate binary incompatibility. Expected 120 from C header, got 112 from PyObject

Annotated Q-A pairs?

Where can I find the annotated Q-A pairs? The given JSON files consist of multiple answers for a single question. What should I use?

Add mono models to huggingface model-zoo and incorporate into pipeline

As pointed out in hedwig, Hugging Face's model zoo provides a simple way of loading models without us having to manually download them each time we use them. Adding this would also allow us to track how many times our models are being used, so it would be a great feature to incorporate into pygaggle.

Begin with monoBERT-large (should be simple) (Name choice - monobert-large-msmarco) and then figure out how to do the same with monoT5 (monot5-base-msmarco).
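
As a rough sketch of what loading from the model zoo could look like once the checkpoints are uploaded (assuming the monobert-large-msmarco checkpoint ends up under the castorini organization and is compatible with the generic transformers sequence-classification API):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint name; adjust once the models are actually published.
tokenizer = AutoTokenizer.from_pretrained('castorini/monobert-large-msmarco')
model = AutoModelForSequenceClassification.from_pretrained('castorini/monobert-large-msmarco')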

Pretrained T5 Reranker

Hi,

Do you have the pretrained T5 reranker uploaded to the Hugging Face repo or any other file-sharing site?

Thanks

Prepare for version='0.0.2'

I think an update is long overdue, especially since we've added a lot of stuff, as well as the new CovidQA dataset.

Complete monoBERT and monoT5 replication on ComputeCanada

If URAs use ComputeCanada resources, how long would it take to replicate full results on MS MARCO passage?

I think it'd be nice to have a full leaderboard replication, especially with monoT5, since it's still reasonably competitive with respect to the SOTA.

update-index.sh not working

The file update-index.sh contains a reference to a dropbox url that doesn't exist:
INDEX_URL=${2:-https://www.dropbox.com/s/s3bylw97cf0t2wq/lucene-index-cord19-paragraph-2020-05-12.tar.gz}

That script is referenced in the installation instructions in the README file. However, it doesn't look like a required step except when working with the CORD-19 dataset.
Maybe it could be replaced by some test index (smaller than CORD-19 or MS MARCO).

Duplicated answer in the QA dataset

First, great work with the QA dataset and thanks for sharing!

I found the following answer is (wrongly) duplicated in the dataset:

"id":"o56j4qio",
"title":"Journal Pre-proof Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis Prevalence of comorbidities in the Novel Wuhan Coronavirus (COVID-19) infection: a systematic review and meta-analysis",
"exact_answer":"OR 2.07, 95% CI: 0.89-4.82"

Also, it might be helpful to specify that the ID refers to the context article rather than uniquely identifying the answer.

Hope this helps to review the dataset for future versions.

incompatibility of Pyserini version

Hello,

As stated in the PyGaggle library, the requirement for PyGaggle is pyserini 0.9.4.0. However, there is a line in pygaggle that is incompatible with pyserini 0.9.4.0, i.e.:

pygaggle/pygaggle/rerank/base.py :

from pyserini.pyclass import JSimpleSearcherResult

It should be

from pyserini.search import JSimpleSearcherResult

to be compatible with the latest pyserini version. Is this correct?

Error on model name

Hi,
I am using MonoT5 reranker to rerank my own collection. I am trying to use non-default models, like this:

model_name = 'castorini/monot5-3b-med-msmarco'
tokenizer_name = 't5-3b'
reranker =  MonoT5(model_name, tokenizer_name)

However I am getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/src/app/pygaggle/pygaggle/rerank/transformer.py", line 33, in __init__
    self.device = next(self.model.parameters(), None).device
AttributeError: 'str' object has no attribute 'parameters'

I think that this is a version related issue with some package, but I am not sure which one is causing the problem (I am using transformers==2.10.0 and I have also tried recently updated transformers==4.0.0).

I would appreciate any help,
Marcos

Fine-tuning

Hi, I was wondering whether the library supports only a zero-shot approach on new data, or whether fine-tuning on my dataset could be performed as well?

Conda -- Solving environment: failed

At head commit c1a54cb:

Per instructions:

$ conda env create -f environment.yml && conda activate pygaggle
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound: 
  - libgfortran-ng=7.3.0
  - libstdcxx-ng=9.1.0
  - libgcc-ng=9.1.0

update transformers dependency to latest transformers==4.0.0

The current dependency, transformers==2.10.0, is a bit outdated.

Updating to transformers==3.4.0:

Conflicts are fixed already.
I will create a PR when monoT5 and monoBERT's results are replicated on my end.

The following warnings may need to be addressed as well:

  • Truncation was not explicitely activated but max_length is provided a specific value, please use truncation=True to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
  • /u5/x93ma/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:1938: FutureWarning: The pad_to_max_length argument is deprecated and will be removed in a future version, use padding=True or padding='longest' to pad to the longest sequence in the batch, or use padding='max_length' to pad to a max length. In this case, you can give a specific length with max_length (e.g. max_length=45) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
    warnings.warn(
  • /u5/x93ma/anaconda3/lib/python3.8/site-packages/transformers/tokenization_t5.py:176: UserWarning: This sequence already has </s>. In future versions this behavior may lead to duplicated eos tokens being added.

Warnings when I load models

Hello, when I load T5 models there is such a warning, but it has no impact on the next operation. Could you tell me the reason for this?

model = MonoT5.get_model('castorini/monot5-3b-med-msmarco')
tokenizer = MonoT5.get_tokenizer('t5-3b')
reranker = MonoT5(model, tokenizer)

Some weights of the model checkpoint at castorini/monot5-3b-med-msmarco were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']

  • This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

pygaggle can't be installed in Colab?

When I pip install pygaggle in Colab, it could not find a version that satisfies the requirement pygaggle. The Python version in Colab is 3.6+; however, pygaggle needs Python 3.7+. Are there any good ways to install pygaggle in Colab?

Simplify monoT5 and monoBERT boilerplate

There's a lot of boilerplate here: https://github.com/castorini/pygaggle#a-simple-reranking-example

Can we fold all of that into the constructor of the class? E.g., so we're left with:

reranker =  monoT5()

or

reranker =  monoBERT()

Make model_name, tokenizer_name, etc. configurable with sensible defaults.

So simple reranking gets boiled down to

from pyserini.search import SimpleSearcher
from pygaggle.rerank.base import hits_to_texts

query = Query('who proposed the geocentric theory')
searcher = SimpleSearcher('/path/to/msmarco/index/')
reranker = monoBERT()

hits = searcher.search(query.text)
reranked = reranker.rerank(query, hits_to_texts(hits))
reranked.sort(key=lambda x: x.score, reverse=True)

@rodrigonogueira4 thoughts?

Automate downloading of Pyserini indexes

Currently we have in README:

query = Query('who proposed the geocentric theory')

# Option 1: fetch some passages to rerank from MS MARCO with Pyserini
from pyserini.search import SimpleSearcher
searcher = SimpleSearcher('/path/to/msmarco/index/')
hits = searcher.search(query.text)

from pygaggle.rerank.base import hits_to_texts
texts = hits_to_texts(hits)

...

If we automate index downloading, per castorini/pyserini#225, it would make replication even easier.

[CovidQA Experiment] ModuleNotFoundError: No module named 'pyserini.analysis.pyanalysis'

I'm trying to replicate the CovidQA experiment and am receiving the following error when running the following command:

$ python -um pygaggle.run.evaluate_kaggle_highlighter --method random \
                                                      --dataset data/kaggle-lit-review-0.2.json \
                                                      --index-dir indexes/lucene-index-cord19-paragraph-2020-05-12
Traceback (most recent call last):
  File "/home/justin/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/justin/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/justin/Justin/School/data-systems-group/pygaggle/pygaggle/run/evaluate_kaggle_highlighter.py", line 16, in <module>
    from pygaggle.rerank.bm25 import Bm25Reranker
  File "/home/justin/Justin/School/data-systems-group/pygaggle/pygaggle/rerank/bm25.py", line 6, in <module>
    from pyserini.analysis.pyanalysis import get_lucene_analyzer, Analyzer
ModuleNotFoundError: No module named 'pyserini.analysis.pyanalysis'

I'm using Anserini 0.9.5, Pyserini 0.9.4, and PyTorch 1.4.0. Is there something I'm missing?

Also, the same error occurs when reranking with MonoT5 from the MSMarco-Document experiment.

Replication results

Results on Hydra
CUDA 10.0
Python 3.7.7
Pytorch 1.5.0

Table 1 refers to the table on page 5 of "Rapidly Bootstrapping a Question Answering Dataset for COVID-19": https://arxiv.org/pdf/2004.11339.pdf

BM25: (matches Table 1)
precision@1 0.15
recall@3 0.2163978494623656
recall@50 0.619758064516129
recall@1000 0.6318548387096774
mrr 0.24284268136968856
mrr@10 0.22115655401945727

BERT: (matches Table 1)
precision@1 0.08064516129032258
recall@3 0.1172043010752688
recall@50 0.6460061443932411
recall@1000 1.0
mrr 0.15878306593763689
mrr@10 0.13760880696364566

T5 (fine-tuned on MS MARCO): (does not match Table 1, misses about 1%)
precision@1 0.27419354838709675
recall@3 0.43502304147465437
recall@50 0.9305683563748081
recall@1000 1.0
mrr 0.4224002621206025
mrr@10 0.4097638248847927

Add duoBERT

I was looking into adding duo support (k-way ranking potentially) for MSMARCO and TREC-CAR.

For RelevanceExample, I was wondering if we can have documents as Union[List[Text], List[List[Text]]] instead of just List[Text]. I could also do something like this in the evaluate method in RerankerEvaluator. Another way is to add an is_duo argument to the various rerank methods/classes.

Thoughts? @daemon @rodrigonogueira4

GPU by default?

Hi! First of all, congrats on your outstanding work!
I am using your pretrained models to rerank a custom collection, as shown in your section A Simple Reranking Example. However, I would like to know whether your T5 model fine-tuned on MS MARCO uses any available GPU device by default to perform its predictions and reranking. If not, I would appreciate any indication of how to achieve this.
Best,
Marcos

Training New Reranker And Benchmarking

Hello,

Great work with the suite of libraries around pygaggle! I really loved your work. It took me very little time to start working and experimenting on the doc-ranking task.

I have created a new model to run experiments with the reranker. I want to use the MS MARCO dataset and measure MRR.

I see that this file contains the main script to run for the MS MARCO dataset.

Can you provide a way to extend this module to bring in outside models for testing and benchmarking?

Flatten the structure of the ground truth JSON

For v0.0.2 of the dataset, should we flatten the nested hierarchy?

Currently, we have subcategories nested within categories. I propose to undo the nesting and just have something like:

[{
   "category": ...,
   "subcategory": ...
}, {
   ...
},
...]

That way, when we start integrating other sources, we wouldn't need to jam things into this rigid structure.
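
A minimal sketch of the flattening step, with illustrative field names only (the actual keys in the ground truth file may differ):

# Illustrative only: assumes each category currently nests its subcategories
# (the real field names in the ground truth file may differ).
nested = [
    {'name': 'Transmission', 'sub_categories': [{'name': 'Incubation period'}]},
]

flat = [
    {'category': category['name'], 'subcategory': subcategory['name']}
    for category in nested
    for subcategory in category['sub_categories']
]
print(flat)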

Thoughts?

add DPR reader

For now, PyGaggle is for reranking only, right?

As we are adding the DPR reader, I am thinking about adding a new Reader module, making pygaggle a mix of reranker and reader.

@KaiSun314
Basically, what pygaggle is doing right now is using a Reranker to rerank a list of Text objects for a Query. https://github.com/castorini/pygaggle/blob/master/pygaggle/rerank/base.py

To implement the Reader module, we want a Reader to read a list of Context objects for a Question and give an Answer, similar to what we did in bertserini (https://github.com/rsvp-ai/bertserini/blob/development/bertserini/reader/base.py).

I guess here we can use the Text class as the Context and the Query class as the Question.

So concretely, we need to define a data structure class for Answer and an abstract class Reader; DPRReader should implement Reader.

When implementing the DPRReader, we can use the API from transformers directly (https://huggingface.co/transformers/model_doc/dpr.html#dprreader), which will give us the logits for the start/end positions.
To get the answer span from the logits, we can learn from how the dpr repo did it here.
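
A rough sketch of what these abstractions could look like (class and method names here are placeholders, not a final API):

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

from pygaggle.rerank.base import Query, Text


@dataclass
class Answer:
    # An extracted answer span plus a score for ranking candidate answers.
    text: str
    score: float = 0.0


class Reader(ABC):
    # Abstract reader: given a Question (Query) and candidate Contexts (Text),
    # produce a ranked list of Answers.
    @abstractmethod
    def read(self, question: Query, contexts: List[Text]) -> List[Answer]:
        ...


class DPRReader(Reader):
    def read(self, question: Query, contexts: List[Text]) -> List[Answer]:
        # Would wrap the DPR reader from transformers, score the start/end
        # logits, and extract the best answer span from each context.
        raise NotImplementedError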

CC: @lintool @ronakice

--model-name-or-path

I just noticed this... seriously? How about just --model, and note in the description that it can refer to either a name or a path?

On behalf of all future users who I've just saved a bunch of typing... you're welcome.

Writing replication guide to train monoBERT from scratch

We probably want to write this against Compute Canada resources - all Waterloo students can have access to it.

Once we get this set up, we can create the parallelism for lots of simple experiments - like all the various BERTology experiments we talked about...

Can't run the code in the readme

Hi everyone,

I'm kind of new to BERT and T5, but I would like to test them using pygaggle. I installed pygaggle inside a conda environment. But when I try to execute the following code

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoBERT

reranker = MonoBERT()

I got the following core dump.

2020-11-24 15:47:13 [INFO] file_utils: PyTorch version 1.7.0 available.

A fatal error has been detected by the Java Runtime Environment:

SIGILL (0x4) at pc=0x00007fb9c84a4fb1, pid=20726, tid=20726

JRE version: OpenJDK Runtime Environment (11.0.8) (build 11.0.8-internal+0-adhoc..src)

Java VM: OpenJDK 64-Bit Server VM (11.0.8-internal+0-adhoc..src, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)

Problematic frame:

C [pywrap_tensorflow_internal.so+0xc847fb1] nsync::nsync_mu_init(nsync::nsync_mu_s*)+0x1

Core dump will be written. Default location: Core dumps may be processed with "/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h" (or dumping to /xxxxxx/test-pygaggle/core.20726)

An error report file with more information is saved as:

/xxxxxx/test-pygaggle/hs_err_pid20726.log

If you would like to submit a bug report, please visit:

https://bugreport.java.com/bugreport/crash.jsp

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug.

/var/spool/slurm/slurmd.spool/job1290570/slurm_script: line 9: 20726 Aborted (core dumped) python main.py

So, are there any hints on how to solve this issue?

CovidQA Replication Issues

I tried replicating CovidQA experiments on Colab (Tesla K80), Compute Canada (Tesla V100) and locally (GTX1650) on #134 with the index from https://www.dropbox.com/s/z8s0urul6l4zig2/lucene-index-cord19-paragraph-2020-05-12.tar.gz?dl=1

Re-Ranking with Random (I got the same results)

python -um pygaggle.run.evaluate_kaggle_highlighter --method random --dataset data/kaggle-lit-review-0.2.json --index-dir indexes/lucene-index-cord19-paragraph-2020-05-12

precision@1	0.0
recall@3	0.0199546485260771
recall@50	0.3247165532879819
recall@1000	1.0
mrr	0.03999734528458418
mrr@10	0.020888672929489253

python -um pygaggle.run.evaluate_kaggle_highlighter --method random --split kq --dataset data/kaggle-lit-review-0.2.json --index-dir indexes/lucene-index-cord19-paragraph-2020-05-12

precision@1	0.0
recall@3	0.0199546485260771
recall@50	0.3247165532879819
recall@1000	1.0
mrr	0.03999734528458418
mrr@10	0.020888672929489253

Re-Ranking with BM25

I got the following error on the three machines:

File "/pygaggle/pygaggle/model/evaluate.py", line 161, in evaluate
    scores = [x.score for x in self.reranker.rerank(example.query,
File "/pygaggle/pygaggle/rerank/bm25.py", line 46, in rerank
    idfs = {w:
File "/pygaggle/pygaggle/rerank/bm25.py", line 48, in <dictcomp>
    text.metadata['docid'], w) for w in tf}
KeyError: 'docid'

I replaced text.metadata['docid'] with text.title['docid'] in /pygaggle/pygaggle/rerank/bm25.py and got the same results for the two commands:

python -um pygaggle.run.evaluate_kaggle_highlighter --method bm25 --dataset data/kaggle-lit-review-0.2.json --index-dir indexes/lucene-index-cord19-paragraph-2020-05-12

precision@1	0.15384615384615385
recall@3	0.21865889212827985
recall@50	0.7208778749595076
recall@1000	0.7582928409459021
mrr	0.25329970378011524
mrr@10	0.23344131303314977

python -um pygaggle.run.evaluate_kaggle_highlighter --method bm25 --split kq --dataset data/kaggle-lit-review-0.2.json --index-dir indexes/lucene-index-cord19-paragraph-2020-05-12

precision@1	0.15384615384615385
recall@3	0.21865889212827985
recall@50	0.7208778749595076
recall@1000	0.7582928409459021
mrr	0.25441237140238665
mrr@10	0.23493413238311195

Re-Ranking with monoT5

I tried with Python 3.6.9, 3.7.3, and 3.8 with the corresponding requirements and got the following error in all cases:
(Besides changing the torch version, I have not tried looking into this error.)

 File "/pygaggle/pygaggle/run/evaluate_kaggle_highlighter.py", line 193, in <module>
    main()
  File "/pygaggle/pygaggle/run/evaluate_kaggle_highlighter.py", line 182, in main
    reranker = construct_map[options.method](options)
  File "/pygaggle/pygaggle/run/evaluate_kaggle_highlighter.py", line 81, in construct_t5
    model = loader.load().to(device).eval()
  File "/pygaggle/pygaggle/model/serialize.py", line 76, in load
    return self._fix_t5_model(T5ForConditionalGeneration.from_pretrained(
  File "/pygaggle/pygaggle/model/serialize.py", line 34, in _fix_t5_model
    model.decoder.block[0].layer[1].EncDecAttention.\
  File ".../.../torch/nn/modules/module.py", line 778, in __getattr__
    raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
torch.nn.modules.module.ModuleAttributeError: 'T5Attention' object has no attribute 'relative_attention_bias'

Score ranges

Hi,

This is a question rather than an issue. I was wondering what is the expected range of Pygaggle score?

From the paper (Document Ranking with a Pretrained Sequence-to-Sequence Model) it comes across as if it should be the probability of the relevant label obtained from the softmax layer, so [0,1]. However, when I use it on my dataset (not from the paper) with Pyserini as described in your main readme, I get scores from reranked[i].score in the range [-1.47, -0.04]. Could you explain why this is happening? Is it as intended?

Also, can I use these pygaggle scores to set a relevance threshold and filter out pairs with low scores as irrelevant across different datasets?

Many thanks,
Elena

seems like no reranking happens

Hello,
When I try to run the given example, the documents show up in the same order and no reranking happens. I tried with the two models and with different document samples; the order does not change at all.
So if I run the given example in the last section of the readme, should I get reranked passages? Why am I not seeing any differences in document order?

Reranking example from the README no longer works

Hi everyone,

The simple reranking example from the README no longer runs.
On running the code, this is the error I receive

Traceback (most recent call last):
  File "rerank.py", line 15, in <module>
    print(f'{i+1:2} {texts[i].metadata["docid"]:15} {texts[i].score:.5f} {texts[i].text}')
TypeError: 'int' object is not subscriptable

Inspecting the 1st Text object, I see the following output

{
   "text": "For Earth-centered it was  Geocentric Theory proposed by greeks under the guidance of Ptolemy and Sun-centered was Heliocentric theory proposed by Nicolas Copernicus in 16th century A.D. In short, Your Answers are: 1st blank - Geo-Centric Theory. 2nd blank - Heliocentric Theory.",
   "metadata": 0,
   "score": 0,
   "title": {
      "docid":"7744105"
   }
}

It turns out that, following commit 6638f6e, the Text class in base.py takes 'title' as an additional input.

So, the example in the README needs to be modified to the following
texts = [ Text(p[1], "", {'docid': p[0]}, 0) for p in passages]

Can you please update the example in the README?

Thanks!

Attribute Error

Hi, I am using MonoT5 model to rerank my own collection. However I am getting the following error:

Traceback (most recent call last):
  File "pygaggle/passaget5.py", line 100, in <module>
    reranked_total = reranker.rerank(query_total, texts)
  File "/usr/src/app/pygaggle/pygaggle/rerank/transformer.py", line 60, in rerank
    return_last_logits=True)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/usr/src/app/pygaggle/pygaggle/model/decode.py", line 27, in greedy_decode
    use_cache=True)
TypeError: prepare_inputs_for_generation() missing 1 required positional argument: 'encoder_outputs'
Exception ignored in: <bound method Buckets.__del__ of <tensorflow.python.eager.monitoring.ExponentialBuckets object at 0x7f4864fa5108>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/monitoring.py", line 407, in __del__
AttributeError: 'NoneType' object has no attribute TFE_MonitoringDeleteBuckets

My code is the following, and I have checked that the query_total and texts variables are non-empty:

doc_passages = [p for p in doc_passages if p]
texts = [ Text(p, None, 0) for p in doc_passages]
               
query_total = Query(hand_expression_total)
print("Query total", query_total)
print("Texts", texts[0])
reranked_total = reranker.rerank(query_total, texts)
reranked_total.sort(key=lambda x: x.score, reverse=True)

Add version info in CovidQA dataset

The dataset should have its version somewhere inside it. Also, the version should be in the file name.

Once released, the file becomes immutable. We make copies as we improve.
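
For example, the release could carry the version both in the file name (as the kaggle-lit-review-0.2.json naming used elsewhere on this page already does) and as a top-level field; a hypothetical sketch:

# Hypothetical layout: record the version inside the file as well as in its name
# (e.g., kaggle-lit-review-0.2.json).
ground_truth = {
    'version': '0.2',
    'categories': [
        # ... annotated categories as before ...
    ],
}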
