texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.

Home Page: http://tevatron.ai

License: Apache License 2.0

Python 98.82% Shell 1.18%
pytorch transformer dense-retrieval dpr information-retrieval question-answering jax flax

tevatron's Introduction

Tevatron V2

Tevatron aims to provide a flexible and efficient toolkit that enables training and inference for neural retrieval models at scale.

Some features of Tevatron v1 have not yet been migrated to Tevatron v2; we are working on it. If you are looking for Tevatron v1 features, please check out the v1 branch.

Features

  • Training billion-parameter-scale LLM neural retrievers on GPUs and TPUs.
  • Parameter efficient tuning with LoRA.
  • Integration with DeepSpeed, flash attention, gradient accumulation, and other efficient training techniques.
  • Self-contained datasets for neural retrieval and open-domain QA tasks.
  • Direct loading and fine-tuning of SoTA pre-trained models (BGE-Embedding, Instruct-E5) from HuggingFace.

Installation

PyTorch (GPU)
  1. Clone the repository.
  2. Install PyTorch for your CUDA version from the official PyTorch website.
  3. Install dependencies and Tevatron.
pip install transformers datasets peft
pip install deepspeed accelerate
pip install faiss-cpu
pip install -e .
JAX (TPU)
  1. Clone the repository.
  2. Install JAX by following the official guide.
  3. Install dependencies:
pip install transformers datasets
pip install flax optax
  4. Install Magix and GradCache:
git clone https://github.com/luyug/magix.git
cd magix && pip install -e . && cd ..
git clone https://github.com/luyug/GradCache.git
cd GradCache && pip install -e . && cd ..
  5. Install Tevatron:
pip install -e .
JAX (GPU)

To run the JAX implementation of Tevatron on GPU, we encourage using the JAX container image from NVIDIA's jax-toolbox.

Below is a Dockerfile example to set up Tevatron on top of the jax container.

FROM ghcr.io/nvidia/jax:jax-2024-03-08

RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    pip install --no-cache-dir transformers sentencepiece simple_parsing datasets orbax==0.4.8 && \
    pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu

RUN git clone https://github.com/luyug/magix.git && \
    cd magix && pip install -e . && cd .. && \
    git clone https://github.com/luyug/GradCache.git && \
    cd GradCache && pip install -e . && cd .. && \
    git clone https://github.com/texttron/tevatron.git && \
    cd tevatron && pip install -e .

Tevatron 101

In this example, we demonstrate how to use Tevatron to LoRA fine-tune a Mistral-7B model on the MS MARCO passage dataset. The resulting LLM retriever is expected to reach MRR@10 = 42.3 on the MS MARCO dev set with this straightforward training recipe.

Data Preparation

Tevatron takes training and inference data in JSONL format, with each line being a JSON object organized as follows:

1. Training Data

{
   "query_id": "<query id>",
   "query": "<query text>",
   "positive_passages": [
     {"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"},
     ...
   ],
   "negative_passages": [
     {"docid": "<passage id>", "title": "<passage title>", "text": "<passage body>"},
     ...
   ]
}

where the passages in positive_passages are the annotated relevant passages for the query, and the passages in negative_passages are typically non-relevant (hard negative) passages drawn from the top results of a retrieval system (e.g. BM25, DPR). Additional fields, such as answers for QA datasets, can be included as well.

2. Corpus Data

{
   "docid": "<passage id>",
   "title": "<passage title>",
   "text": "<passage body>"
}

where each line represents a passage in the corpus.
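For reference, below is a minimal Python sketch (not part of Tevatron; file names and field values are made up for illustration) that writes one record of each type in this JSONL format:

import json

# One training example: a query with its annotated positive passages
# and mined hard-negative passages.
train_example = {
    "query_id": "q-1",
    "query": "what is a neural retriever",
    "positive_passages": [
        {"docid": "d-1", "title": "Neural retrieval", "text": "A neural retriever encodes queries and passages into dense vectors ..."},
    ],
    "negative_passages": [
        {"docid": "d-7", "title": "Unrelated topic", "text": "This passage is not relevant to the query ..."},
    ],
}

# One corpus entry: a single passage.
corpus_entry = {"docid": "d-1", "title": "Neural retrieval", "text": "A neural retriever encodes queries and passages into dense vectors ..."}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(train_example) + "\n")

with open("corpus.jsonl", "w") as f:
    f.write(json.dumps(corpus_entry) + "\n")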

Self-Contained Dataset

Tevatron ships with several commonly used datasets for neural retrieval, self-contained via HuggingFace. These datasets are downloaded automatically during training and encoding when you set --dataset_name <hf dataset name>.

In this example, we use the self-contained dataset Tevatron/msmarco-passage-aug for training, whose hard negative passages are sampled from a mix of the top-200 BM25 and top-200 CoCondenser results.
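If you want to peek at a self-contained dataset outside of Tevatron, here is a short sketch using the HuggingFace datasets library (field names follow the training format documented above; treat this as illustrative rather than the toolkit's own loading code):

from datasets import load_dataset

# Download the self-contained training set from the HuggingFace Hub.
train_data = load_dataset("Tevatron/msmarco-passage-aug", split="train")

example = train_data[0]
print(example["query"])
print(len(example["positive_passages"]), "positives,",
      len(example["negative_passages"]), "hard negatives")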

Run with PyTorch (GPU)

Training

deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 8 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 4

In-batch passages per query: 8 x 4 x 16 = 512

Number of queries per update: 8 x 4 x 4 = 128

The above training setting takes about 70 hours on 4x A6000 GPUs.

Equivalent training takes about 110 hours on a single A100 GPU.
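The batch accounting above follows directly from the training flags; a quick arithmetic sanity check in Python (not part of the toolkit):

per_device_train_batch_size = 8   # queries per GPU per step
num_gpus = 4                      # --include localhost:0,1,2,3
gradient_accumulation_steps = 4
train_group_size = 16             # 1 positive + 15 negatives per query

queries_per_step = per_device_train_batch_size * num_gpus            # 32
in_batch_passages = queries_per_step * train_group_size              # 8 x 4 x 16 = 512
queries_per_update = queries_per_step * gradient_accumulation_steps  # 8 x 4 x 4 = 128
print(in_batch_passages, queries_per_update)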

Encoding

Query Encoding

EMBEDDING_OUTPUT_DIR=<folder to save query embedding>
CUDA_VISIBLE_DEVICES=4 python -m tevatron.retriever.driver.encode \
  --output_dir=temp \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --lora_name_or_path retriever-mistral \
  --lora \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --encode_is_query \
  --per_device_eval_batch_size 128 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --dataset_name Tevatron/msmarco-passage \
  --dataset_split dev \
  --encode_output_path $EMBEDDING_OUTPUT_DIR/query-dev.pkl

Corpus Encoding

EMBEDDING_OUTPUT_DIR=<folder to save corpus embeddings>
for s in 0 1 2 3
do
gpuid=$s
CUDA_VISIBLE_DEVICES=$gpuid python -m tevatron.retriever.driver.encode \
  --output_dir=temp \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --lora_name_or_path retriever-mistral \
  --lora \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --per_device_eval_batch_size 128 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --dataset_name Tevatron/msmarco-passage-corpus \
  --dataset_number_of_shards 4 \
  --dataset_shard_index ${s} \
  --encode_output_path $EMBEDDING_OUTPUT_DIR/corpus.${s}.pkl
done

Add & to the end of the command to run the shards in the background in parallel.

Retrieval

set -f && python -m tevatron.retriever.driver.search \
    --query_reps $EMBEDDING_OUTPUT_DIR/query-dev.pkl \
    --passage_reps $EMBEDDING_OUTPUT_DIR/corpus*.pkl \
    --depth 1000 \
    --batch_size 64 \
    --save_text \
    --save_ranking_to $EMBEDDING_OUTPUT_DIR/run.dev.txt

Each line of the output file is in the format <query_id> <passage_id> <score>.
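If you need to post-process the run file, a few lines of Python suffice; the sketch below (a hypothetical helper, not part of Tevatron) groups the whitespace-separated lines by query and sorts each list by score:

from collections import defaultdict

ranking = defaultdict(list)
with open("run.dev.txt") as f:
    for line in f:
        query_id, passage_id, score = line.split()
        ranking[query_id].append((passage_id, float(score)))

# Sort each query's hits by descending score and keep the top 10.
top10 = {qid: sorted(hits, key=lambda x: x[1], reverse=True)[:10]
         for qid, hits in ranking.items()}
print(next(iter(top10.items())))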

Run with JAX (TPU/GPU)

Training

For GPU training, set XLA_PYTHON_CLIENT_MEM_FRACTION=.95 and make sure the query and passage lengths are multiples of 64 if Transformer Engine is installed.

python -m tevatron.tevax.experimental.mp.train_lora  \
   --checkpoint_dir retriever-mistral-jax \
   --train_file Tevatron/msmarco-passage-aug \
   --model_name mistralai/Mistral-7B-v0.1 \
   --model_type mistral \
   --batch_size 128 \
   --num_target_passages 16 \
   --learning_rate 1e-4 \
   --seed 12345 \
   --mesh_shape 1 -1 \
   --weight_decay 0.00001 \
   --num_epochs 1 \
   --max_query_length 64 \
   --max_passage_length 128 \
   --pooling eos \
   --scale_by_dim True \
   --grad_cache \
   --passage_num_chunks 32 \
   --query_num_chunks 4

In-batch passages per query: 128 x 16 = 2048

Number of queries per update: 128

The above training setting takes about 35 hours on a v4-8 TPU VM.

Equivalent training takes about 80 hours on a single A100 GPU.
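With --grad_cache enabled, the large contrastive batch is processed in smaller chunks to bound memory. Below is a rough sketch of the chunk sizes implied by the flags above, assuming the chunks split the batch evenly (an arithmetic illustration, not toolkit code):

batch_size = 128            # queries per update
num_target_passages = 16    # passages per query
passage_num_chunks = 32
query_num_chunks = 4

total_passages = batch_size * num_target_passages           # 128 x 16 = 2048
passages_per_chunk = total_passages // passage_num_chunks   # 64 passages per forward pass
queries_per_chunk = batch_size // query_num_chunks          # 32 queries per forward pass
print(total_passages, passages_per_chunk, queries_per_chunk)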

Encoding

Query Encoding

python -m tevatron.tevax.experimental.mp.encode  \
   --model_type mistral \
   --model_name_or_path mistralai/Mistral-7B-v0.1 \
   --model_config_name_or_path mistralai/Mistral-7B-v0.1 \
   --tokenizer_name_or_path mistralai/Mistral-7B-v0.1 \
   --dataset_name_or_path Tevatron/msmarco-passage \
   --split dev \
   --output_dir $EMBEDDING_OUTPUT_DIR/query-embedding \
   --batch_size 32 \
   --input_type query \
   --max_seq_length 64 \
   --mesh_shape 1 -1 \
   --lora retriever-mistral-jax/lora \
   --scale_by_dim

Corpus Encoding

python -m tevatron.tevax.experimental.mp.encode  \
   --model_type mistral \
   --model_name_or_path mistralai/Mistral-7B-v0.1 \
   --model_config_name_or_path mistralai/Mistral-7B-v0.1 \
   --tokenizer_name_or_path mistralai/Mistral-7B-v0.1 \
   --dataset_name_or_path Tevatron/msmarco-passage-corpus \
   --output_dir $EMBEDDING_OUTPUT_DIR/corpus-embedding \
   --batch_size 32 \
   --input_type passage \
   --max_seq_length 128 \
   --mesh_shape 1 -1 \
   --lora retriever-mistral-jax/lora \
   --scale_by_dim

Retrieval

set -f && python -m tevatron.retriever.driver.search \
    --query_reps $EMBEDDING_OUTPUT_DIR/query-embedding/*.pkl \
    --passage_reps $EMBEDDING_OUTPUT_DIR/corpus-embedding/*.pkl \
    --depth 1000 \
    --batch_size 64 \
    --save_text \
    --save_ranking_to $EMBEDDING_OUTPUT_DIR/run.dev.txt

Each line of the output file is in the format <query_id> <passage_id> <score>.

Citation

If you find Tevatron helpful, please consider citing our paper.

@article{Gao2022TevatronAE,
  title={Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval},
  author={Luyu Gao and Xueguang Ma and Jimmy J. Lin and Jamie Callan},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.05765}
}

Contacts

If you have a toolkit-specific question, feel free to open an issue.

You can also reach out to us for general comments/suggestions/questions through email.

Acknowledgement

  • We thank all the contributors of dependency libraries.
  • We thank Google's TPU research cloud for providing TPU resources.

tevatron's People

Contributors

arvinzhuang, cadurosar, crystina-z, huangwei0102, jasonwu-0803, jasper-xian, jordane95, kpriyanshu256, luyug, mxueguang, rgdyy, silencio94, xueguangma


tevatron's Issues

About BM25 hard negatives

Hi :)
Thank you for your great work!
I read this issue (#66 (comment)).
In this description,

This downloads the cleaned corpus hosted by RocketQA team, generate BM25 negatives and tokenize train/inference data using BERT tokenizer. \
The process could take up to tens of minutes depending on connection and hardware.

does this mean "Tevatron/msmarco-passage" is built from the cleaned MS MARCO? @MXueguang

I would like to know whether it uses just BM25 hard negatives or hard negatives with some added treatment.

cross-encoder

Hello, does tevatron currently only support the structure of bi-encoder (query and passage are encoded separately)?

Error when downloading dataset (msmarco-passage) and training unicoil with the --train_dir param

Hi,
Instead of connecting directly to tevatron/msmarco-passage on HuggingFace, I downloaded the dataset and ran the following:

import os

def run():
    os.system("CUDA_VISIBLE_DEVICES=0 "
              "python3 train_unicoil.py "
              "--output_dir unicoil_distilbert "
              "--model_name_or_path distilbert-base-uncased "
              "--save_steps 20000 "
              "--train_dir /train.jsonl.gz "
              "--per_device_train_batch_size 8 "
              "--train_n_passages 8 "
              "--learning_rate 5e-6 "
              "--q_max_len 16 "
              "--p_max_len 128 "
              "--num_train_epochs 3 "
              "--add_pooler "
              "--projection_in_dim 768 "
              "--projection_out_dim 1 "
              "--logging_steps 500 "
              "--overwrite_output_dir")

if __name__ == '__main__':
    run()

But I get the following error:

Traceback (most recent call last):
File "train_unicoil.py", line 102, in
main()
File "train_unicoil.py", line 95, in main
trainer.train() # TODO: resume training
File "/src/tevatron/examples/unicoil/test_teva/lib/python3.8/site-packages/transformers/trainer.py", line 1500, in train
return inner_training_loop(
File "/src/tevatron/examples/unicoil/test_teva/lib/python3.8/site-packages/transformers/trainer.py", line 1716, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/src/tevatron/examples/unicoil/test_teva/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in next
data = self._next_data()
File "/src/tevatron/examples/unicoil/test_teva/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 721, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/src/tevatron/examples/unicoil/test_teva/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/src/tevatron/examples/unicoil/test_teva/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/src/tevatron/data.py", line 53, in getitem
encoded_query = self.create_one_example(qry, is_query=True)
File "/src/tevatron/src/tevatron/data.py", line 33, in create_one_example
item = self.tok.prepare_for_model(
File "/src/tevatron/examples/unicoil/test_teva/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3121, in prepare_for_model
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
File "/src/tevatron/examples/unicoil/test_teva/lib/python3.8/site-packages/transformers/models/bert/tokenization_bert.py", line 289, in build_inputs_with_special_tokens
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
TypeError: can only concatenate list (not "str") to list

If I run the unicoil example with the param:
--dataset_name Tevatron/msmarco-passage
there are no problems, but I would like to be able to use the param:
--train_dir /train.jsonl.gz

What can I do to fix the issue?

Questions about training data preprocessing

Hi!

I noticed that "attention_mask" was ignored when preprocessing the training data, as shown in the code of the file src/tevatron/data.py.

class TrainDataset(Dataset):
    def create_one_example(self, text_encoding: List[int], is_query=False):
        item = self.tok.prepare_for_model(
            text_encoding,
            truncation='only_first',
            max_length=self.data_args.q_max_len if is_query else self.data_args.p_max_len,
            padding=False,
            return_attention_mask=False,
            return_token_type_ids=False,
        )
        return item

I found that some other dense retrieval codebases do not do this, so I want to ask what the reason is for designing the code like this.

Thanks for your answer:)

Simplifying evaluation process

Right now, it's possible to train DPR in a single command, via the tevatron.driver.train module. However, to evaluate, a more complex series of commands (involving lower-level for loops) is needed, e.g. for DPR on NQ:

mkdir $ENCODE_DIR
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_nq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-nq-corpus \
  --encoded_save_path corpus_emb.$s.pkl \
  --encode_num_shard 20 \
  --encode_shard_index $s
done

python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_nq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-nq/test \
  --encoded_save_path query_emb.pkl \
  --encode_is_qry

python -m tevatron.faiss_retriever \
--query_reps query_emb.pkl \
--passage_reps 'corpus_emb.*.pkl' \
--depth 100 \
--batch_size -1 \
--save_text \
--save_ranking_to run.nq.test.txt

python -m tevatron.utils.format.convert_result_to_trec \
              --input run.nq.test.txt \
              --output run.nq.test.trec

pip install pyserini

python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \
              --topics dpr-nq-test \
              --index wikipedia-dpr \
              --input run.nq.test.trec \
              --output run.nq.test.json

python -m pyserini.eval.evaluate_dpr_retrieval \
                --retrieval run.nq.test.json \
                --topk 20 100

I think it would be nicer if all this could be reduced to 1 or 2 commands:

pip install pyserini

python -m tevatron.driver.evaluate \
    --output_dir "temp" \
    --model_name_or_path "model_nq" \
    ...
    --query_dataset "Tevatron/wikipedia-nq/" \
    --passage_dataset "Tevatron/wikipedia-nq/test" \
    --save_ranking_to "nq_results/test/" \
    --encode_method "faiss" \
    --save_format "trec" "pyserini_dpr"  # save in both .trec and .json

python -m pyserini.eval.evaluate_dpr_retrieval \
                --retrieval "nq_results/test/run.json" \
                --topk 20 100

Note the usage of tevatron.driver.evaluate in order to keep driver.encode at a lower level and backward compatible, while evaluate would be for higher-level usage like reproducing results. Moreover, tevatron.driver.evaluate could throw an error if pyserini is not available, e.g.:

ImportError: could not import pyserini, a library needed to save as format "pyserini_dpr". Please install with `pip install pyserini`

Receiving a `JSONDecodeError` when running `tevatron.driver.encode` on WQ dataset

I have first used tevatron to train DPR from bert-base-uncased:

python -m torch.distributed.launch --nproc_per_node=1 -m tevatron.driver.train \
  --output_dir model_wq \
  --dataset_name Tevatron/wikipedia-wq \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --fp16 \
  --per_device_train_batch_size 128 \
  --train_n_passages 2 \
  --learning_rate 1e-5 \
  --q_max_len 32 \
  --p_max_len 156 \
  --num_train_epochs 40 \
  --negatives_x_device \
  --overwrite_output_dir

After the model was saved to model_wq/ (see footnote), I continued to follow the instructions to encode the passages:

export ENCODE_DIR="wq_corpus_encoded"

mkdir $ENCODE_DIR
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_wq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-wq-corpus \
  --encoded_save_path corpus_emb.$s.pkl \
  --encode_num_shard 20 \
  --encode_shard_index $s
done

I saved that inside a bash file and ran it, but I got multiple JSONDecodeErrors along the way, which does not seem expected (which is why I stopped the process):

$ bash encode_wq_corpus.sh 
mkdir: cannot create directory ‘wq_corpus_encoded’: File exists
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder -   try loading tied weight
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder -   loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4573.94it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 429.92it/s]
Traceback (most recent call last):                                   
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
    main()
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
    cache_dir=data_args.data_cache_dir or model_args.cache_dir)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
    data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1210, in _prepare_split
    desc=f"Generating {split_info.name} split",
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/tmp/.cache/huggingface/modules/datasets_modules/datasets/Tevatron--wikipedia-wq-corpus/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033/wikipedia-wq-corpus.py", line 82, in _generate_examples
    data = json.loads(line)
  File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 30 (char 29)
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder -   try loading tied weight
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder -   loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5849.80it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 517.24it/s]
Traceback (most recent call last):                                  ^C
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
    main()
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
    cache_dir=data_args.data_cache_dir or model_args.cache_dir)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
    data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1212, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1579, in encode_example
    return encode_nested_example(self, example)
  File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1136, in encode_nested_example
    def encode_nested_example(schema, obj, level=0):
KeyboardInterrupt

Is this normal?

Libraries

This is my requirements file:

git+https://github.com/texttron/tevatron@b8f33900895930f9886012580e85464a5c1f7e9a
torch==1.12.*
faiss-cpu==1.7.2
transformers==4.15.0
datasets==1.17.0
pyserini

Footnote

  • I originally saved it as model_nq but renamed it to model_wq, I don't think this makes a difference but if it does let me know.
  • I also tested with wikipedia-nq and with both the latest version on master and also with the 0.1 version on pypi and I'm getting the same error.

RAG Implementation

Hi,
Any plan to implement RAG? RAG already has TF code in the HuggingFace library, and it actually uses PyTorch Lightning. RAG achieves its best performance with BART-large trained for 100 epochs, which indicates that working TPU code is much needed, especially since it requires a large index that may not be feasible for regular researchers with GPUs; a TPUv3-8 VM has 96 CPUs and 350 GB of RAM, but no GPU.

Error when trying to train on msmarco-passage.

Hi, I am currently trying to train a retriever using tevatron.driver.train, but I'm getting an error and I don't know how to solve it. Here is the traceback of the error:

Traceback (most recent call last):
  File "/home/mboutchouang/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/mboutchouang/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dstore/home/mboutchouang/tevatron/tevatron/src/tevatron/driver/train.py", line 104, in <module>
    main()
  File "/dstore/home/mboutchouang/tevatron/tevatron/src/tevatron/driver/train.py", line 85, in main
    trainer = trainer_cls(
  File "/dstore/home/mboutchouang/tevatron/tevatron/src/tevatron/trainer.py", line 105, in __init__
    scaler=self.scaler
AttributeError: 'GCTrainer' object has no attribute 'scaler'

Does someone have an idea?

Missing UniCOIL encoding file

Following the UniCOIL instructions, it seems like the encode_unicoil.py file is missing from the examples/unicoil folder. Am I missing something?

Update PyPI

We should examine our development progress and update Tevatron on PyPI to propagate recent enhancements and bug fixes.

Reproduce Condenser Result on MSMARCO passage ranking

Hi, wonderful work on this toolkit! I really like it!

Following the README here, I use the following command to train the retriever with Condenser on 2 GPUs, which results in a total batch size of 64, the same setting as in the paper:

python -m tevatron.driver.train \
  --output_dir ./output_model \
  --model_name_or_path Luyu/condenser \
  --save_steps 20000 \
  --fp16 \
  --train_dir ../marco/bert/train \
  --per_device_train_batch_size 32 \
  --learning_rate 5e-6 \
  --num_train_epochs 3 \
  --dataloader_num_workers 2

The result I got is 0.331:

#####################
MRR @ 10: 0.3308558466366481
QueriesRanked: 6980
#####################

Is there any parameter I missed setting?
Thanks!

Feature wanted

It would be good to have validation during training.

I understand this might be hard for dense retrievers, as encoding the collection is expensive.
But I guess we could do something like using the DR model to rerank the top-k docs for a subset of dev queries? I think the reranking score is quite correlated with the retrieval score.

ms marco passage ranking example raises error

I'm trying to reproduce this example: https://github.com/texttron/tevatron/tree/main/examples/msmarco-passage-ranking

Training and encoding are all good, however, retrieval with faiss raises an error:

  File "..../tevatron/src/tevatron/faiss_retriever/reducer.py", line 18, in combine_faiss_results
    rh.add_result(-scores, indices)
  File "..../python3.7/site-packages/faiss/__init__.py", line 1622, in add_result
    swig_ptr(I), self.k)
  File "..../python3.7/site-packages/faiss/swigfaiss.py", line 5700, in swig_ptr
    return _swigfaiss.swig_ptr(a)
ValueError: did not recognize array type

I think the reason for this is that the indices are actually numpy arrays with string ids, but faiss wants int64 ids.

Did this update break the code? @MXueguang

-    psg_indices = [[int(p_lookup[x]) for x in q_dd] for q_dd in all_indices]
+   psg_indices = [[str(p_lookup[x]) for x in q_dd] for q_dd in all_indices]

One lazy fix for this: indices = indices.astype(np.int64)

I'm using faiss-cpu==1.7.1

Is there a difference between query and passage encoder?

I understand the need to keep them separate, and that they have different weights after training on MS MARCO, but is there a difference in how they handle the input sequence? Specifically, when calling model = AutoModel.from_pretrained('co-condenser-marco'), what do we get: a query or a passage encoder? Or both, where we need to pass a special token type ID depending on whether a query or a passage is given as input? What does the option python -m tevatron.driver.encode --encode_is_qry practically do, compared to omitting it?

Error when training

I'm trying to train my local checkpoint using vinai/phobert from the HuggingFace hub, but I got this problem.
(image attached)
How can I solve it?

self.dataset.map in processors won't be cached

When using the self-contained training and corpus datasets, the pre-tokenization process is executed every time with the suggested library versions (torch==1.10.1, faiss-cpu==1.7.2, transformers==4.15.0, datasets==1.17.0).

The problem is that self.dataset.map in the processor classes is not caching properly; this seems to be a known issue of HuggingFace datasets: huggingface/datasets#4506

A solution to this is to reinstall datasets with a downgraded dill package: pip install datasets==1.17.0 "dill<0.3.5"

It seems this issue has been fixed in a newer version of datasets as well.

Issues indexing colbert example - pyserini.index.lucene (not solved) and tevatron.faiss_retriever (solved)

Hi,

I experienced issues when working with the colbert example.
I trained the model as per: https://github.com/texttron/tevatron/tree/main/examples/colbert

I then encoded the corpus and queries:

corpus:
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --p_max_len 128 \
  --dataset_name Tevatron/msmarco-passage-corpus \
  --encoded_save_path /corpus_emb_colbert/ \
  --encode_num_shard 20 \
  --encode_shard_index {s}

queries:
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --encode_is_qry \
  --q_max_len 32 \
  --dataset_name Tevatron/msmarco-passage/dev \
  --encoded_save_path /queries_emb.tsv

When trying to index using:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input /model_runs/corpus_emb_colbert \
  --index /model_runs/index_colbert \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized --optimize

it failed with the following message:

2022-11-15 09:06:18,438 ERROR [pool-2-thread-1] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-1: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (BufferedReader); line: 1, column: 2]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:710) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:635) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1952) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:781) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.databind.ObjectReader.readValues(ObjectReader.java:1874) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonCollection$Segment.(JsonCollection.java:107) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection$Segment.(JsonVectorCollection.java:39) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection.createFileSegment(JsonVectorCollection.java:34) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:151) [anserini-0.15.0-fatjar.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-11-15 09:06:18,438 ERROR [pool-2-thread-11] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-11: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')

I then tried using the tevatron.faiss_retriever as described in your guidelines for dense retrieval:

python -m tevatron.faiss_retriever \
  --query_reps /home/fdt672/model_runs/queries_emb_colbert_{train_split}/queries_emb_train_split_20.tsv \
  --passage_reps /home/fdt672/model_runs/corpus_emb_colbert_{train_split}/'*.jsonl' \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to /home/fdt672/model_runs/rank_colbert_{train_split}.txt

But it also failed with:

Traceback (most recent call last):
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/main.py", line 91, in
main()
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/main.py", line 74, in main
retriever.add(p_reps)
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/retriever.py", line 16, in add
self.index.add(p_reps)
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/init.py", line 215, in replacement_add
self.add_c(n, swig_ptr(x))
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/swigfaiss_avx2.py", line 1618, in add
return _swigfaiss_avx2.IndexFlatCodes_add(self, n, x)
TypeError: in method 'IndexFlatCodes_add', argument 3 of type 'float const *'

Solution:
As I understand it, the issue was that the values need to be float32 and not float16. So I made the following changes to faiss_retriever/retriever.py (in the tevatron library):

class BaseFaissIPRetriever:
    def __init__(self, init_reps: np.ndarray):
        index = faiss.IndexFlatIP(init_reps.shape[1])
        self.index = index

    def add(self, p_reps: np.ndarray):
        p_reps_float32 = p_reps.astype(np.float32)  # <------- issue with float16
        self.index.add(p_reps_float32)

    def search(self, q_reps: np.ndarray, k: int):
        q_reps_float32 = q_reps.astype(np.float32)  # <------- issue with float16
        return self.index.search(q_reps_float32, k)
    .....

the tevatron.faiss_retriever worked.

I am not sure if this is a good solution, but it solved my current issues with the colbert example (..?)

I would ideally like to build an index with my colbert model using pyserini.index.lucene. Do you have any suggestions for this?

Thanks a lot in advance :)

How to reproduce the results on NQ

Hi, @luyug

Thanks for your awesome work.
Is it possible to give more details on reproducing the results (MRR@5 = 84.3) on NQ in the paper, just like the detailed MS MARCO tutorial demo?

Looking forward to your reply.
Thanks.

[RuntimeError: Input, output and indices must be on the current device] when training with multiple GPUs

Bug

When following the MS MARCO passage ranking example there is a RuntimeError when using multiple GPUs for training.

Starting the training via

python -m tevatron.driver.train \
  --output_dir ./retriever_model \
  --model_name_or_path bert-base-uncased \
  --save_steps 20000 \
  --train_dir ./marco/bert/train \
  --fp16 \
  --per_device_train_batch_size 2 \
  --learning_rate 5e-6 \
  --num_train_epochs 2 \
  --dataloader_num_workers 2

produces:

RuntimeError: Input, output and indices must be on the current device

Note: When running the training with the above command and only one visible GPU, the training starts and runs correctly.

Full Error Message

Traceback (most recent call last):
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/driver/train.py", line 118, in <module>
    main()
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/driver/train.py", line 110, in main
    trainer.train(
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/transformers/trainer.py", line 1286, in train
    tr_loss += self.training_step(model, inputs)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/trainer.py", line 65, in training_step
    return super(DenseTrainer, self).training_step(*args) / self._dist_loss_scale_factor
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/transformers/trainer.py", line 1777, in training_step
    loss = self.compute_loss(model, inputs)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/trainer.py", line 62, in compute_loss
    return model(query=query, passage=passage).loss
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/modeling.py", line 107, in forward
    q_hidden, q_reps = self.encode_query(query)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/modeling.py", line 173, in encode_query
    qry_out = self.lm_q(**qry, return_dict=True)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 984, in forward
    embedding_output = self.embeddings(
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 215, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 156, in forward
    return F.embedding(
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/functional.py", line 1916, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device

Environment

python == 3.8.12
pytorch == 1.8.2
faiss-cpu == 1.7.1
transformers == 4.9.2
datasets == 1.11.0

CUDA Version: 10.1
Operating System: Debian GNU/Linux 10 (buster)
Kernel: Linux 4.19.0-18-amd64
GPUs: 4x GTX 1080Ti 11GB
CPU: Intel E5-2620v4

InfoLOOB Loss

Hi,
Thanks for the great work!

Do you think InfoLOOB (formulation here, implementation here) would be a good addition to this library? It seems to outperform InfoNCE in an image-text setting; I thought it might be worth experimenting with it on purely text-based IR tasks.

Weights and Biases (WandB) integration?

Is there a plan to add integration with Weights and Biases? It would be nice to be able to monitor the progress of training and evaluation all in one place. Maybe it could be made optional with a flag --wandb="project-name" and only imported when that flag is active.

Reproduction issue of coCondenser NQ

I use the hard negatives (hn.bert.json) you provided and I can reproduce R@5 = 75.8.
But when I train with my own hard negatives, R@5 is only 64.3.

How do you generate hard negatives for NQ? Could you provide a reproduction setup?

Here is my setup for mining hard negatives:
Model: co-condenser-wiki trained with bm25-negative
Negative depth: 200
Negative sample: 30

Looking forward to your reply!!! Thank you!

coCondenser MS-MARCO Passage Retrieval example raises error

I'm trying to reproduce this example: https://github.com/texttron/tevatron/tree/main/examples/coCondenser-marco

Inference with the fine-tuned checkpoint (encode and index-search) is all good; however, fine-tuning stage 1 (training) raises an error:
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/dmx/.cache/huggingface/datasets/json/default-e642d34fc5e4ebf2/0.0.0/793d004298099bd3c4e61eb7878475bcf1dc212bf2e34437d85126758720d7f9...
10/29/2021 10:44:35 - WARNING - datasets.builder - Using custom data configuration default-e642d34fc5e4ebf2
Traceback (most recent call last):
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/tevatron/driver/train.py", line 117, in <module>
    main()
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/tevatron/driver/train.py", line 82, in main
    train_dataset = TrainDataset(
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/tevatron/data.py", line 29, in __init__
    self.train_data = datasets.load_dataset(
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/load.py", line 856, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/builder.py", line 583, in download_and_prepare
    self._download_and_prepare(
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/builder.py", line 639, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/dmx/anaconda3/lib/python3.8/site-packages/datasets-1.9.0-py3.8.egg/datasets/packaged_modules/json/json.py", line 46, in _split_generators
    raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
ValueError: At least one data file must be specified, but got data_files=None

I think the problem may come from lines 25~32 of tevatron/driver/train.py, but I don't know the specific reason or how to solve it:
if isinstance(path_to_data, datasets.Dataset):
    self.train_data = path_to_data
else:
    self.train_data = datasets.load_dataset(
        'json',
        data_dir=path_to_data,
        ignore_verifications=False,
    )['train']

Issues running DPR using tevatron.

I am trying to reproduce the DPR-on-NQ benchmark, but following the readme leads to errors. Any idea what's up?

les.json.json - Failed to read file '/home/spacemanidol/tevatron/nq/biencoder-nq-train.json' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column() changed from object to array in row 0
for obj in iterable:
File "/home/spacemanidol/.local/lib/python3.8/site-packages/datasets/packaged_modules/json/json.py", line 140, in _generate_tables
f"Not able to read records in the JSON file at {file}. "
AttributeError: 'list' object has no attribute 'keys'

I can't load data from Huggingface

I ran the train command:
python -m tevatron.driver.train \
  --output_dir ./retriever_model_s1 \
  --model_name_or_path Luyu/co-condenser-marco \
  --save_steps 10000 \
  --dataset_name Tevatron/msmarco-passage-corpus \
  --train_dir ./marco/bert/train \
  --fp16 \
  --per_device_train_batch_size 8 \
  --learning_rate 5e-6 \
  --num_train_epochs 3 \
  --dataloader_num_workers 2

but there is an issue loading the dataset from HuggingFace:
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus/resolve/main/marco/bert/train

There is no data at that URL, so I want to ask if there are alternative routes to get the MS MARCO data.

How to reproduce co-condenser-marco?

Hi,
Thank you for your great work!
I encountered some issues when trying to reproduce the results of co-condenser-marco on MS MARCO passage. I have followed the tutorial, but still cannot solve them (the problem seems to be in the hard negative mining step).

First, I run Fine-tuning Stage 1 with

CUDA_VISIBLE_DEVICES=3 python -m tevatron.driver.train \
  --output_dir model_msmarco_s1 \
  --model_name_or_path ../data/co-condenser-marco \
  --save_steps 20000 \
  --train_dir ../data/msmarco-passage/train_dir \
  --data_cache_dir ../data/msmarco-passage-train-cache \
  --fp16 \
  --dataloader_num_workers 2 \
  --per_device_train_batch_size 8 \
  --train_n_passages 8 \
  --learning_rate 5e-6 \
  --q_max_len 16 \
  --p_max_len 128 \
  --num_train_epochs 3 \
  --logging_steps 500 \

, and get MRR@10=0.3596, R@1000=0.9771. (Your reported results are MRR@10=0.357, R@1000=0.978).

Then, I run hard negative mining, randomly sampling 30 negatives from the top-200 retrieval results of model_msmarco_s1, by modifying scripts/hn_mining.py (following the parameters in build_train_hn.py).

Second, I run Fine-tuning Stage 2 with

CUDA_VISIBLE_DEVICES=3 python -m tevatron.driver.train \
  --output_dir model_msmarco_s2 \
  --model_name_or_path ../data/co-condenser-marco \
  --save_steps 20000 \
  --train_dir ../data/msmarco-passage/tain_dir_hn_dr_cocondenser200 \
  --data_cache_dir ../data/msmarco-passage-tain_hn_dr_cocondenser200-cache \
  --fp16 \
  --dataloader_num_workers 2 \
  --per_device_train_batch_size 8 \
  --train_n_passages 8 \
  --learning_rate 5e-6 \
  --q_max_len 16 \
  --p_max_len 128 \
  --num_train_epochs 2 \
  --logging_steps 500 \

, and get MRR@10=0.3657, R@1000=0.9761. (Your reported results are MRR@10=0.382, R@1000=0.984).

There are several points that I would like to confirm:

  1. Is the training data for Fine-tuning Stage 2 hard negatives only, not concatenated with the BM25 negatives?
  2. Are the initial parameters for Fine-tuning Stage 2 from co-condenser-marco, not from the model_msmarco_s1 checkpoint?
  3. What are the settings of per_device_train_batch_size, train_n_passages, learning_rate, and num_train_epochs in Fine-tuning Stage 2?

Thank you in advance!

ImportError: cannot import name 'TevatronTrainingArguments' from 'tevatron.arguments'

Hi, I get the following error when trying to run the training for unicoil:

File "/tevatron/examples/unicoil/train_unicoil.py", line 11, in
from tevatron.arguments import ModelArguments, DataArguments,
ImportError: cannot import name 'TevatronTrainingArguments' from 'tevatron.arguments' (/home/user/anaconda3/envs/teva_env/lib/python3.10/site-packages/tevatron/arguments.py)

why no index for Dense Retrieval models

Hi,

I was just wondering why we are not creating indexes for DPR models, and instead use tevatron.faiss_retriever right after encoding the queries and corpus.

Thanks

questions about negative sample

Thanks for the novel work.

I have a question: if I use a custom dataset with the same format as MS MARCO, how should I choose the negative samples for the dev/test set?

For example, suppose I want to find out whether BM25 negative samples are better than random samples in my field, and I use BM25 negative samples in the training stage. Should the negative samples for the dev/test set then be made by BM25 or sampled randomly?

Cannot reproduce MRR@10 of Luyu/co-condenser-marco-retriever

Hi @luyug ,
Thanks for your great job!
I've used the two-stage training described in examples/coCondenser-marco/README.md, but the final model only reaches MRR@10 36.2, less than the released model (Luyu/co-condenser-marco-retriever, 38.2).

Besides, do we need the negatives_x_device option when training a dual-encoder for MS MARCO?

how to improve the results with uniCOIL

Hi,
Thanks for the great work!

I ran experiments with the `modeling` in tevatron (but with a data loader implemented by myself) on msmarco-passage. For DenseModel, the result reaches MRR@10: 0.31+. But for uniCOIL, the result is only about MRR@10: 0.26+ (far from your result of 0.328).

In fact, I noticed that the implementation of uniCOIL in tevatron is somewhat different from the paper (https://github.com/luyug/COIL). For comparison, I also ran experiments with the code in https://github.com/luyug/COIL (dim=1 and no_cls), and got a similar result to the above.

Can you provide some insight on how to improve this result? For example, is there any special operation on the data for uniCOIL? I also tried initializing the model with distilbert as the example shows, but got a worse result.

How to reproduce results on Marco

Thanks for your great work! I noticed the training hyperparameters in your GitHub repo (training batch size, epochs, etc.) are different from those in your paper for training a dense retriever on MS MARCO. Could you provide the hyperparameters for reproducing the results in Table 3 of your paper? Thanks!

RunTimeError when training SPLADE - .get_world_size() issues

Hi,

I'm trying to train a SPLADE model using the guidelines at https://github.com/texttron/tevatron/tree/main/examples/splade,
but I am getting the following runtime error:

Traceback (most recent call last):
File "/home/src/tevatron/examples/splade/train_splade.py", line 135, in
main()
File "/home/src/tevatron/examples/splade/train_splade.py", line 116, in main
trainer = SpladeTrainer(
File "/home/src/tevatron/examples/splade/train_splade.py", line 31, in init
self.world_size = torch.distributed.get_world_size()
File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
return _get_group_size(group)
File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
default_pg = _get_default_group()
File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

code snippet from train_splade.py:

class SpladeTrainer(TevatronTrainer):
    def __init__(self, *args, **kwargs):
        super(SpladeTrainer, self).__init__(*args, **kwargs)
        self.world_size = torch.distributed.get_world_size()

Do you know why I am getting this error?

Thanks a lot in advance :)

how to replicate leaderboard number?

Hi,

I was able to replicate the MRR@10 that you reported in the paper (0.38), but I was wondering what the difference is between that and the number reported on the leaderboard (0.44)?
How do I replicate that? Is it on a different set?

Why scale the loss in DDP training

Dear Luyu,

Thank you for the great package simplifying the cumbersome DR training & evaluation process! I tried using it in DDP mode and read the code. I have a question regarding the loss:

In the DenseModel class, if the model is running in DDP, you scale the loss by world_size (the number of GPUs used). In DenseTrainer you unscale it after the backward pass. Why do you introduce this scaling & unscaling process?

Best regards,
Shi

Question about reproducing coCondenser-nq

Hi, @luyug.

Thanks for your awesome work and detailed guidelines.
I reproduced the model according to coCondenser-nq's README (https://github.com/texttron/tevatron/tree/main/examples/coCondenser-nq), but I got the following results (evaluated with pyserini):

Top5    accuracy: 0.3526315789473684                         
Top20   accuracy: 0.47700831024930745 
Top100  accuracy: 0.5833795013850416 

I think I made a mistake in one step, since the results are lower than the BM25 results. I sequentially executed the following scripts to train the model (the model co-condenser-wiki was downloaded from HuggingFace).

#prepare_data.sh

nq_train_path="/data2/private/xxx/DPR/downloads/data/retriever/nq-train.json" #biencoder-nq-train.json
output_path="/data2/private/xxx/condenser/nq-train/bm25.bert.json"
model_path="/data2/private/xxx/model/co-condenser-wiki"
hn_path="/data2/private/xxx/condenser/hn.json"
output_hn_path="/data2/private/xxx/condenser/nq-train/hn.bert.json"
python prepare_wiki_train.py --input $nq_train_path --output $output_path --tokenizer $model_path

python prepare_wiki_train.py --input $hn_path --output $output_hn_path --tokenizer $model_path
#train_nq.sh
CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
train_path="/data2/private/xxx/condenser/nq-train/"
output_path="/data2/private/xxx/condenser/model_nq3/"
cache_path="/data2/private/xxx/condenser/.cache/"
CUDA_VISIBLE_DEVICES=2,3,4,5 python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.driver.train \
  --output_dir $output_path \
  --model_name_or_path $CONDENSER_MODEL_NAME \
  --cache_dir $cache_path \
  --do_train \
  --save_steps 10000 \
  --train_dir $train_path \
  --fp16 \
  --per_device_train_batch_size 32 \
  --train_n_passages 2 \
  --learning_rate 5e-6 \
  --q_max_len 32 \
  --p_max_len 256 \
  --num_train_epochs 40 \
  --negatives_x_device \
  --positive_passage_no_shuffle \
  --untie_encoder \
  --grad_cache \
  --gc_p_chunk_size 24 \
  --gc_q_chunk_size 8

#encode_emb_passage.sh

OUTDIR="./temp"
wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
#"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"
CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
train_path="/data2/private/xxx/condenser/nq-train/"
model_path="/data2/private/xxx/condenser/model_nq3/"
cache_path="/data2/private/xxx/condenser/.cache/"
emb_nq_path="/data2/private/xxx/condenser/embeddings-nq"
emb_query_path="/data2/private/xxx/condenser/embeddings-nq-queries/"
query_path="/data2/private/xxx/condenser/nq-test-queries.json"
MODEL_DIR=nq-model

echo $1 #  $1 is the id of GPU
for s in $(seq -f "%02g" $2 $3) # 0 - 19
do
CUDA_VISIBLE_DEVICES=$1 python -m tevatron.driver.encode \
  --output_dir=$OUTDIR \
  --cache_dir $cache_path \
  --model_name_or_path $model_path/checkpoint-40000/passage_model \
  --tokenizer_name $model_path \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --p_max_len 256 \
  --dataset_proc_num 8 \
  --encode_in_path $wiki_dir/docs$s.json \
  --encoded_save_path $emb_nq_path/$s.pt \
  --encode_num_shard 20 \
  --passage_field_separator sep_token \
  --encode_shard_index $s
done
#encode_emb_query.sh

OUTDIR="./temp"
wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
#"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"
CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
train_path="/data2/private/xxx/condenser/nq-train/"
model_path="/data2/private/xxx/condenser/model_nq3/"
cache_path="/data2/private/xxx/condenser/.cache/"
emb_nq_path="/data2/private/xxx/condenser/embeddings-nq/"
emb_query_path="/data2/private/xxx/condenser/embeddings-nq-queries/"
query_path="/data2/private/xxx/condenser/nq-test-queries.json"
MODEL_DIR=nq-model


# query

CUDA_VISIBLE_DEVICES=$1 python -m tevatron.driver.encode \
  --output_dir=$OUTDIR \
  --model_name_or_path $model_path/checkpoint-40000/query_model \
  --tokenizer_name $model_path \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --q_max_len 32 \
  --dataset_proc_num 2 \
  --encode_in_path $query_path \
  --encoded_save_path $emb_query_path/query.pt \
  --encode_is_qry
#inference.sh

ENCODE_QRY_DIR="/data2/private/xxx/condenser/embeddings-nq-queries/"
ENCODE_DIR="/data2/private/xxx/condenser/embeddings-nq/"
DEPTH=200
RUN="/data2/private/xxx/condenser/run.nq.test.txt"
OUTDIR="./temp"
wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
#"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"

MODEL_DIR=nq-model
python -m tevatron.faiss_retriever \
--query_reps $ENCODE_QRY_DIR/query.pt \
--passage_reps $ENCODE_DIR/'*.pt' \
--depth $DEPTH \
--batch_size -1 \
--save_text \
--save_ranking_to $RUN
#eval.sh
RUN="/data2/private/xxx/condenser/run.nq.test.txt"
trec_out="/data2/private/xxx/condenser/run.nq.test.teIn"
json_out="/data2/private/xxx/condenser/run.nq.test.json"
python -m tevatron.utils.format.convert_result_to_trec \
    --input $RUN --output $trec_out


python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run --topics dpr-nq-test \
                                                                --index wikipedia-dpr \
                                                                --input $trec_out \
                                                                --output $json_out

python -m pyserini.eval.evaluate_dpr_retrieval --retrieval $json_out \
    --topk 5 20 100

Is there any parameter that I set incorrectly?

Thanks!

Question regarding coCondenser-marco fine-tuning

Hi,

Thanks for the codebase.

I have a question about the coCondenser-marco fine-tuning command here:

For example:

python -m tevatron.driver.train \  
  --output_dir ./retriever_model_s1 \  
  --model_name_or_path CONDENSER_MODEL_NAME \  
  --save_steps 20000 \  
  --train_dir ./marco/bert/train \
  --fp16 \  
  --per_device_train_batch_size 8 \  
  --learning_rate 5e-6 \  
  --num_train_epochs 3 \  
  --dataloader_num_workers 2

Is this (plus the command for Fine-tuning Stage 2) the right command to reproduce Luyu/co-condenser-marco-retriever?

It seems that this command only uses a single GPU with batch size 8, whereas the fine-tuning commands in coCondenser-nq use 4 GPUs with a larger batch size (32) and --negatives_x_device.
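
For reference, a rough sketch of the arithmetic behind that comparison (the numbers below are assumptions for illustration, not values taken from the paper or the repo's configs):

# Scenario A: single GPU, no --negatives_x_device.
bs_a, gpus_a, n_psg_a = 8, 1, 8         # 8 queries/device; 1 positive + 7 negatives each (assumed)
candidates_a = bs_a * gpus_a * n_psg_a  # 64 passages scored per query

# Scenario B: 4 GPUs with --negatives_x_device, where passage representations
# are gathered across devices before computing the in-batch softmax.
bs_b, gpus_b, n_psg_b = 32, 4, 2        # as in the coCondenser-nq command above
candidates_b = bs_b * gpus_b * n_psg_b  # 256 passages scored per query

print(candidates_a, candidates_b)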

Error when using the --untie_encoder param for DPR: config.json not in the correct path

Hi,

I had the following issue today:

When using the --untie_encoder param in this guide: https://github.com/texttron/tevatron/blob/main/examples/example_dpr.md
after training the DPR model, config.json is not included in the --output_dir.

This causes issues when trying to encode the corpus and queries.
As a workaround, I moved the config.json from either the passage_model/ or the query_model/ folder to the path stated in --output_dir.

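For reference, a minimal Python sketch of that workaround (the output path is hypothetical; passage_model/ and query_model/ are the sub-folders mentioned above):

import shutil
from pathlib import Path

output_dir = Path("./dpr_model")  # hypothetical --output_dir from training
# Copy the encoder config up one level so the encode driver can find it
# next to the checkpoint, as described in the post above.
shutil.copy(output_dir / "query_model" / "config.json", output_dir / "config.json")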

How to reproduce coCondenser-nq?

Hi, I encoded and searched according to coCondenser-nq's README, but my reproduced results are:

Top5    accuracy: 0.2889
Top20   accuracy: 0.4681
Top100  accuracy: 0.6343

Here are my bash scripts:

#encode_doc.sh
OUTDIR=/data/private/xxx/dataset/nq/new
MODEL_DIR=Luyu/co-condenser-wiki
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
 --config_name $MODEL_DIR \
 --output_dir $OUTDIR \
 --model_name_or_path $MODEL_DIR \
 --fp16 \
 --per_device_eval_batch_size 64 \
 --p_max_len 256 \
 --dataset_proc_num 8 \
 --dataset_name Tevatron/wikipedia-nq-corpus \
 --encoded_save_path /data/xxx/dataset/nq/$s.pt \
 --encode_num_shard 20 \
 --passage_field_separator sep_token \
 --encode_shard_index $s
done

#encode_query.sh
python -m tevatron.driver.encode \
 --output_dir=$OUTDIR \
 --model_name_or_path $MODEL_DIR \
 --config_name $MODEL_DIR \
 --fp16 \
 --per_device_eval_batch_size 64 \
 --q_max_len 32 \
 --dataset_proc_num 2 \
 --dataset_name Tevatron/wikipedia-nq/train \
 --encoded_save_path /data/private/xxx/dataset/nq/new/query_train.pt \
 --encode_is_qry
#search
DEPTH=200
python -m tevatron.faiss_retriever \
--query_reps /data/private/xxx/dataset/nq/new/query_test.pt \
--passage_reps /data/private/xxx/dataset/nq/new/'*.pt' \
--depth $DEPTH \
--batch_size 128 \
--save_text \
--save_ranking_to /data/private/xxx/dataset/nq/new/run.nq.test.txt

python -m tevatron.utils.format.convert_result_to_trec --input run.nq.test.txt --output run.nq.test.teIn

#eval
python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run --topics dpr-nq-test \
                                                               --index wikipedia-dpr \
                                                               --input run.nq.test.teIn \
                                                               --output run.nq.test.json

python -m pyserini.eval.evaluate_dpr_retrieval --retrieval run.nq.test.json --topk 5 20 100

How can I solve this problem?

How to reproduce co-condenser-marco?

Hi, I just followed example_msmarco.md using "Luyu/co-condenser-marco-retriever". My reproduced result is 37, but the result in the paper is 38.2. May I ask what I need to do to obtain the results in the paper?
Here are my commands:

CUDA_VISIBLE_DEVICES=0 python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path Luyu/co-condenser-marco-retriever \
  --fp16 \
  --per_device_eval_batch_size 1024 \
  --p_max_len 128 \
  --dataset_name Tevatron/msmarco-passage-corpus \
  --encoded_save_path c/corpus_emb.pkl \
  --encode_num_shard 1 \
  --encode_shard_index 0

CUDA_VISIBLE_DEVICES=0 python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path Luyu/co-condenser-marco-retriever \
  --fp16 \
  --per_device_eval_batch_size 1024 \
  --dataset_name Tevatron/msmarco-passage/dev \
  --encoded_save_path temp_out/query_emb.pkl \
  --q_max_len 32 \
  --encode_is_qry

python -m tevatron.faiss_retriever \
  --query_reps temp_out/query_emb.pkl \
  --passage_reps temp_out/corpus_emb.pkl \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to temp_out/rank.txt

python -m tevatron.utils.format.convert_result_to_marco \
  --input temp_out/rank.txt \
  --output temp_out/rank.txt.marco

python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset temp_out/rank.txt.marco
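
Separately, for anyone digging into the retrieval step above: a minimal, self-contained sketch of the flat inner-product search it performs (illustrative only; it assumes the encode step saved (embeddings, ids) pickles, which is an assumption about the file layout rather than a documented guarantee):

import pickle
import numpy as np
import faiss

# Load query and corpus representations (assumed (embeddings, ids) pickle layout).
with open("temp_out/query_emb.pkl", "rb") as f:
    q_reps, q_ids = pickle.load(f)
with open("temp_out/corpus_emb.pkl", "rb") as f:
    p_reps, p_ids = pickle.load(f)

q_reps = np.asarray(q_reps, dtype="float32")
p_reps = np.asarray(p_reps, dtype="float32")

# Exact inner-product search over a flat index, depth 100 as in the command above.
index = faiss.IndexFlatIP(p_reps.shape[1])
index.add(p_reps)
scores, indices = index.search(q_reps, 100)

# Map row indices back to passage ids, e.g. for the first query.
print(q_ids[0], [p_ids[i] for i in indices[0][:3]])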

How are GitHub releases, version tags, and release cycles handled?

It seems that only the initial 0.1.0 version was released on PyPI, and that version was not tagged on GitHub. There are also no release notes on GitHub.

My questions are:

  1. Will each release be tagged on GitHub (possibly with release notes) before being released to PyPI?
  2. What will be the release cycles (will each new feature warrant a minor version bump, or do you plan to bundle all changes to be released every X months)?
  3. Will this library be backward compatible before reaching version 1.0, and will there be feature deprecation warnings and/or two-minor-version deprecation cycles?

Related: #16

If automating the PyPI release via GitHub releases sounds interesting, I think this is a good place to start: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries

Add some documentations and example usages about two files in scripts folder

Hi, I really like this repo since it offers a simple yet powerful toolkit for reproducing existing models and conducting new experiments.

However, while browsing the source code I noticed that two files in the scripts folder (i.e. hn_mining.py and reduce_results.py) are isolated and not used by any other files in this repo. Could you please add some documentation and example usages for these two files so we can better understand how to use them? Thank you in advance!
