MarkedBERT

This repository contains the code for our SIGIR 2020 paper: MarkedBERT: Integrating Traditional IR Cues in Pre-trained Language Models for Passage Retrieval.

This code is no longer maintained. A new repository, using TensorFlow 2.0 for training on both GPU and Colab TPU, is available; the newly trained checkpoints are available on the Hugging Face Hub. Note that the new checkpoints are trained with more data and therefore lead to different performance.

First Stage: Doc2query Passage Expansion + BM25

We use traditional BM25 to retrieve an initial list of the top 1000 passages per query. To mitigate the "vocabulary mismatch" problem, we apply the Doc2query passage expansion technique of Nogueira et al. (2019). Here is the link to the github repo.
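
For illustration, here is a minimal sketch of this stage assuming Pyserini (Anserini's Python interface); the index path and file paths are placeholders, and the BM25 parameters are values commonly tuned for MS MARCO, not necessarily the ones used in the paper:

# First-stage retrieval sketch: BM25 over a doc2query-expanded index.
# The index and file paths below are placeholders.
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher('indexes/msmarco-passage-expanded')  # assumed index path
searcher.set_bm25(k1=0.82, b=0.68)  # commonly used MS MARCO BM25 settings

with open('Data/queries.dev.small.tsv') as queries, \
     open('Data/run.dev.small.tsv', 'w') as run:
    for line in queries:
        qid, query = line.rstrip('\n').split('\t')
        hits = searcher.search(query, k=1000)  # top 1000 passages per query
        for rank, hit in enumerate(hits, start=1):
            run.write(f'{qid}\t{hit.docid}\t{rank}\n')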

Second Stage: BERT re-ranking

Data preparation

First, we need to put the MS MARCO data in the appropriate format.

  • Links for downloading the MS MARCO corpus:
DATA_DIR=./Data
mkdir $DATA_DIR

wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/qrels.dev.small.tsv -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/queries.train.tsv -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz -P ${DATA_DIR}

tar -xvf ${DATA_DIR}/triples.train.small.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.dev.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/collection.tar.gz -C ${DATA_DIR}
  • Fine-tuning data: use the construct_dataset_msmarco.ipynb notebook to obtain a .csv file containing unique pairs from the triples.train.small.tsv file, with a balanced number of relevant/non-relevant pairs (a sketch of this conversion follows this list).

  • Inference data: once you have the run file from the first stage (download here), use the notebook to produce two files: the data file dataset.csv and the query-passage ids mapping file query_doc_ids.txt.
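
As a rough sketch of the fine-tuning data construction (the notebook is the reference), each training triple yields one relevant and one non-relevant pair, deduplicated; the output file name and column names below are assumptions:

# Flatten triples.train.small.tsv (query \t positive \t negative) into
# unique (query, passage, label) pairs; balanced by construction, since
# each triple contributes one relevant and one non-relevant pair.
import csv

seen = set()
with open('Data/triples.train.small.tsv') as triples, \
     open('Data/train_pairs.csv', 'w', newline='') as out:  # assumed name
    writer = csv.writer(out)
    writer.writerow(['query', 'passage', 'label'])
    for line in triples:
        query, positive, negative = line.rstrip('\n').split('\t')
        for passage, label in ((positive, 1), (negative, 0)):
            if (query, passage) not in seen:  # keep unique pairs only
                seen.add((query, passage))
                writer.writerow([query, passage, label])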

Base Model

A basic BERT-base re-ranker that uses the [CLS] token for classification. Use the script below to fine-tune this model and evaluate it:

python ./Modeling/modeling_base.py \
      --data_dir=$DATA_DIR \
      --output_dir=$OUTPUT_DIR \
      --max_seq_length=512 \
      --do_train \
      --do_eval \
      --per_gpu_eval_batch_size=32 \
      --per_gpu_train_batch_size=32 \
      --gradient_accumulation_steps=1 \
      --learning_rate=3e-6 \
      --weight_decay=0.01 \
      --adam_epsilon=1e-8 \
      --max_grad_norm=1.0 \
      --num_train_epochs=2 \
      --warmup_steps=10398 \
      --logging_steps=1000 \
      --save_steps=25996 \
      --seed=42 \
      --local_rank=-1 \
      --overwrite_output_dir
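
For intuition, this is roughly what the fine-tuned re-ranker does at inference time, sketched with the Hugging Face transformers API; the checkpoint path and example texts are placeholders, and modeling_base.py remains the reference implementation:

# Score a (query, passage) pair with a BERT-base re-ranker that
# classifies on the [CLS] representation. The checkpoint path is a
# placeholder for the $OUTPUT_DIR produced by modeling_base.py.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('path/to/output_dir')
model.eval()

query = 'what is the capital of france'
passage = 'Paris is the capital and most populous city of France.'
inputs = tokenizer(query, passage, truncation=True,
                   max_length=512, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1)[0, 1].item())  # P(relevant)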

MarkedBERT: Incorporating Exact Match signals via Markers

  1. We first need to mark both the training dataset and the dev set using this script (an illustrative sketch of the marking appears after these steps):
python ./Markers/simple_marker.py \
      --data_path=$path_to_dataset.csv \
      --output_path=$path_to_marked_data
  2. Fine-tune the BERT-base model using the marked data and evaluate it on the marked dev set:
python ./Modeling/modeling_markers.py \
      --data_dir=$DATA_DIR \
      --output_dir=$OUTPUT_DIR \
      --max_seq_length=512 \
      --do_train \
      --do_eval \
      --do_lower_case \
      --per_gpu_eval_batch_size=32 \
      --per_gpu_train_batch_size=32 \
      --gradient_accumulation_steps=1 \
      --learning_rate=3e-6 \
      --weight_decay=0.01 \
      --adam_epsilon=1e-8 \
      --max_grad_norm=1.0 \
      --num_train_epochs=4 \
      --warmup_steps=10398 \
      --logging_steps=1000 \
      --save_steps=25996 \
      --seed=42 \
      --local_rank=-1 \
      --overwrite_output_dir
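
To give an intuition of step 1, the sketch below wraps every query term that appears verbatim in the passage with indexed marker tokens, in both the query and the passage. The [e1] ... [/e1] marker strings are illustrative assumptions; simple_marker.py defines the actual marking scheme:

# Illustrative exact-match marking (marker tokens are assumptions).
import re

def mark(query, passage):
    passage_terms = set(passage.lower().split())
    marked_query, marked_passage = query, passage
    index = 0
    for term in dict.fromkeys(query.lower().split()):  # dedupe, keep order
        if term in passage_terms:
            index += 1
            pattern = re.compile(r'\b%s\b' % re.escape(term), re.IGNORECASE)
            replacement = r'[e%d] \g<0> [/e%d]' % (index, index)
            marked_query = pattern.sub(replacement, marked_query)
            marked_passage = pattern.sub(replacement, marked_passage)
    return marked_query, marked_passage

q, p = mark('cost of interior painting',
            'The average cost of interior painting is two dollars per square foot.')
print(q)
print(p)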

Evaluation

We use the Anserini evaluation script for MS MARCO. The evaluation_script.ipynb notebook illustrates the steps for downloading Anserini and using it in a Google Colab notebook to evaluate the run files obtained above.
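
The metric reported for the MS MARCO dev set is MRR@10; a minimal re-implementation of that computation is sketched below, assuming the run file uses the tab-separated MS MARCO format (qid, pid, rank) and averaging over the queries present in the run:

# MRR@10 sketch; file paths are placeholders.
from collections import defaultdict

qrels = defaultdict(set)
with open('Data/qrels.dev.small.tsv') as f:
    for line in f:
        qid, _, pid, rel = line.split()
        if int(rel) > 0:
            qrels[qid].add(pid)

run = defaultdict(list)
with open('run.dev.small.tsv') as f:  # re-ranked run file (placeholder)
    for line in f:
        qid, pid, rank = line.split()
        run[qid].append((int(rank), pid))

reciprocal_ranks = []
for qid, ranking in run.items():
    rr = 0.0
    for rank, pid in sorted(ranking)[:10]:  # only the top 10 count
        if pid in qrels[qid]:
            rr = 1.0 / rank
            break
    reciprocal_ranks.append(rr)

print(f'MRR@10: {sum(reciprocal_ranks) / len(reciprocal_ranks):.4f}')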


markers_bert's Issues

Error in Evaluation

I am not able to run the evaluation using the following command:

python ./Modeling/modeling_base.py \
      --data_dir=$DATA_DIR \
      --output_dir=$OUTPUT_DIR \
      --max_seq_length=512 \
      --do_eval \
      --per_gpu_eval_batch_size=32 \
      --per_gpu_train_batch_size=32 \
      --gradient_accumulation_steps=1 \
      --learning_rate=3e-6 \
      --weight_decay=0.01 \
      --adam_epsilon=1e-8 \
      --max_grad_norm=1.0 \
      --num_train_epochs=2 \
      --warmup_steps=10398 \
      --logging_steps=1000 \
      --save_steps=25996 \
      --seed=42 \
      --local_rank=-1 \
      --overwrite_output_dir

(Note that I have removed --do_train, since the model has already been trained.)

The following are the issues I see:
In the code handling evaluation, {args.data_dir}/doc2query_run/base/run_dev.csv is referenced. However, there is no run_dev.csv in the first place. We downloaded run.dev.small.tsv in the earlier steps. Is this the file being referenced?

Secondly, if run.dev.small.tsv is indeed the file being referenced, it is in a different format (it is not in the format query, text, label). The way the file is fed into the dataset doesn't seem to handle this difference in format, and hence the code fails to run.

Can someone help me out here? A quick response would be appreciated. Thanks!

Output of qtypes

Hi,

I just wanted to kindly ask you to share the query types that you extracted for the MS MARCO DEV queries during this research, if you still have access to them. I have tried working with qtypes, but there is a technical issue that causes the model not to detect any "Entity" type, even for the test examples in their repository. So I thought asking you could be a good idea.

Best regards,
Arian

Where is Data_processing?

from Data_processing import clean_text, write_to_tf_record

In the line linked above, the Data_processing module is referenced. However, there seems to be no such local module, nor any module with this name that can be installed via pip. This results in an error when running the code. Can you please help?
