jingtaozhan / drhard

SIGIR'21: Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track.

License: BSD 3-Clause "New" or "Revised" License

Python 99.38% Shell 0.62%
information-retrieval pytorch web-search

drhard's Introduction

Hi there 👋 This is Jingtao Zhan.

  • 🌱 I’m a third-year PhD student at Tsinghua IR Group supervised by Prof. Shaoping Ma and Prof. Yiqun Liu.
  • 🔭 My research lies in Information Retrieval and Web Search. I currently focus on Dense Retrieval with a wide interest in improving its effectiveness, efficiency, and interpretability. The publications are available at my homepage.
  • 📫 Contact me via email or Twitter.

drhard's Issues

keyError of rankdict

Hi Jingtao,

When I train the STAR model, I get the following error:

File "./dataset.py", line 176, in __getitem__
hardpids = random.sample(self.rankdict[str(qid)], self.hard_num)
KeyError: '18337'

The valid keys of rankdict should be 1,2,...,6980. Am I right?
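
A defensive lookup can make this kind of failure easier to diagnose. The sketch below is not the repository's code: `sample_hard_negatives` is a hypothetical helper, and it assumes `rankdict` maps string query ids to lists of candidate passage ids, as the traceback suggests.

```python
import random

def sample_hard_negatives(rankdict, qid, hard_num, rng=random):
    """Sample hard negatives for qid, failing with a descriptive error.

    Hypothetical helper; assumes rankdict maps str(qid) -> list of passage ids.
    """
    key = str(qid)
    if key not in rankdict:
        raise KeyError(
            f"qid {key} has no entry in rankdict; the hard-negative file may "
            "cover a different query set than the one being trained on"
        )
    candidates = rankdict[key]
    # random.sample raises ValueError when asked for more items than exist,
    # so the request is capped at the number of available candidates.
    return rng.sample(candidates, min(hard_num, len(candidates)))

# Toy data for illustration only.
rankdict = {"1": [10, 11, 12], "2": [20, 21]}
print(sample_hard_negatives(rankdict, 1, 2))
```

Comparing `len(rankdict)` with the number of training queries is a quick way to check whether the two files were built from the same split.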

import error when do STAR inference

There is an ImportError when I am trying to replicate your work

$python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev
Traceback (most recent call last):
  File "./star/inference.py", line 15, in <module>
    from model import RobertaDot
  File "/home/yicheng.fyc/DRhard/./model.py", line 10, in <module>
    from transformers.modeling_roberta import RobertaPreTrainedModel
ImportError: cannot import name 'RobertaPreTrainedModel' from 'transformers.modeling_roberta' (/home/yicheng.fyc/miniconda2/envs/adore/lib/python3.8/site-packages/transformers-2.8.0-py3.8.egg/transformers/modeling_roberta.py)

It happens with both transformers 2.8.0 and 3.4.0.
Upgrading transformers to 4.2 fixes the import error, but it leads to a large gap in MRR.
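
One way to tolerate a class moving between library versions is to try several import locations in order. The sketch below is a generic helper, not part of the repository; `import_first_available` and the candidate list are assumptions, and the module paths should be checked against the installed transformers version. It only addresses the import error, not the MRR gap mentioned above.

```python
import importlib

def import_first_available(candidates):
    """Return the first attribute importable from a list of (module, name) pairs."""
    errors = []
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}.{attr}: {exc}")
    raise ImportError("none of the candidates could be imported:\n" + "\n".join(errors))

# Assumed locations of the class in older and newer transformers releases.
ROBERTA_CANDIDATES = [
    ("transformers.modeling_roberta", "RobertaPreTrainedModel"),
    ("transformers.models.roberta.modeling_roberta", "RobertaPreTrainedModel"),
]

# Usage (left commented out so the sketch runs without transformers installed):
# RobertaPreTrainedModel = import_first_available(ROBERTA_CANDIDATES)
```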

cannot find trained_models

Is there supposed to be a folder containing trained_models after I clone and install the code, or do I need to train the models myself? Thanks!

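About the transformers library version

Hello, thank you for open-sourcing the code!

While trying to run it, I noticed that the README says transformers 2.8.0 should be used because the tokenizer behaves inconsistently in versions 3 and above, yet setup.py pins transformers==3.4.0. Why is that? Which version of the transformers library should I use?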

How did you evaluate on trec 2019 test

Hi,

I can't find instructions for replicating the nDCG results on TREC 19.
Could you tell me how to run the evaluation on the TREC 19 test set?

Thanks.

Combined loss implementation

Hi, I am trying to understand how you combined the hard negative loss Ls with the in-batch random negative loss Lr. In the paper, the in-batch random negative loss is scaled by a hyperparameter alpha, but the value of alpha used in the experiments is not mentioned.

Following star/train.py I found the RobertaDot_InBatch model, whose forward function calls the inbatch_train method.

At the end of the inbatch_train method (line 182), I found

return ((first_loss + second_loss) / (first_num + second_num),)

which is different from the combined loss proposed in the paper (Eq. 13).

Am I missing something?

Also, for each query in the batch, did you consider all the possible in-batch random negatives or just one?

Thanks in advance!
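
To make the difference concrete, the sketch below contrasts the quoted return expression with an alpha-weighted sum. It rests on assumptions: `first_loss` and `second_loss` are treated as losses summed over `first_num` and `second_num` pairs, and the paper's combination is taken to be `Ls + alpha * Lr`, which is only what the description above implies. The numbers are illustrative, not from any experiment.

```python
def pooled_loss(first_loss, second_loss, first_num, second_num):
    """The quoted expression: summed losses divided by the total pair count."""
    return (first_loss + second_loss) / (first_num + second_num)

def alpha_weighted_loss(hard_mean, random_mean, alpha):
    """Assumed form of the paper's combination: Ls + alpha * Lr."""
    return hard_mean + alpha * random_mean

# Illustrative values only.
first_loss, first_num = 6.0, 4      # loss summed over 4 pairs
second_loss, second_num = 3.0, 8    # loss summed over 8 pairs

pooled = pooled_loss(first_loss, second_loss, first_num, second_num)

# The pooled value equals a count-weighted average of the two per-pair means,
# so the relative weight of each term is set by the pair counts, not by alpha.
mean_a = first_loss / first_num
mean_b = second_loss / second_num
weighted = (first_num * mean_a + second_num * mean_b) / (first_num + second_num)
print(pooled, weighted)
```

Under these assumptions, the pooled form behaves like a fixed weighting determined by how many pairs each term contributes.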

A possible bug in Data Process

Thanks for releasing such clean and easy-to-follow code! It is very helpful to dense retrieval researchers like me.

However, I may have found a bug in preprocess.py. The model is initialized from RoBERTa-base, whose native pad token id is 1, not 0. In this file, however, the pad token is set to 0 (the [CLS] token id), and as a result the pre-trained model checkpoint cannot reproduce the results reported in your paper.
https://github.com/jingtaozhan/DRhard/blob/main/preprocess.py#L23
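
For reference, here is a minimal padding sketch that uses the RoBERTa pad id. It is not the repository's `pad_input_ids`; the signature and the mask helper are assumptions made for illustration. In the roberta-base vocabulary, id 0 is `<s>`, id 1 is `<pad>`, and id 2 is `</s>`.

```python
ROBERTA_PAD_ID = 1  # <pad> in roberta-base; id 0 is the <s> token

def pad_input_ids(input_ids, max_length, pad_token=ROBERTA_PAD_ID):
    """Truncate or right-pad a list of token ids to exactly max_length."""
    padding = max_length - len(input_ids)
    if padding <= 0:
        return input_ids[:max_length]
    return input_ids + [pad_token] * padding

def attention_mask(real_length, max_length):
    """1 for real tokens, 0 for padding positions."""
    real = min(real_length, max_length)
    return [1] * real + [0] * (max_length - real)

# 0 and 2 are <s> and </s>; 9226 is an arbitrary token id.
ids = [0, 9226, 2]
print(pad_input_ids(ids, 5))      # padded with 1, not 0
print(attention_mask(len(ids), 5))
```

If the attention mask already zeroes out padding positions, the choice of pad id may matter less in practice; the source does not settle that question.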

RobertaDot_NLL_LN class not defined?

Hi Jingtao,

It seems that the architecture of the released ADORE model is not defined in your code.

The config.json file indicates that the model architecture is "RobertaDot_NLL_LN"; however, it does not seem to be defined in model.py.

Hard Negative generation

Hi Jingtao,

Thanks for open-sourcing this interesting work and the detailed implementation.

I have a question about hard negative generation: should the mode be train instead of dev?

python ./star/prepare_hardneg.py \
--data_type passage \
--max_query_length 32 \
--max_doc_length 256 \
--mode dev \
--topk 200

download_data.sh

There is one line missing at the end of download_data.sh:
gunzip docleaderboard-queries.tsv.gz

Training setup of ANCE and STAR

Hi, thank you for publishing the code for your interesting paper. I was just trying to reproduce STAR results in ANCE setup, i.e. I am using static hard negatives and in-batch negatives. But I am unable to achieve an MRR@10 score of 0.34. Also, the STAR checkpoint provided in this repo is not producing MRR@10 result of 0.34 when evaluated using ANCE repo. I am getting MRR@10 of 0.299 instead. I see there are some differences in the training setups in your repo and the ANCE one. Can you please highlight those?

Reproduce results

Hi Jingtao,

I am trying to reproduce the results shown in the README. The models were downloaded from Google Drive.
I used transformers 2.8.0 for preprocessing and 4.8.2 for inference.

I ran the following commands:
python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev
python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/star/dev.rank.tsv

And I got the following results:
Eval Started
#####################
MRR @10: 0.010382669304589082
QueriesRanked: 6980
#####################

Could you help me figure out what I did wrong? Thanks!
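
When a score looks implausibly low, computing the metric by hand on a few queries can help check whether the ids in the ranking file line up with the qrels. The sketch below is a minimal MRR@10 under stated assumptions; it is not the repository's msmarco_eval.py. It assumes binary relevance and averages over the queries present in the qrels.

```python
def mrr_at_k(qrels, rankings, k=10):
    """Mean reciprocal rank at cutoff k.

    qrels:    dict mapping qid -> set of relevant passage ids
    rankings: dict mapping qid -> list of passage ids, best first
    Queries with no relevant passage in the top k contribute 0.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, pid in enumerate(rankings.get(qid, [])[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels) if qrels else 0.0

# Toy data for illustration only.
qrels = {1: {10}, 2: {20}}
rankings = {1: [10, 11], 2: [21, 20]}
print(mrr_at_k(qrels, rankings))
```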

Why there is cnt variable in get_collate_function?

In https://github.com/jingtaozhan/DRhard/blob/dc17f3d1f7f59d13d15daa1a728dc8d6efc48b92/dataset.py, if we take a look at the data collator,

def get_collate_function(max_seq_length):
    cnt = 0
    def collate_function(batch):
        nonlocal cnt
        length = None
        if cnt < 10:
            length = max_seq_length
            cnt += 1

        input_ids = [x["input_ids"] for x in batch]
        attention_mask = [x["attention_mask"] for x in batch]
        data = {
            "input_ids": pack_tensor_2D(input_ids, default=1, 
                dtype=torch.int64, length=length),
            "attention_mask": pack_tensor_2D(attention_mask, default=0, 
                dtype=torch.int64, length=length),
        }
        ids = [x['id'] for x in batch]
        return data, ids
    return collate_function  

we see that there is a cnt variable that decides whether collate_function pads every batch to max_seq_length. I don't understand why it is needed. Could you please explain the significance of cnt?

Thank you
AM
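
The behaviour of the counter can be reproduced without PyTorch. The sketch below shows what the closure does, not why the author added it: `pack_2d` is a hypothetical list-based stand-in for `pack_tensor_2D`, and `warmup_batches` is an illustrative parameter replacing the hard-coded 10.

```python
def pack_2d(rows, default, length=None):
    """Right-pad each row with `default` to a common width.

    If length is None, the width is the longest row in the batch.
    """
    width = length if length is not None else max(len(r) for r in rows)
    return [list(r[:width]) + [default] * (width - len(r)) for r in rows]

def get_collate_function(max_seq_length, warmup_batches=10):
    cnt = 0  # number of batches padded to the full length so far

    def collate(batch):
        nonlocal cnt
        length = None
        if cnt < warmup_batches:
            # Early batches are padded to max_seq_length regardless of content.
            length = max_seq_length
            cnt += 1
        return pack_2d(batch, default=1, length=length)

    return collate

collate = get_collate_function(max_seq_length=8, warmup_batches=2)
batch = [[0, 5, 2], [0, 5, 6, 2]]
widths = [len(collate(batch)[0]) for _ in range(4)]
print(widths)  # [8, 8, 4, 4]
```

The first `warmup_batches` calls produce full-width batches; later calls pad only to the longest sequence in the batch.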

RepBERT

Are the passage embeddings in this project generated the same way as in RepBERT?

python precompute.py --load_model_path ./data/ckpt-350000 --task doc
python precompute.py --load_model_path ./data/ckpt-350000 --task query_dev.small
python precompute.py --load_model_path ./data/ckpt-350000 --task query_eval.small

Questions about Training pipeline

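Hello, while trying to reproduce your training, I found that the training pipeline in the code is not documented. Would you have time to put together some documentation for it?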

File format of msmarco-docs.tsv

Hello, is the format of msmarco-docs.tsv something like this?

https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR The hot glowing surfaces of...

It seems that parsing the leading URL directly as a qid with int() raises an error?

In preprocess.py:

def QueryPreprocessingFn(args, line, tokenizer):
    line_arr = line.split('\t')
    q_id = int(line_arr[0])

    passage = tokenizer.encode(
        line_arr[1].rstrip(),
        add_special_tokens=True,
        max_length=args.max_query_length,
        truncation=True)
    passage_len = min(len(passage), args.max_query_length)
    input_id_b = pad_input_ids(passage, args.max_query_length)

    return q_id, input_id_b, passage_len
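
A parser that validates the first column produces a clearer error than a bare int() call when a line has an unexpected layout. The sketch below is an illustration only: `parse_id_and_text` is a hypothetical helper, and it assumes the input is tab-separated with an integer id in the first column, as QueryPreprocessingFn expects.

```python
def parse_id_and_text(line):
    """Split a tab-separated line into (integer id, text).

    Raises ValueError with a descriptive message when the first column
    is not an integer, for example when it is a URL.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"expected at least 2 tab-separated fields, got {len(fields)}")
    try:
        item_id = int(fields[0])
    except ValueError:
        raise ValueError(
            f"first column is not an integer id: {fields[0][:60]!r}"
        ) from None
    return item_id, fields[1]

print(parse_id_and_text("42\twhat is dense retrieval\n"))
```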

download.sh

Hi Jingtao,
Thanks for open-sourcing this interesting work and the detailed implementation.
However, many of the datasets can no longer be found, for example:
--2024-05-21 11:42:20-- https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-queries.tsv.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-05-21 11:42:22 ERROR 404: The specified resource does not exist..

about the length of tokens

Hello,

I have read your paper and am quite interested in your work! I have a question about token lengths.
I notice that you truncate passages to 120 tokens in MSMARCO Passage Retrieval, whereas the original ANCE paper uses 512 tokens. Does the number of tokens affect accuracy?

Issue in importing RobertaPreTrainedModel during STAR inference

Hi, I am trying to replicate your work and ran into the following issue when executing
python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev

the error is:

Traceback (most recent call last):
  File "./star/inference.py", line 15, in <module>
    from model import RobertaDot
  File "./model.py", line 10, in <module>
    from transformers.modeling_roberta import RobertaPreTrainedModel
ImportError: cannot import name 'RobertaPreTrainedModel' from 'transformers.modeling_rob

This error happens with both transformers versions (2.8.0 and 3.4.0).

I found that ANCE uses RobertaForSequenceClassification in RobertaDot_NLL_LN. Can I use this class or some other RoBERTa subclass instead?

Best regards

model = RobertaDot.from_pretrained(model_path, config=config).cuda()

When I load the model file, I get the following message, which I don't understand:
Some weights of the model checkpoint at /home/tu/device/backup/wjjia/DRhard/adore-star were not used when initializing RobertaDot: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']

  • This IS expected if you are initializing RobertaDot from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing RobertaDot from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Some weights of RobertaDot were not initialized from the model checkpoint at /home/tu/device/backup/wjjia/DRhard/adore-star and are newly initialized: ['roberta.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
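
The two lists in this message are set differences between the parameter names stored in the checkpoint and those the model defines. The sketch below reproduces that comparison on plain lists of names; `diff_state_keys` is a hypothetical helper, and the shared key is illustrative. Whether the differences matter depends on which parameters the model's forward pass actually uses, which the message alone does not settle.

```python
def diff_state_keys(checkpoint_keys, model_keys):
    """Return (keys only in the checkpoint, keys only in the model)."""
    ckpt, model = set(checkpoint_keys), set(model_keys)
    return sorted(ckpt - model), sorted(model - ckpt)

# Names taken from the message above, plus one illustrative shared key.
checkpoint_keys = [
    "roberta.encoder.layer.0.output.dense.weight",
    "classifier.dense.weight",
    "roberta.pooler.dense.weight",
]
model_keys = [
    "roberta.encoder.layer.0.output.dense.weight",
    "roberta.embeddings.position_ids",
]

unused, newly_initialized = diff_state_keys(checkpoint_keys, model_keys)
print("in checkpoint but not used by the model:", unused)
print("in model but absent from the checkpoint:", newly_initialized)
```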

Evaluation on test passage dataset

Hello,
I found that the provided in-batch-neg model performs very poorly on the test dataset. Is the TREC DL Passage data a test dataset?
What should I do to reproduce the NDCG@10 and R@100 on the TREC DL dataset?

  • Command:
    python ./msmarco_eval.py ./data/passage/preprocess/test-qrel.tsv ./data/passage/evaluate/download_inbatch/test.rank.tsv

  • Results:
    Eval Started
    #####################
    MRR @10: 0.04559967102039234
    QueriesRanked: 43
    #####################
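
MRR@10 treats relevance as binary, whereas NDCG@10 is defined over graded labels, so the two are not directly comparable. The sketch below is a minimal NDCG@10 for a single query, under assumptions: it uses linear gain (the label itself divided by a log2 discount), and evaluation tools may differ in gain function and tie handling. It is not part of the repository.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_pids, rel_by_pid, k=10):
    """NDCG@k for one query.

    ranked_pids: passage ids ordered best first
    rel_by_pid:  dict mapping passage id -> graded relevance label
    Unjudged passages are treated as label 0.
    """
    gains = [rel_by_pid.get(pid, 0) for pid in ranked_pids[:k]]
    ideal = sorted(rel_by_pid.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy labels for illustration only.
labels = {"a": 3, "b": 1, "c": 0}
print(ndcg_at_k(["a", "b", "c"], labels))
print(ndcg_at_k(["c", "b", "a"], labels))
```

Averaging `ndcg_at_k` over all judged queries gives the collection-level score.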
