microsoft / ance

A novel embedding training algorithm leveraging ANN search and achieved SOTA retrieval on Trec DL 2019 and OpenQA benchmarks

License: MIT License

Shell 3.67% Python 92.44% Jupyter Notebook 3.88%

ance's Introduction

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Lee Xiong*, Chenyan Xiong*, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk

This repo provides the code for reproducing the experiments in Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we identify that the main bottleneck is in the training mechanisms, where the negative instances used in training are not representative of the irrelevant documents in testing. This paper presents Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is parallelly updated with the learning process to select more realistic negative training instances. This fundamentally resolves the discrepancy between the data distribution used in the training and testing of DR. In our experiments, ANCE boosts the BERT-Siamese DR model to outperform all competitive dense and sparse retrieval baselines. It nearly matches the accuracy of sparse-retrieval-and-BERT-reranking using dot-product in the ANCE-learned representation space and provides almost 100x speed-up.

Our analyses further confirm that the negatives from sparse retrieval or other sampling methods differ drastically from the actual negatives in DR, and that ANCE fundamentally resolves this mismatch. We also show the influence of the asynchronous ANN refreshing on learning convergence and demonstrate that the efficiency bottleneck is in the encoding update, not in the ANN part during ANCE training. These qualifications demonstrate the advantages, perhaps also the necessity, of our asynchronous ANCE learning in dense retrieval.
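The negative construction described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: exact dot-product search stands in for the ANN index, negatives are drawn uniformly at random from the top-k retrieved documents (an assumption), and the function name `sample_ann_negatives` and its parameters are hypothetical. The defaults of 200 and 20 mirror the example values given for `--topk_training` and `--negative_sample` in this README.

```python
import numpy as np


def sample_ann_negatives(query_emb, doc_embs, positive_ids, topk=200, n_neg=20, seed=0):
    """Pick negatives for one query from its top-k retrieved documents.

    Hypothetical helper: exact search replaces the ANN index, and uniform
    sampling from the top-k list is an assumption made for this sketch.
    """
    # Dot-product similarity between the query and every document.
    scores = doc_embs @ query_emb
    topk = min(topk, len(scores))
    # Indices of the top-k highest-scoring documents.
    top_ids = np.argsort(-scores)[:topk]
    # Known positives must never be used as negatives.
    candidates = [int(i) for i in top_ids if int(i) not in positive_ids]
    rng = np.random.default_rng(seed)
    n_neg = min(n_neg, len(candidates))
    return rng.choice(candidates, size=n_neg, replace=False).tolist()


# Toy corpus of random embeddings, just to exercise the function.
data_rng = np.random.default_rng(42)
doc_embs = data_rng.normal(size=(1000, 8)).astype(np.float32)
query_emb = data_rng.normal(size=8).astype(np.float32)
negatives = sample_ann_negatives(query_emb, doc_embs, positive_ids={3, 17})
print(len(negatives))
```

In ANCE the document embeddings come from the current model checkpoint, so the top-k list (and therefore the negatives) changes each time the index is refreshed.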

What's new

Requirements

To install requirements, run the following commands:

git clone https://github.com/microsoft/ANCE
cd ANCE
python setup.py install

Data Download

To download all the needed data, run:

bash commands/data_download.sh 

Data Preprocessing

The command to preprocess passage and document data is listed below:

python data/msmarco_data.py \
--data_dir $raw_data_dir \
--out_data_dir $preprocessed_data_dir \
--model_type {use rdot_nll for ANCE FirstP, rdot_nll_multi_chunk for ANCE MaxP} \
--model_name_or_path roberta-base \
--max_seq_length {use 512 for ANCE FirstP, 2048 for ANCE MaxP} \
--data_type {use 1 for passage, 0 for document}

This preprocessing command is also the first step of the training command file commands/run_train.sh.
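The placeholder choices in the preprocessing command can be made explicit with a small helper. This is only a sketch: `build_preprocess_args` and its `variant` and `data` parameters are hypothetical names, and the flag values are taken from the command template in this README.

```python
import shlex


def build_preprocess_args(raw_data_dir, out_data_dir, variant="FirstP", data="passage"):
    """Assemble the msmarco_data.py command line (hypothetical helper)."""
    if variant not in ("FirstP", "MaxP"):
        raise ValueError("variant must be 'FirstP' or 'MaxP'")
    if data not in ("passage", "document"):
        raise ValueError("data must be 'passage' or 'document'")
    # Values below follow the placeholders in the README command template.
    model_type = "rdot_nll" if variant == "FirstP" else "rdot_nll_multi_chunk"
    max_seq_length = 512 if variant == "FirstP" else 2048
    data_type = 1 if data == "passage" else 0
    return [
        "python", "data/msmarco_data.py",
        "--data_dir", raw_data_dir,
        "--out_data_dir", out_data_dir,
        "--model_type", model_type,
        "--model_name_or_path", "roberta-base",
        "--max_seq_length", str(max_seq_length),
        "--data_type", str(data_type),
    ]


# Example: the document MaxP configuration, printed as a shell command.
args = build_preprocess_args("raw_data", "preprocessed_data", variant="MaxP", data="document")
print(" ".join(shlex.quote(a) for a in args))
```

The resulting list can be passed to `subprocess.run(args, check=True)` from the repository root.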

Warmup for Training

ANCE training starts from a pretrained BM25 warmup checkpoint. The command to train this warmup checkpoint, with the parameters we used, is in commands/run_train_warmup.sh and is shown below:

    python3 -m torch.distributed.launch --nproc_per_node=1 ../drivers/run_warmup.py \
    --train_model_type rdot_nll \
    --model_name_or_path roberta-base \
    --task_name MSMarco \
    --do_train \
    --evaluate_during_training \
    --data_dir ${location of your raw data} \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=256 \
    --per_gpu_train_batch_size=32 \
    --learning_rate 2e-4  \
    --logging_steps 100   \
    --num_train_epochs 2.0  \
    --output_dir ${location for checkpoint saving} \
    --warmup_steps 1000  \
    --overwrite_output_dir \
    --save_steps 30000 \
    --gradient_accumulation_steps 1 \
    --expected_train_size 35000000 \
    --logging_steps_per_eval 1 \
    --fp16 \
    --optimizer lamb \
    --log_dir ~/tensorboard/${DLWS_JOB_ID}/logs/OSpass

Training

To train the model(s) in the paper, you need to start two commands in the following order:

  1. Run commands/run_train.sh, which does three things in sequence:

    a. Data preprocessing: this is explained in the Data Preprocessing section above. This step checks whether the preprocessed data folder exists and is skipped if it does.

    b. Initial ANN data generation: this step uses the pretrained BM25 warmup checkpoint to generate the initial training data. The command is as follows:

     python -m torch.distributed.launch --nproc_per_node=$gpu_no ../drivers/run_ann_data_gen.py \
     --training_dir {# checkpoint location, not used for initial data generation} \
     --init_model_dir {pretrained BM25 warmup checkpoint location} \
     --model_type rdot_nll \
     --output_dir $model_ann_data_dir \
     --cache_dir $model_ann_data_dir_cache \
     --data_dir $preprocessed_data_dir \
     --max_seq_length 512 \
     --per_gpu_eval_batch_size 16 \
     --topk_training {top k candidates for ANN search (e.g. 200)} \
     --negative_sample {negative samples per query (e.g. 20)} \
     --end_output_num 0 # only set as 0 for initial data generation, do not set this otherwise
    

    c. Training: ANCE training with the most recently generated ANN data. The command is as follows:

     python -m torch.distributed.launch --nproc_per_node=$gpu_no ../drivers/run_ann.py \
     --model_type rdot_nll \
     --model_name_or_path $pretrained_checkpoint_dir \
     --task_name MSMarco \
     --triplet {store_true flag, default = False, help="Whether to run training"} \
     --data_dir $preprocessed_data_dir \
     --ann_dir {location of the ANN generated training data} \
     --max_seq_length 512 \
     --per_gpu_train_batch_size=8 \
     --gradient_accumulation_steps 2 \
     --learning_rate 1e-6 \
     --output_dir $model_dir \
     --warmup_steps 5000 \
     --logging_steps 100 \
     --save_steps 10000 \
     --optimizer lamb
    
  2. Once training starts, start another job in parallel to fetch the latest checkpoint from the ongoing training and update the training data. To do that, run

     bash commands/run_ann_data_gen.sh
    

    The command is similar to the initial ANN data generation command explained previously.
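The second job has to find the newest checkpoint written by the training job. A minimal sketch of that lookup is shown below. It assumes checkpoints are saved as subdirectories named `checkpoint-<step>` inside the training output directory; that naming, and the helper name `latest_checkpoint`, are assumptions for illustration and may not match what the drivers actually write.

```python
import os
import re
import tempfile

# Assumed directory naming: checkpoint-<step>, e.g. checkpoint-20000.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(training_dir):
    """Return (step, path) of the highest-step checkpoint, or None if there is none."""
    best_step, best_path = -1, None
    for name in os.listdir(training_dir):
        match = CHECKPOINT_PATTERN.match(name)
        path = os.path.join(training_dir, name)
        if match and os.path.isdir(path):
            # Compare steps numerically: a string sort would rank 9000 above 20000.
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return (best_step, best_path) if best_path is not None else None


# Demonstration on a temporary directory with three fake checkpoints.
with tempfile.TemporaryDirectory() as demo_dir:
    for step in (10000, 20000, 9000):
        os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
    latest_step, _ = latest_checkpoint(demo_dir)
    print(latest_step)
```

A polling loop around this lookup would regenerate the ANN training data whenever the returned step increases.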

Inference

The command for computing query and passage/document embeddings is the same as the initial ANN data generation command described above, because inference is the first step of ANN data generation. However, you need to add --inference to the command so that the program stops after the initial inference step. commands/run_inference.sh provides a sample command.

Evaluation

The evaluation is done through "Calculate Metrics.ipynb". This notebook calculates the full ranking and reranking metrics used in the paper, including NDCG, MRR, hole rate, and recall, for the passage or document task and the dev or eval set specified by the user. To run it, define the following parameters at the beginning of the Jupyter notebook.

    checkpoint_path = {location of the dumped query and passage/document embeddings, which is output_dir from run_ann_data_gen.py}
    checkpoint =  {embedding from which checkpoint(ie: 200000)}
    data_type =  {0 for document, 1 for passage}
    test_set =  {0 for MSMARCO dev_set, 1 for TREC eval_set}
    raw_data_dir = 
    processed_data_dir = 
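For readers unfamiliar with the ranking metrics named above, here is a minimal sketch of two of them, MRR@k and Recall@k, using their standard definitions. It is not the notebook's code: the function names and the dictionary-based input format are assumptions made for illustration.

```python
def mrr_at_k(ranked, relevant, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, doc_ids in ranked.items():
        for rank, doc_id in enumerate(doc_ids[:k], start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked) if ranked else 0.0


def recall_at_k(ranked, relevant, k=1000):
    """Average fraction of each query's relevant documents found in the top k."""
    recalls = []
    for qid, doc_ids in ranked.items():
        rel = relevant.get(qid, set())
        if not rel:
            continue  # queries without judged positives are skipped
        recalls.append(len(rel & set(doc_ids[:k])) / len(rel))
    return sum(recalls) / len(recalls) if recalls else 0.0


# Toy run: q1 finds its positive at rank 2, q2 finds one of two positives at rank 3.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d4", "d2"]}
relevant = {"q1": {"d1"}, "q2": {"d2", "d5"}}
mrr = mrr_at_k(ranked, relevant, k=10)
recall = recall_at_k(ranked, relevant, k=3)
print(round(mrr, 4), recall)
```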

ANCE VS DPR on OpenQA Benchmarks

We also evaluate ANCE on the OpenQA benchmark used in a parallel work (DPR). At the time of our experiment, only the pre-processed NQ and TriviaQA data had been released. Our experiments use these two released tasks and inherit the DPR retriever evaluation. The evaluation uses Coverage@20/100, which measures whether the Top-20/100 retrieved passages include the answer. We explain the steps to reproduce our results on OpenQA Benchmarks in this section.
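The Coverage@k idea can be sketched as follows. This is an illustration only: answer matching here is a plain case-insensitive substring test, which is an assumption and simpler than the matching in the DPR evaluation that our experiments inherit. The function name and input format are hypothetical.

```python
def coverage_at_k(retrieved_passages, answers, k):
    """Fraction of questions whose top-k passages contain at least one gold answer.

    Simplification: case-insensitive substring matching (an assumption).
    """
    if not retrieved_passages:
        return 0.0
    hits = 0
    for qid, passages in retrieved_passages.items():
        golds = [a.lower() for a in answers[qid]]
        # A question is covered if any top-k passage contains any gold answer.
        if any(g in p.lower() for p in passages[:k] for g in golds):
            hits += 1
    return hits / len(retrieved_passages)


# Toy example with two questions and two retrieved passages each.
retrieved_passages = {
    "q1": ["Paris is the capital of France.", "Lyon is a city in France."],
    "q2": ["The Nile flows through Egypt.", "The Amazon is in South America."],
}
answers = {"q1": ["Paris"], "q2": ["Amazon River", "Amazon"]}
cov_top1 = coverage_at_k(retrieved_passages, answers, k=1)
cov_top2 = coverage_at_k(retrieved_passages, answers, k=2)
print(cov_top1, cov_top2)
```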

Download data

commands/data_download.sh takes care of this step.

ANN data generation & ANCE training

Following the same training philosophy discussed before, the ANN data generation and ANCE training for OpenQA require two parallel jobs.

  1. We need to preprocess data and generate an initial training set for ANCE to start training. The command for that is provided in:
commands/run_ann_data_gen_dpr.sh

We keep this data generation job running after it creates an initial training set as it will later keep generating training data with newest checkpoints from the training process.

  2. After an initial training set is generated, we start an ANCE training job with commands provided in:
commands/run_train_dpr.sh

During training, the evaluation metrics are written to TensorBoard each time the training job receives new training data. Alternatively, you can check the metrics in the dumped file "ann_ndcg_#" in the directory specified by "model_ann_data_dir" in commands/run_ann_data_gen_dpr.sh each time new training data is generated.

Results

The run_train.sh and run_ann_data_gen.sh files contain the commands with the parameters we used for passage ANCE(FirstP), document ANCE(FirstP) and document ANCE(MaxP). Our model achieves the following performance on the MSMARCO dev set and TREC eval set:

MSMARCO Dev Passage Retrieval        MRR@10    Recall@1k    Steps
ANCE(FirstP)                         0.330     0.959        600K
ANCE(MaxP)                           -         -            -

TREC DL Passage NDCG@10              Rerank    Retrieval    Steps
ANCE(FirstP)                         0.677     0.648        600K
ANCE(MaxP)                           -         -            -

TREC DL Document NDCG@10             Rerank    Retrieval    Steps
ANCE(FirstP)                         0.641     0.615        210K
ANCE(MaxP)                           0.671     0.628        139K

MSMARCO Dev Passage Retrieval        MRR@10    Steps
pretrained BM25 warmup checkpoint    0.311     60K

ANCE Single-task Training            Top-20    Top-100      Steps
NQ                                   81.9      87.5         136K
TriviaQA                             80.3      85.3         100K

ANCE Multi-task Training             Top-20    Top-100      Steps
NQ                                   82.1      87.9         300K
TriviaQA                             80.3      85.2         300K

Click the steps in the table to download the corresponding checkpoints.

Our result for document ANCE(FirstP) TREC eval set top 100 retrieved document per query could be downloaded here. Our result for document ANCE(MaxP) TREC eval set top 100 retrieved document per query could be downloaded here.

The TREC eval set query embedding and their ids for our passage ANCE(FirstP) experiment could be downloaded here. The TREC eval set query embedding and their ids for our document ANCE(FirstP) experiment could be downloaded here. The TREC eval set query embedding and their ids for our document 2048 ANCE(MaxP) experiment could be downloaded here.

The t-SNE plots for all the queries in the TREC document eval set for ANCE(FirstP) could be viewed here.

The run_train.sh and run_ann_data_gen.sh files contain the commands with the parameters we used for passage ANCE(FirstP), document ANCE(FirstP) and document 2048 ANCE(MaxP) to reproduce the results in this section. run_train_warmup.sh contains the commands to reproduce the results for the pretrained BM25 warmup checkpoint in this section.

Note that the number of steps needed to reproduce results similar to those in the table may differ slightly, due to differences in synchronization between the training and ANN data generation processes and other differences in the user's environment.

ance's Issues

module 'transformers' has no attribute 'TFRobertaDot_NLL_LN'

Hi,
I ran into a problem when running run_train_warmup.sh. The following error appeared:

Traceback (most recent call last):
  File "../drivers/run_warmup.py", line 758, in <module>
    main()
  File "../drivers/run_warmup.py", line 733, in main
    config, tokenizer, model, configObj = load_stuff(
  File "../drivers/run_warmup.py", line 312, in load_stuff
    model = configObj.model_class.from_pretrained(
  File "/home/coseven/anaconda3/lib/python3.8/site-packages/transformers-2.3.0-py3.8.egg/transformers/modeling_utils.py", line 432, in from_pretrained
    model = load_tf2_checkpoint_in_pytorch_model(model, resolved_archive_file, allow_missing_keys=True)
  File "/home/coseven/anaconda3/lib/python3.8/site-packages/transformers-2.3.0-py3.8.egg/transformers/modeling_tf_pytorch_utils.py", line 205, in load_tf2_checkpoint_in_pytorch_model
    tf_model_class = getattr(transformers, tf_model_class_name)
AttributeError: module 'transformers' has no attribute 'TFRobertaDot_NLL_LN'

My transformers version is 2.3.0.
Can you help me with this? I don't know what to do.
Looking forward to your reply.

data preprocess and inference

Hi,

I downloaded collectionandqueries.tar.gz, extracted the files to data/msmarco/ and ran

python data/msmarco_data.py --data_dir data/msmarco/ --out_data_dir data/msmarco_preprocessed --model_type rdot_nll --model_name_or_path roberta-base --max_seq_length 512 --data_type 1

The following was returned:

...
Process Process-55:
Traceback (most recent call last):
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home3/zhangkaitao/ANCE/utils/util.py", line 339, in tokenize_to_file
    with open(in_path, 'r', encoding='utf-8') if in_path[-2:] != "gz" else gzip.open(in_path, 'rt', encoding='utf8') as in_f,\
FileNotFoundError: [Errno 2] No such file or directory: 'data/msmarco/queries.train.shuf.tsv'
start merging splits
Traceback (most recent call last):
  File "data/msmarco_data.py", line 436, in <module>
    main()
  File "data/msmarco_data.py", line 432, in main
    preprocess(args)
  File "data/msmarco_data.py", line 212, in preprocess
    "train-qrel.tsv")
  File "data/msmarco_data.py", line 66, in write_query_rel
    out_query_path, 32, 8 + 4 + args.max_query_length * 4):
  File "/home3/zhangkaitao/ANCE/utils/util.py", line 246, in numbered_byte_file_generator
    with open('{}_split{}'.format(base_path, i), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/msmarco_preprocessed/train-query_split0'

Then I downloaded the passage_ance_firstP_checkpoint, changed the path in run_ann_data_gen.sh and ran

sh run_ann_data_gen.sh

and get this:

07/10/2020 21:56:35 - WARNING - __main__ -   Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True
Traceback (most recent call last):
  File "../drivers/run_ann_data_gen.py", line 698, in <module>
    main()
  File "../drivers/run_ann_data_gen.py", line 694, in main
    ann_data_gen(args)
  File "../drivers/run_ann_data_gen.py", line 662, in ann_data_gen
    training_positive_id, dev_positive_id = load_positive_ids(args)
  File "../drivers/run_ann_data_gen.py", line 79, in load_positive_ids
    with open(query_positive_id_path, 'r', encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$../data/msmarco_preprocessed/train-qrel.tsv'
07/10/2020 21:56:35 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True
07/10/2020 21:56:35 - INFO - __main__ -   starting output number 0
07/10/2020 21:56:35 - INFO - __main__ -   Loading query_2_pos_docid
Traceback (most recent call last):
  File "../drivers/run_ann_data_gen.py", line 698, in <module>
    main()
  File "../drivers/run_ann_data_gen.py", line 694, in main
    ann_data_gen(args)
  File "../drivers/run_ann_data_gen.py", line 662, in ann_data_gen
    training_positive_id, dev_positive_id = load_positive_ids(args)
  File "../drivers/run_ann_data_gen.py", line 79, in load_positive_ids
    with open(query_positive_id_path, 'r', encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$../data/msmarco_preprocessed/train-qrel.tsv'
Traceback (most recent call last):
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zhangkaitao/.conda/envs/new/bin/python', '-u', '../drivers/run_ann_data_gen.py', '--local_rank=3', '--training_dir', '../data/msmarco/OSPass512/', '--init_model_dir', '../data/Passage_ANCE_FirstP_Checkpoint/', '--model_type', 'rdot_nll', '--output_dir', '../data/msmarco/OSPass512/ann_data/', '--cache_dir', '../data/msmarco/OSPass512/ann_data/cache/', '--data_dir', '$../data/msmarco_preprocessed/', '--max_seq_length', '512', '--per_gpu_eval_batch_size', '16', '--topk_training', '200', '--negative_sample', '20']' returned non-zero exit status 1.

Then I moved qrels.train.tsv from data/msmarco/ to the preprocessed folder and renamed it to train-qrel.tsv, but that didn't help. Finally I tried

sh run_inference.sh

and get this:

07/10/2020 22:07:51 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True
07/10/2020 22:07:51 - INFO - __main__ -   starting output number 0
07/10/2020 22:07:51 - INFO - __main__ -   Loading query_2_pos_docid
Traceback (most recent call last):
  File "../drivers/run_ann_data_gen.py", line 698, in <module>
    main()
  File "../drivers/run_ann_data_gen.py", line 694, in main
    ann_data_gen(args)
  File "../drivers/run_ann_data_gen.py", line 662, in ann_data_gen
    training_positive_id, dev_positive_id = load_positive_ids(args)
  File "../drivers/run_ann_data_gen.py", line 79, in load_positive_ids
    with open(query_positive_id_path, 'r', encoding='utf8') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$../data/msmarco_preprocessed/train-qrel.tsv'
Traceback (most recent call last):
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/zhangkaitao/.conda/envs/new/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zhangkaitao/.conda/envs/new/bin/python', '-u', '../drivers/run_ann_data_gen.py', '--local_rank=3', '--training_dir', '../data/Passage_ANCE_FirstP_Checkpoint/', '--init_model_dir', '../data/Passage_ANCE_FirstP_Checkpoint/', '--model_type', 'rdot_nll', '--output_dir', '../data/msmarco/OSPass512/ann_data_inf/', '--cache_dir', '../data/msmarco/OSPass512/ann_data_inf/cache/', '--data_dir', '$../data/msmarco_preprocessed/', '--max_seq_length', '512', '--per_gpu_eval_batch_size', '16', '--topk_training', '200', '--negative_sample', '20', '--end_output_num', '0', '--inference']' returned non-zero exit status 1.

Could you help me with this? Thank you:)

Download link not working

Hi, the TREC eval set query embeddings and their ids linked in the README cannot be downloaded. Can you re-share them? Thank you!

This XML file does not appear to have any style information associated with it. The document tree is shown below.

ResourceNotFound
The specified resource does not exist. RequestId:7ad0d3db-f01e-000b-3423-611dbb000000 Time:2023-03-28T03:14:57.8906883Z

Using testset to test NDCG while training

Hi,

I found that while doing the TREC DL document task, the code in the msmarco.py processes "msmarco-test2019-queries.tsv" as the dev-query file.

    if args.data_type == 0:
        write_query_rel(
            args,
            pid2offset,
            "msmarco-doctrain-queries.tsv",
            "msmarco-doctrain-qrels.tsv",
            "train-query",
            "train-qrel.tsv")
        write_query_rel(
            args,
            pid2offset,
            "msmarco-test2019-queries.tsv",
            "2019qrels-docs.txt",
            "dev-query",
            "dev-qrel.tsv")

If I want to reproduce your work, is it okay to use the "msmarco-docdev-queries.tsv" as devset to select the best checkpoint?

PYTHONPATH should be set explicitly

Hello,

I get this error when I follow the instructions to run the code:

  File "drivers/run_warmup.py", line 14, in <module>
    from utils.eval_mrr import passage_dist_eval
ModuleNotFoundError: No module named 'utils'

Users need to run this command beforehand to avoid the error:

export PYTHONPATH=${PYTHONPATH}:`pwd`

I suggest adding that command explicitly in README.md.

where is bm25 introduced?

Hi,

For the warm-up step, I see a regular dense retrieval model training on the triples.small data provided by MSMarco.

But I don't find any code introducing bm25 index and bm25 sampling.
I guess you are treating triples.small data's negatives as bm25 negs already?

What does bm25 warm up mean? How is that introduced?

Thanks

ANCE encoders

Hello, I have a question about the BERT encoders. In the paper, it is said that "ANCE can be used to train any dense retrieval model. For simplicity, we use a simple set up in recent research (Luan et al., 2020) with BERT Siamese/Dual Encoder (shared between q and d), dot product similarity, and negative log likelihood (NLL) loss." So actually, only one encoder is used to encode queries and documents separately. However, in the "model.py", the "BiEncoder" is as follows:

class BiEncoder(nn.Module):
    """ Bi-Encoder model component. Encapsulates query/question and context/passage encoders.
    """
    def __init__(self, args):
        super(BiEncoder, self).__init__()
        self.question_model = HFBertEncoder.init_encoder(args)
        self.ctx_model = HFBertEncoder.init_encoder(args)

Two encoders are defined here.

Can't reproduce the performance of warmup(60k)

Hello, I used run_train_warmup.sh to train the warmup model and found that my model cannot match the performance of your released checkpoint (the pretrained BM25 warmup checkpoint's MRR@10 is 0.311), even when I train it for 300k steps (MRR@10 is 0.2979). All training hyperparameters are as given in run_train_warmup.sh. How can I handle this?

DPR checkpoint on MSMARCO?

Hello,

I recently read your wonderful ANCE paper. To the best of my knowledge, this is the only paper which included results of DPR trained on MS MARCO passage retrieval dataset. But I can only find your ANCE checkpoint in the repo.

Would you mind sharing the DPR checkpoint as well? Really appreciate your help!

How long does inference take?

Hello developers,
I followed the guidelines in your README to generate the dense representations for MS MARCO Document Ranking, using the MaxP checkpoint that you provide. My process has been running for more than 80 hours on a server with a Tesla T4 GPU and an Intel Xeon Platinum CPU (looking at htop, I observe that it is running with a single thread). Is such a long inference time normal? Am I missing something that would speed up this process?

Issue during downloading "roberta-base-config.json"

Hey,

I have a problem with running run_warmup.py.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../drivers/run_warmup.py", line 756, in <module>
    main()
  File "../drivers/run_warmup.py", line 732, in main
    args.train_model_type, args)
  File "../drivers/run_warmup.py", line 304, in load_stuff
    cache_dir=args.cache_dir if args.cache_dir else None,
  File "/scratch/h2amer/ahamsala/torch_DPR/lib/python3.6/site-packages/transformers/configuration_utils.py", line 176, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/scratch/h2amer/ahamsala/torch_DPR/lib/python3.6/site-packages/transformers/configuration_utils.py", line 243, in get_config_dict
    raise EnvironmentError(msg)
OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json' to download pretrained model configuration file.

CUDA nccl library issue

Hello,

I cloned this repository because I am interested in running the run_inference.sh command. I followed the steps listed in the readme. However, when I run run_inference, I got the following error

RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

My system has NCCL v2.7.8 correctly installed with the corresponding CUDA toolkit.

What am I missing here?

thanks in advance for the help.

best,

Franco Maria
