This project forked from solidsea98/neural-corpus-indexer-nci

[👑 NeurIPS 2022 Outstanding Paper] A Neural Corpus Indexer for Document Retrieval -- NCI (Paper)

What is NCI?

NCI is an end-to-end, differentiable sequence-to-sequence document retrieval model that generates the identifiers of relevant documents directly from a query. In our evaluation on the Google NQ dataset and the TriviaQA dataset, NCI outperforms all baselines and model-based indexers:

Model                     Recall@1  Recall@10  Recall@100  MRR@100
NCI w/ qg-ft (Ensemble)   72.78     91.76      96.22       80.12
NCI (Ensemble)            70.46     89.35      94.75       77.82
NCI w/ qg-ft (Large)      68.65     88.45      94.53       76.10
NCI w/ qg-ft (Base)       68.91     88.48      94.48       76.17
NCI (Large)               66.23     85.27      92.49       73.37
NCI (Base)                65.86     85.20      92.42       73.12
DSI (T5-Base)             27.40     56.60      --          --
DSI (T5-Large)            35.60     62.60      --          --
SEAL (Large)              59.93     81.24      90.93       67.70
ANCE (MaxP)               52.63     80.38      91.31       62.84
BM25 + DocT5Query         35.43     61.83      76.92       44.47
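
For readers unfamiliar with the column headers, the following is a minimal sketch of how Recall@k and MRR@k are commonly computed. It assumes exactly one relevant document per query; the function names and the toy data are illustrative and are not taken from this repository's evaluation code.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold document appears in the top-k predictions."""
    hits = sum(1 for q, docs in ranked.items() if gold[q] in docs[:k])
    return hits / len(ranked)


def mrr_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Mean of 1/rank of the gold document, counting 0 if it is not in the top k."""
    total = 0.0
    for q, docs in ranked.items():
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id == gold[q]:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Toy data (hypothetical): ranked predictions per query and one gold docid each.
ranked = {
    "q1": ["d3", "d1", "d2"],  # gold at rank 2
    "q2": ["d5", "d4", "d6"],  # gold at rank 1
    "q3": ["d9", "d8", "d7"],  # gold not retrieved
}
gold = {"q1": "d1", "q2": "d5", "q3": "d0"}

print(recall_at_k(ranked, gold, 1))  # 1 of 3 queries hit at rank 1
print(recall_at_k(ranked, gold, 3))  # 2 of 3 queries hit within the top 3
print(mrr_at_k(ranked, gold, 3))     # (1/2 + 1 + 0) / 3
```

The table reports these values as percentages, so a Recall@1 of 72.78 corresponds to 0.7278 on the scale used in the sketch.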

For more information, check out our paper: https://arxiv.org/abs/2206.02743

Environment

[1] Install Anaconda.

[2] Clone repository:

git clone https://github.com/solidsea98/Neural-Corpus-Indexer-NCI.git
cd Neural-Corpus-Indexer-NCI

[3] Create conda environment:

conda env create -f environment.yml
conda activate NCI

[4] Docker:

If necessary, the NCI Docker image is mzmssg/corpus_env:latest.

Data Process

You can process data with NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.

[1] Dataset Download.

Currently, NCI is evaluated on the Google NQ dataset and the TriviaQA dataset. Please download them before re-training.

[2] Semantic Identifier

NCI uses content-based document identifiers. A pre-trained BERT model generates document embeddings, the documents are clustered with hierarchical k-means, and each document is assigned a semantic identifier based on its position in the cluster hierarchy. To ensemble NCI models, you can generate several sets of embeddings and semantic identifiers and train one model on each.
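
The idea of deriving identifiers from hierarchical k-means can be sketched as follows. This is a minimal pure-Python illustration, not the code in the notebooks: the function names, the `k` and `leaf_size` parameters, and the 2-D toy embeddings are all assumptions made for the example, and real embeddings would come from BERT.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def _sqdist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(vectors: List[Vector]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors: List[Vector], k: int, rng: random.Random, iters: int = 20) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster index per input vector."""
    k = min(k, len(vectors))
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: _sqdist(v, centers[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the previous center if a cluster went empty
                centers[c] = _mean(members)
    return labels


def assign_ids(doc_ids: List[str], embeddings: Dict[str, Vector], k: int,
               leaf_size: int, rng: random.Random,
               prefix: Tuple[int, ...] = ()) -> Dict[str, Tuple[int, ...]]:
    """Recursively cluster documents; the path of cluster indices is the identifier."""
    if len(doc_ids) <= leaf_size:
        # Small enough: number the documents directly inside this leaf.
        return {d: prefix + (i,) for i, d in enumerate(doc_ids)}
    labels = kmeans([embeddings[d] for d in doc_ids], k, rng)
    groups = [[d for d, lab in zip(doc_ids, labels) if lab == c] for c in range(k)]
    if sum(1 for g in groups if g) <= 1:
        # Could not split further (e.g. duplicate embeddings): treat as a leaf.
        return {d: prefix + (i,) for i, d in enumerate(doc_ids)}
    result: Dict[str, Tuple[int, ...]] = {}
    for c, group in enumerate(groups):
        if group:
            result.update(assign_ids(group, embeddings, k, leaf_size, rng, prefix + (c,)))
    return result


# Toy 2-D "embeddings" (hypothetical) forming two well-separated groups.
embeddings = {
    "doc_a": [0.0, 0.1], "doc_b": [0.1, 0.0], "doc_c": [0.2, 0.1],
    "doc_d": [5.0, 5.1], "doc_e": [5.1, 5.0], "doc_f": [5.2, 5.1],
}
ids = assign_ids(list(embeddings), embeddings, k=2, leaf_size=2, rng=random.Random(0))
for doc, ident in sorted(ids.items()):
    print(doc, ident)
```

Documents that land in the same cluster share an identifier prefix, which is what makes the identifiers content-based. Running the sketch with different random seeds or different embeddings yields different identifier sets, one per ensemble member.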

[3] Query Generation

In our study, query generation significantly improves retrieval performance, especially for long-tail queries.

NCI uses a docTTTTTquery checkpoint to generate synthetic queries. Fine-tuning that checkpoint first can improve the retrieval results further. The following command fine-tunes the model for 4k iterations to predict queries: it starts from checkpoint step 1004000 and stops at run.train_steps = 1008000. We assume you put the TSV training file at gs://your_bucket/qcontent_train_512.csv (download from above). Also change your_tpu_name, your_tpu_zone, your_project_id, and your_bucket accordingly.

t5_mesh_transformer  \
  --tpu="your_tpu_name" \
  --gcp_project="your_project_id" \
  --tpu_zone="your_tpu_zone" \
  --model_dir="gs://your_bucket/models/" \
  --gin_param="init_checkpoint = 'gs://your_bucket/model.ckpt-1004000'" \
  --gin_file="dataset.gin" \
  --gin_file="models/bi_v1.gin" \
  --gin_file="gs://t5-data/pretrained_models/base/operative_config.gin" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = 'gs://your_bucket/qcontent_train_512.csv'" \
  --gin_file="learning_rate_schedules/constant_0_001.gin" \
  --gin_param="run.train_steps = 1008000" \
  --gin_param="tokens_per_batch = 131072" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology ='v2-8'"

Please refer to the docTTTTTquery documentation for details.

Find more details in NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.

Training

Once data pre-processing is complete, launch training with train.sh. You can also train directly on our processed NQ data (download it to './Data_process/NQ_dataset/') or TriviaQA data (download it to './Data_process/trivia_dataset/').

Evaluation

Run infer.sh with our NQ checkpoint or TriviaQA checkpoint (download it to './NCI_model/logs/'). You can also run inference with your own checkpoint to evaluate model performance.

To ensemble on the NQ dataset or the TriviaQA dataset, use our results (download them to './NCI_model/logs/') or your own results.
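
As a rough illustration of what ensembling ranked results can look like, the sketch below merges several ranked lists with reciprocal rank fusion. This is a generic technique swapped in for illustration and is not necessarily the method used by this repository. The function name, the constant `k = 60`, and the toy runs are assumptions. Because each NCI model may use its own semantic identifiers, the sketch also assumes predictions have already been mapped back to the original document IDs.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(runs: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over all runs."""
    scores: Dict[str, float] = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy ranked lists (hypothetical) from three models for the same query.
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d1", "d4"]
run_c = ["d2", "d3", "d1"]

fused = reciprocal_rank_fusion([run_a, run_b, run_c])
print(fused)  # ['d2', 'd1', 'd3', 'd4']
```

Here d2 ranks first because two of the three runs place it at the top, while d4 ranks last because only one run retrieves it.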

Citation

If you find this work useful for your research, please cite:

@article{wang2022neural,
  title={A neural corpus indexer for document retrieval},
  author={Wang, Yujing and Hou, Yingyan and Wang, Haonan and Miao, Ziming and Wu, Shibin and Chen, Qi and Xia, Yuqing and Chi, Chengmin and Zhao, Guoshuai and Liu, Zheng and others},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={25600--25614},
  year={2022}
}

Acknowledgement

We learned a lot and borrowed some code from the following projects when building NCI.
