
License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

sigir19-bert-ir's Introduction

SIGIR19-BERT-IR: Deeper Text Understanding for IR with Contextual Neural Language Modeling

Repo of code and data for SIGIR-19 short paper "Deeper Text Understanding for IR with Contextual Neural Language Modeling"

Find the paper on arXiv

Abstract: Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently-proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural languages. Combining the text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.

Nov 27, 2019: We have received several very good questions in 'Issues'. Check there for information about data preprocessing/post-processing! -Zhuyun

Data

Data can be downloaded from our Virtual Appendix.

The input to the BERT re-ranker is a list of .trec.with_json files. Each line is in the form of:

qid Q0 docid rank score runname # {"doc": {"title": "...", "body": "..."}}

E.g., a document:

80 Q0 clueweb09-en0008-49-09144 1 -5.66498569 indri # {"doc": {"title": "Personal Keyboards reviews - Keyboard-Reviews.com", "body": "personal keyboards reviews accessories bass guitars , a..."}}

A passage:

80 Q0 clueweb09-en0008-49-09144_passage-0 1 -5.66498569 passage # {"doc": {"title": "Personal Keyboards reviews - Keyboard-Reviews.com", "body": "personal keyboards reviews..."}}
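A minimal sketch of reading one such line in Python (splitting the six trec fields from the JSON payload at the first '#'; the repo's own reader uses the equivalent '#'.join trick mentioned in the Issues below):

    import json

    def parse_with_json_line(line):
        # Split the six trec fields from the JSON payload at the first '#';
        # maxsplit=1 keeps any '#' inside the JSON payload intact.
        trec_part, payload = line.split("#", 1)
        qid, _, docid, rank, score, runname = trec_part.split()
        doc = json.loads(payload)["doc"]
        return qid, docid, float(score), doc["title"], doc["body"]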

We release these .trec.with_json files for ClueWeb09-B. We cannot release the contents of Robust04 documents, but here is a small sample of a Robust04 .trec.with_json file. As an alternative, we provide the initial rankings for ClueWeb09/Robust04 (.trec files). Each line is in the format:

qid Q0 docid rank score runname

You need to get the text contents of the candidate documents and append them to the .trec file in JSON format ({"doc": {"title": "...", "body": "..."}}).
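A minimal sketch of this step (the file names and the fetch_doc lookup are placeholders; you must supply access to your own copy of the corpus):

    import json

    def fetch_doc(docid):
        # Hypothetical lookup -- replace with access to your own copy of
        # ClueWeb09-B or Robust04; must return the title/body fields.
        return {"title": "...", "body": "..."}

    with open("initial_ranking.trec") as fin, \
            open("initial_ranking.trec.with_json", "w") as fout:
        for line in fin:
            qid, q0, docid, rank, score, runname = line.split()
            doc_json = json.dumps({"doc": fetch_doc(docid)})
            fout.write(f"{qid} {q0} {docid} {rank} {score} {runname} # {doc_json}\n")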

Once you have generated the .trec.with_json files for documents, you can use the provided passage generation script to generate passages; a sketch of the idea follows.
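The provided script is the reference implementation; as an illustration of the idea, here is a sketch that splits a body into overlapping fixed-length word windows (the 150-word window follows the paper; the 75-word stride is an assumption):

    def split_passages(body, window=150, stride=75):
        # Overlapping word windows; window=150 follows the paper,
        # stride=75 is assumed here, not taken from the provided script.
        words = body.split()
        last_start = max(len(words) - window, 0)
        return [" ".join(words[s:s + window])
                for s in range(0, last_start + 1, stride)]

    # Passage ids follow the naming in the example above:
    docid = "clueweb09-en0008-49-09144"
    for i, passage in enumerate(split_passages("personal keyboards reviews ...")):
        passage_id = f"{docid}_passage-{i}"  # e.g. ..._passage-0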

Google Colab notebooks to train BERT

You can upload the .trec.with_json files to a Google Cloud Storage bucket and directly run the notebooks:

  1. ClueWeb09-B Document Level Train/Inference (BERT-FirstP)
  2. ClueWeb09-B Passage Level Train/Inference (BERT-MaxP, BERT-SumP)
  3. Robust04 Document Level Train/Inference (BERT-FirstP)
  4. Robust04 Passage Level Train/Inference (BERT-MaxP, BERT-SumP)

The output is a file of scores for each document/passage. It needs to be aligned with the document/passage ids in the original .trec.with_json file. We provide scripts for this purpose.
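As a rough sketch of that alignment (file names are placeholders, and one score per line is assumed; adapt this if your output file holds per-class probabilities instead):

    from collections import defaultdict

    # Pair each score with the qid/docid on the same line of the input
    # file, then re-rank within each query by the BERT score.
    per_query = defaultdict(list)
    with open("test.trec.with_json") as fids, open("bert_scores.txt") as fscores:
        for line, score in zip(fids, fscores):
            qid, _, docid = line.split()[:3]
            per_query[qid].append((float(score), docid))

    with open("bert_rerank.trec", "w") as fout:
        for qid, scored in per_query.items():
            for rank, (s, docid) in enumerate(sorted(scored, reverse=True), 1):
                fout.write(f"{qid} Q0 {docid} {rank} {s} bert\n")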

Pre-trained Bing-augmented BERT Model

Some search tasks require both general text understanding (e.g., Honda is a motor company) and more specific search knowledge (e.g., people want to see special offers about Honda). While pre-trained BERT encodes general language patterns, search knowledge must be learned from labeled search data. We follow the domain adaptation setting from our WSDM 2018 Conv-KNRM work and augment BERT with search knowledge from a sample of the Bing search log.

The Bing-augmented BERT model can be downloaded from our Virtual Appendix.

sigir19-bert-ir's People

Contributors

adedzy


sigir19-bert-ir's Issues

Number of predict examples is not correct

Hi
In the inference phase, there is a line as follows:
predict_examples = processor.get_test_examples(TASK_DATA_DIR)
My TASK_DATA_DIR path and my xxx.trec.with_json file are both correct. In get_test_examples(self, data_dir), no line raises an error, and the JSON parsing also goes well with the line "json_dict = json.loads('#'.join(items[1:]))".

But in "examples.append( InputExample(guid=guid, text_a_list=q_text_list, text_b=d, label=label) )", examples only gets 1000 entries, although my xxx.trec.with_json has 9000 lines.
May I ask why it obtains only 1000 lines?

.trec.with_json

Hi, I have read the README.md several times but still don't know how to get the .trec.with_json files. Maybe I missed something. Can you tell me some details about it? Thanks.

Run query opt by Indri

Hello, your work is very creative. I am new to Indri. May I ask what configuration file or commands you used for indexing and querying in your experiments? Looking forward to your answer~

Why FLAGS.max_seq_length=128?

Hi!
I am confused about why FLAGS.max_seq_length is set to 128 in run_qe_classifier.py. The sliding window length is 150, which is longer than max_seq_length.

Thanks!

Where is nDCG evaluation?

In run_qe_classifier.py, only accuracy is computed when evaluating the model. Can you provide any code for the nDCG@20 evaluation in your paper?

Negative samples

Hello, I am a postgraduate student and I intend to reproduce this method. In your code you mention "for training, we use negative samples from the top 1000 documents". May I ask how these samples are selected?
