
baidu_ultr_dataset's Introduction

A Large Scale Search Dataset from Baidu Search Engine

This repo contains the code and dataset accompanying the paper A Large Scale Search Dataset for Unbiased Learning to Rank.

Update

The dataset is currently inaccessible for download due to legal restrictions with Baidu. Should you require access to the dataset, please contact the author (zoulixin15 AT gmail.com) to request a private download link.

Dependencies

This code requires the following:

  • Python 3.6+
  • PyTorch 1.10.2 + CUDA 10.2

Quick Start

0. Prepare the corpus

Suppose you have downloaded the Web Search Session Data (training data) and annotation_data_0522.txt (test data) from Google Drive.

For those who cannot access Google Drive, we also provide alternative download sources for the training data, test data, and unigram dict.

First, move all the compressed (.gz) files into the directory './data/train_data/', e.g.,

mv yourpath/*.gz ./data/train_data/

Second, move the file part-00000.gz into './data/click_data/'; we will treat it as one of the validation sets.

mv ./data/train_data/part-00000.gz ./data/click_data/part-00000.gz

Finally, split the annotated data annotation_data_0522.txt into a test set and a validation set, and move both into the directory './data/annotate_data/':

mv test_data.txt ./data/annotate_data/
mv val_data.txt ./data/annotate_data/
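
The README does not say how annotation_data_0522.txt should be split. The following is a minimal sketch of one possible approach, not the authors' procedure: it splits by query so that all documents of a query end up in the same set. It assumes the file is tab-separated with the query in the second column (following the schema order Qid, Query, Title, Abstract, Label, Bucket). The function name `split_by_query`, the 20% validation fraction, and the seed are illustrative assumptions.

```python
import random
from typing import Iterable, List, Tuple


def split_by_query(
    lines: Iterable[str], val_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split annotated rows into (test, val) without sharing queries.

    Assumption: rows are tab-separated and column 1 holds the query
    (a "\x01"-separated token-id string). The query string, not the
    qid, is used as the key because some qids are duplicated.
    """
    rows = [ln for ln in lines if ln.strip()]  # drop blank lines
    queries = sorted({ln.split("\t")[1] for ln in rows})
    random.Random(seed).shuffle(queries)  # reproducible shuffle
    n_val = int(len(queries) * val_fraction)
    val_queries = set(queries[:n_val])
    # Keep the original row order inside each output set.
    val = [ln for ln in rows if ln.split("\t")[1] in val_queries]
    test = [ln for ln in rows if ln.split("\t")[1] not in val_queries]
    return test, val
```

The two returned lists can be written to test_data.txt and val_data.txt before running the mv commands above.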

Pretrain Transformer

We adopt two tasks, CTR prediction and MLM (masked language modeling), to pretrain the Transformer. For example, to pretrain a 12-layer Transformer, run the following command:

python pretrain.py --emb_dim 768 --nlayer 12 --nhead 12 --dropout 0.1

Training Baselines

We select five representative baselines to test this dataset: Naive, IPW, DLA, REM, and PairD. The code and hyper-parameter settings follow ULTR-Community. For example, to train the IPW baseline, run the following command:

python finetune.py --method_name IPW

For an explanation of the input parameters, refer to args.py.

The Pre-trained Language Model

You can download the pre-trained language model from the table below:

Head=12
Layer=12 Baidu_ULTR_Base_12L_12H_768Emb
Layer=6 Baidu_ULTR_Base_6L_12H_768Emb
Layer=3 Baidu_ULTR_Base_3L_12H_768Emb

Train Data --- Large Scale Web Search Session Data

The large-scale web search session data are available here. Each search session is organized as:

Qid, Query, Query Reformulation
Pos 1, URL MD5, Title, Abstract, Multimedia Type, Click, -, -, Skip, SERP Height, Displayed Time, Displayed Time Middle, First Click, Displayed Count, SERP's Max Show Height, Slipoff Count After Click, Dwelling Time , Displayed Time Top, SERP to Top , Displayed Count Top, Displayed Count Bottom, Slipoff Count, -, Final Click, Displayed Time Bottom, Click Count, Displayed Count, -, Last Click , Reverse Display Count, Displayed Count Middle, -
Pos 2, URL MD5, Title, Abstract, Multimedia Type, Click, -, -, Skip, SERP Height, Displayed Time, Displayed Time Middle, First Click, Displayed Count, SERP's Max Show Height, Slipoff Count After Click, Dwelling Time , Displayed Time Top, SERP to Top , Displayed Count Top, Displayed Count Bottom, Slipoff Count, -, Final Click, Displayed Time Bottom, Click Count, Displayed Count, -, Last Click , Reverse Display Count, Displayed Count Middle, -
......
Pos N, URL MD5, Title, Abstract, Multimedia Type, Click, -, -, Skip, SERP Height, Displayed Time, Displayed Time Middle, First Click, Displayed Count, SERP's Max Show Height, Slipoff Count After Click, Dwelling Time , Displayed Time Top, SERP to Top , Displayed Count Top, Displayed Count Bottom, Slipoff Count, -, Final Click, Displayed Time Bottom, Click Count, Displayed Count, -, Last Click , Reverse Display Count, Displayed Count Middle, -
# SERP is the abbreviation of search result page.
Column Id | Explanation | Remark
Qid | Query id. |
Query | The user-issued query. | Sequential token ids separated by "\x01".
Query Reformulation | The subsequent queries issued by users under the same search goal. | Sequential token ids separated by "\x01".
Pos | The document's display order on the screen. | [1,30]
Url_md5 | The MD5 hash identifying the URL. |
Title | The title of the document. | Sequential token ids separated by "\x01".
Abstract | A query-related brief introduction of the document, shown under the title. | Sequential token ids separated by "\x01".
Multimedia Type | The type of URL, for example, advertisement, video, map. | int
Click | Whether the user clicked the document. | [0,1]
- | - | -
- | - | -
Skip | Whether the user skipped the document on the screen. | [0,1]
SERP Height | The vertical pixels of the SERP on the screen. | Continuous Value
Displayed Time | The document's display time on the screen. | Continuous Value
Displayed Time Middle | The document's display time on the middle 1/3 of the screen. | Continuous Value
First Click | The identifier of the user's first click in a query. | [0,1]
Displayed Count | The document's display count on the screen. | Discrete Number
SERP's Max Show Height | The max vertical pixels of the SERP on the screen. | Continuous Value
Slipoff Count After Click | The count of slipoffs after the user clicks the document. | Discrete Number
Dwelling Time | The length of time a user spends looking at a document after clicking a link on a SERP, but before clicking back to the SERP results. | Continuous Value
Displayed Time Top | The document's display time on the top 1/3 of the screen. | Continuous Value
SERP to Top | The vertical pixels from the SERP to the top of the screen. | Continuous Value
Displayed Count Top | The document's display count on the top 1/3 of the screen. | Discrete Number
Displayed Count Bottom | The document's display count on the bottom 1/3 of the screen. | Discrete Number
Slipoff Count | The count of the document being slipped off the screen. |
- | - | -
Final Click | The identifier of the user's last click in a query session. |
Displayed Time Bottom | The document's display time on the bottom 1/3 of the screen. | Continuous Value
Click Count | The document's click count. | Discrete Number
Displayed Count | The document's display count on the screen. | Discrete Number
- | - | -
Last Click | The identifier of the user's last click in a query. | Discrete Number
Reverse Display Count | The document's display count when the user browses in reverse order, from bottom to top. | Discrete Number
Displayed Count Middle | The document's display count on the middle 1/3 of the screen. | Discrete Number
- | - | -
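
As a rough illustration of the session layout described above, the sketch below groups the lines of a session file into (query, documents) pairs. It is a sketch under stated assumptions, not the repository's loader: it assumes fields are tab-separated, that a query line has exactly three fields (Qid, Query, Query Reformulation), and that every other non-empty line is a document line. The names `QueryLine`, `DocLine`, and `iter_sessions` are hypothetical. Only the first six document columns are typed; the remaining columns are kept as raw strings, since several are unnamed and Displayed Count is listed twice.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple

TOKEN_SEP = "\x01"  # token-id separator documented in the schema


def split_tokens(field: str) -> List[int]:
    """Turn a "\x01"-separated token-id field into a list of ints."""
    return [int(tok) for tok in field.split(TOKEN_SEP) if tok]


@dataclass
class QueryLine:
    qid: str
    query: List[int]
    reformulation: List[int]


@dataclass
class DocLine:
    pos: int
    url_md5: str
    title: List[int]
    abstract: List[int]
    multimedia_type: str  # kept as a string; the schema says int
    click: int
    rest: List[str]       # remaining behaviour columns, unparsed


def parse_query_line(raw: bytes) -> QueryLine:
    qid, query, reform = raw.decode("utf-8").rstrip("\n").split("\t")
    return QueryLine(qid, split_tokens(query), split_tokens(reform))


def parse_doc_line(raw: bytes) -> DocLine:
    c = raw.decode("utf-8").rstrip("\n").split("\t")
    return DocLine(int(c[0]), c[1], split_tokens(c[2]),
                   split_tokens(c[3]), c[4], int(c[5]), c[6:])


def iter_sessions(
    lines: Iterable[bytes],
) -> Iterator[Tuple[QueryLine, List[DocLine]]]:
    """Yield each query line together with the document lines after it."""
    query, docs = None, []
    for raw in lines:
        if not raw.strip():
            continue
        if raw.count(b"\t") == 2:  # assumption: 3 fields => query line
            if query is not None:
                yield query, docs
            query, docs = parse_query_line(raw), []
        elif query is not None:
            docs.append(parse_doc_line(raw))
    if query is not None:
        yield query, docs
```

A file object returned by `gzip.open(path, "rb")` yields byte lines and can be passed to `iter_sessions` directly.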

Test Data --- Expert Annotation Dataset for Validation

The expert annotation dataset is available here. The schema of annotation_data_0522.txt:

Columns | Explanation | Remark
Qid | The unique id for every query. | Some queries (8% of queries) share the same qid, which might slightly influence the evaluation score. Please use the query itself as the identifier!
Query | The user-issued query. | Sequential token ids separated by "\x01".
Title | The title of the document. | Sequential token ids separated by "\x01".
Abstract | A query-related brief introduction of the document, shown under the title. | Sequential token ids separated by "\x01".
Label | Expert annotation label. | [0,4]
Bucket | Queries are split into 10 buckets in descending order of monthly search frequency, i.e., buckets 0, 1, and 2 are high-frequency queries while buckets 7, 8, and 9 are tail queries. | [0,9]
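
Because some qids are shared across different queries, grouping documents by the query field is the safer choice. Below is a minimal sketch of reading the annotation file that way. It assumes each row is tab-separated in the column order of the schema above. The names `AnnotatedDoc`, `group_by_query`, and `frequency_band` are hypothetical, and the "mid" label for buckets 3 to 6 is an assumption, since the schema only names the high-frequency and tail buckets.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class AnnotatedDoc:
    qid: str
    query: str           # raw token-id string, used as the grouping key
    title: List[int]
    abstract: List[int]
    label: int           # expert label in [0, 4]
    bucket: int          # frequency bucket in [0, 9]


def _tokens(field: str) -> List[int]:
    """Parse a "\x01"-separated token-id field."""
    return [int(t) for t in field.split("\x01") if t]


def parse_annotation_line(line: str) -> AnnotatedDoc:
    # Assumption: exactly six tab-separated columns per row.
    qid, query, title, abstract, label, bucket = line.rstrip("\n").split("\t")
    return AnnotatedDoc(qid, query, _tokens(title), _tokens(abstract),
                        int(label), int(bucket))


def group_by_query(lines: Iterable[str]) -> Dict[str, List[AnnotatedDoc]]:
    """Group rows by the query string rather than by the qid."""
    groups: Dict[str, List[AnnotatedDoc]] = defaultdict(list)
    for line in lines:
        if line.strip():
            doc = parse_annotation_line(line)
            groups[doc.query].append(doc)
    return dict(groups)


def frequency_band(bucket: int) -> str:
    """Map a bucket id to a coarse frequency band."""
    if not 0 <= bucket <= 9:
        raise ValueError(f"bucket out of range: {bucket}")
    if bucket <= 2:
        return "high"    # buckets 0-2: high-frequency queries
    if bucket >= 7:
        return "tail"    # buckets 7-9: tail queries
    return "mid"         # assumption: buckets 3-6 are unnamed in the schema
```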

The file unigram_dict_0510_tokens.txt is a unigram set that records the high-frequency words using their desensitized token ids.

If you use this dataset or our reproduced results, please cite:

  • A Large Scale Search Dataset for Unbiased Learning to Rank. Lixin Zou*, Haitao Mao*, Xiaokai Chu, Jiliang Tang, Wenwen Ye, Shuaiqiang Wang, and Dawei Yin. (*: equal contributions)

  • The BibTeX information is as follows:

@inproceedings{
    zou2022large,
    title={A Large Scale Search Dataset for Unbiased Learning to Rank},
    author={Lixin Zou and Haitao Mao and Xiaokai Chu and Jiliang Tang and Wenwen Ye and Shuaiqiang Wang and Dawei Yin},
    booktitle={NeurIPS 2022},
    year={2022}
}

Contact

To ask questions or report issues, please open an issue on the issue tracker.

baidu_ultr_dataset's People

Contributors

chuxiaokai, haitaomao, zoulixin93


baidu_ultr_dataset's Issues

Question about the Reformulated Queries

Hi,

It seems that the subsequent reformulated queries are not separated. Only the tokens are separated.

For example, here is the first line from part-00000

b'10000014169022957140\t4241\x015865\x013472\x0112631\x012962\x018468\x0116789\t4241\x015865\x013472\x0112631\x019066\x0112307\x017966\x016488\x016145\x012689\x014019\x0118161\x0121376\x014241\x015865\x013472\x0112631\x012962\x018468\x0116789\x0121376\x0115038\x0110191\x011251\x016488\x019066\x0112307\x017966\x016488\n'

The query id is given by

'10000014169022957140'

The query is given by a list of tokens, i.e.,

'4241\x015865\x013472\x0112631\x012962\x018468\x0116789'

The subsequent reformulated queries are also given by a list of tokens, i.e.,

'4241\x015865\x013472\x0112631\x019066\x0112307\x017966\x016488\x016145\x012689\x014019\x0118161\x0121376\x014241\x015865\x013472\x0112631\x012962\x018468\x0116789\x0121376\x0115038\x0110191\x011251\x016488\x019066\x0112307\x017966\x016488'

All those tokens are separated by '\x01'.

It seems that this list may include multiple subsequent reformulated queries, because the token 4241 appears twice. Am I correct?

If it is true, how can we split this list into multiple queries?

If possible, I would suggest that the authors include an id to indicate queries under the same search goal. Thanks!

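Some questions about task 2

Hello organizers, I would like to ask whether there are any restrictions on the methods used in the finetuning stage for task 2. My understanding is that, without such restrictions, task 1 and task 2 complement each other: a method that scores well on the task 1 debiasing leaderboard could also score well on the task 2 leaderboard, as long as the pretrained model stays the same 😂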

Download links not working.

Hi, I tried to download your dataset, but none of the links are working for me.
It downloads only empty zip archives.
Google Drive returns 404.

Document Order in Test Dataset

Thank you very much for making this dataset public. I have a quick question about the test dataset. I understand that for each query you take the top 30 documents from the search engine plus another ~30 from the top 1000. When these results are presented to the experts, is the order shuffled? Or are they shown in their original order? Thanks for your help.

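Question about the 'Displayed Count' feature

While exploring the data, I found that the training set contains two document feature columns with the same name, 'Displayed Count'. The column descriptions on the dataset web page also list 'Displayed Count' twice. However, when I checked the actual values, the two columns turned out not to be fully identical. What causes this, and which column's values should be treated as authoritative? Below are the results from part-00001.gz:
[image]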

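Bug encountered while running the code

Hello organizers, when I run unbiased_learning.py, the code raises the following error:
[image]

It looks like this may be because the BaseAlgorithm class inherits from ABC rather than nn.Module, so it has no state_dict.
How can this be resolved? Does the code framework need to be changed?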

where is finetune.py

In README

section Training Baselines
python finetune.py --method_name IPW

But I can't find the file finetune.py

How to use other features?

Hi, I find that we only use the query, title, and abstract to train the model.
But there are many other features in the dataset, including Continuous Value and Discrete Number features.

How could I use these features to finetune an unbiased LTR model such as dla?

It seems very strange to concat the fc3 output of Transformer4Ranking/model.py and the original feature values.

Hope to get your reply.
