RecSys2024_CTR_Challenge

The RecSys 2024 Challenge: https://www.recsyschallenge.com/2024/

The Ekstra Bladet RecSys Challenge aims to predict which article a user will click on from a list of articles that were seen during a specific impression. Utilizing the user's click history, session details (like time and device used), and personal metadata (including gender and age), along with a list of candidate news articles listed in an impression log, the challenge's objective is to rank the candidate articles based on the user's personal preferences.
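
To make the ranking objective concrete, here is a minimal sketch that turns per-article scores into ranks within a single impression. The function name and the example scores are illustrative assumptions; they are not part of the challenge kit or of FuxiCTR.

```python
def rank_candidates(scores):
    """Return 1-based ranks for one impression (1 = highest score).

    `scores` holds one predicted click score per candidate article,
    in the order the candidates appear in the impression.
    """
    # Sort candidate positions by descending score; ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


# Hypothetical scores for three candidate articles in one impression.
print(rank_candidates([0.12, 0.80, 0.35]))  # [3, 1, 2]
```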

This baseline is built on top of FuxiCTR, a configurable, tunable, and reproducible library for CTR prediction. The library has been selected among the list of recommended evaluation frameworks by the ACM RecSys Conference. Using FuxiCTR, we developed a simple yet strong baseline (AUC: 0.7154) without heavy tuning. We open-source the code to help beginners get familiar with FuxiCTR and quickly get started on this task.

🔥 If you find our code helpful in your competition, please cite the following paper:

Data Preparation

Note that the dataset is quite large. Preparing the full dataset needs about 1 TB of disk space. Although some optimizations could save space (e.g., storing sequence features separately), we leave them for future exploration.
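
One way to read "storing sequence features separately" is to keep each user's history once in a lookup table and join it back at load time, instead of repeating it on every impression row. The sketch below illustrates that idea only; the field names `user_id` and `hist_seq` are hypothetical, it assumes one history per user within a split, and it is not how `prepare_data_v1.py` works.

```python
def split_sequences(rows):
    """Split flat rows into compact rows plus a per-user sequence table.

    Each input row is a dict with a `user_id`, a `hist_seq` list, and
    any other fields (hypothetical names). The history is stored once
    per user rather than once per impression row.
    """
    seq_table = {}
    compact = []
    for row in rows:
        # Keep the first sequence seen for each user.
        seq_table.setdefault(row["user_id"], row["hist_seq"])
        compact.append({k: v for k, v in row.items() if k != "hist_seq"})
    return compact, seq_table


def join_sequences(compact, seq_table):
    """Rebuild the flat rows by looking up each user's stored sequence."""
    return [dict(row, hist_seq=seq_table[row["user_id"]]) for row in compact]


rows = [
    {"user_id": "u1", "article_id": 10, "hist_seq": [1, 2, 3]},
    {"user_id": "u1", "article_id": 11, "hist_seq": [1, 2, 3]},
    {"user_id": "u2", "article_id": 10, "hist_seq": [7]},
]
compact, seq_table = split_sequences(rows)
print(len(seq_table))  # 2 stored sequences instead of 3
```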

  1. Download the datasets at: https://recsys.eb.dk/#dataset

  2. Unzip the data files into the following directory layout:

    cd ~/RecSys2024_CTR_Challenge/data/Ebnerd/
    find -L .
    
    .
    ./train
    ./train/history.parquet
    ./train/articles.parquet
    ./train/behaviors.parquet
    ./validation
    ./validation/history.parquet
    ./validation/behaviors.parquet
    ./test
    ./test/history.parquet
    ./test/articles.parquet
    ./test/behaviors.parquet
    ./image_embeddings.parquet
    ./contrastive_vector.parquet
    ./prepare_data_v1.py

  3. Convert the data to CSV format:

    cd ~/RecSys2024_CTR_Challenge/data/Ebnerd/
    python prepare_data_v1.py
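
Before running the conversion, it can help to confirm that the files from the listing above are in place. The following is a minimal sketch of such a check; the helper name `missing_files` is our own and the script is not part of the repository.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = [
    "train/history.parquet",
    "train/articles.parquet",
    "train/behaviors.parquet",
    "validation/history.parquet",
    "validation/behaviors.parquet",
    "test/history.parquet",
    "test/articles.parquet",
    "test/behaviors.parquet",
    "image_embeddings.parquet",
    "contrastive_vector.parquet",
    "prepare_data_v1.py",
]


def missing_files(root):
    """Return the expected files that do not exist under `root`."""
    root = Path(root).expanduser()
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("~/RecSys2024_CTR_Challenge/data/Ebnerd")
    print("All expected files found." if not missing else f"Missing: {missing}")
```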

Environment

Please set up the environment as follows. We ran the experiments on a P100 GPU server with 16 GB of GPU memory and 750 GB of RAM.

  • torch==1.10.2+cu113
  • fuxictr==2.2.3

    conda create -n fuxictr python==3.9
    source activate fuxictr
    pip install -r requirements.txt
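
After installation, the pinned versions can be checked from Python. This is a minimal sketch using the standard-library `importlib.metadata` module; the helper name `check_pins` is our own, and only the two pins listed above are checked.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above.
PINNED = {"torch": "1.10.2+cu113", "fuxictr": "2.2.3"}


def check_pins(pinned):
    """Map each package to (expected, installed); installed is None if absent."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


for name, (expected, installed) in check_pins(PINNED).items():
    status = "ok" if installed == expected else "MISMATCH"
    print(f"{name}: expected {expected}, found {installed} [{status}]")
```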

Version 1

  1. Train the model on train and validation sets:

    python run_param_tuner.py --config config/DIN_ebnerd_large_x1_tuner_config_01.yaml --gpu 0
    

    We get validation avgAUC: 0.7113. Note that in FuxiCTR, AUC is the global AUC, while avgAUC is averaged over impression ID groups.

  2. Make predictions on the test set:

    Get the experiment_id from the run logs or the result CSV file, then run prediction on the test set.

    python submit.py --config config/DIN_ebnerd_large_x1_tuner_config_01 --expid DIN_ebnerd_large_x1_001_1860e41e --gpu 1
    
  3. Make a submission. We get test AUC: 0.7154.
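
The difference between global AUC and avgAUC mentioned in step 1 can be illustrated with a small pure-Python sketch. It is not FuxiCTR's implementation: the function names are our own, and skipping groups that contain only one class is an assumption made here for simplicity.

```python
from collections import defaultdict


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. Returns None if only one class is present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def avg_auc(group_ids, labels, scores):
    """Mean of per-group AUCs, skipping single-class groups (assumption)."""
    groups = defaultdict(lambda: ([], []))
    for g, y, s in zip(group_ids, labels, scores):
        groups[g][0].append(y)
        groups[g][1].append(s)
    per_group = [auc(ys, ss) for ys, ss in groups.values()]
    per_group = [a for a in per_group if a is not None]
    return sum(per_group) / len(per_group) if per_group else None


# Two hypothetical impressions, each ranked perfectly on its own,
# but with scores on different scales.
impression_ids = ["a", "a", "b", "b"]
labels = [1, 0, 1, 0]
scores = [0.3, 0.2, 0.9, 0.8]
print(auc(labels, scores))                      # 0.75
print(avg_auc(impression_ids, labels, scores))  # 1.0
```

In this toy case every impression is ordered correctly, so the grouped metric is 1.0, while the global metric is lower because a negative from one impression outscores a positive from the other.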

Potential Improvements

  • To build the baseline, we simply reuse the DIN model, which is popular for sequential user interest modeling. We encourage exploring other alternatives for user behavior sequence modeling.
  • We currently consider only click behaviors and leave out other important signals such as reading times and scroll percentiles. It would be desirable to include them through multi-objective modeling.
  • We use contrastive vectors and image embeddings in a straightforward way. It would be interesting to explore other embedding features.
  • How to bridge user sequence modeling with large pretrained models (e.g., BERT, LLMs) is a promising direction to explore.

Discussion

We also welcome contributors to help improve the space and time efficiency of FuxiCTR for handling large-scale sequence datasets. If you have any questions, please feel free to open an issue.

recsys2024_ctr_challenge's Issues

Ebnerd datasets timestamp info

  • Behaviors
    Train (7 days): 2023-05-18 07:00:00 to 2023-05-25 06:59:59
    Valid (7 days): 2023-05-25 07:00:00 to 2023-06-01 06:59:59
    Test (7 days): 2023-06-01 07:00:00 to 2023-06-08 06:59:59
  • History
    Train (seq_len=[5, 2696]): 2023-04-27 07:00:00 to 2023-05-18 06:59:59
    Valid (seq_len=[5, 1752]): 2023-05-04 07:00:00 to 2023-05-25 06:59:59
    Test (seq_len=[5, 1530]): 2023-05-11 07:00:00 to 2023-06-01 06:59:59
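
The timestamps listed above imply that each history window covers 21 days and ends one second before the matching behaviors window begins. A minimal sketch that checks this with the standard library (the helper name `span_days` is our own):

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# (start, end) pairs copied from the listing above.
BEHAVIORS = {
    "train": ("2023-05-18 07:00:00", "2023-05-25 06:59:59"),
    "valid": ("2023-05-25 07:00:00", "2023-06-01 06:59:59"),
    "test": ("2023-06-01 07:00:00", "2023-06-08 06:59:59"),
}
HISTORY = {
    "train": ("2023-04-27 07:00:00", "2023-05-18 06:59:59"),
    "valid": ("2023-05-04 07:00:00", "2023-05-25 06:59:59"),
    "test": ("2023-05-11 07:00:00", "2023-06-01 06:59:59"),
}


def span_days(start, end):
    """Length in days of a window whose end timestamp is inclusive."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return (delta + timedelta(seconds=1)) / timedelta(days=1)


for split in BEHAVIORS:
    hist_end = datetime.strptime(HISTORY[split][1], FMT)
    beh_start = datetime.strptime(BEHAVIORS[split][0], FMT)
    # Prints: split name, history days, behaviors days, gap between them.
    print(split, span_days(*HISTORY[split]), span_days(*BEHAVIORS[split]),
          beh_start - hist_end)
```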

Train Model Error

When we train your model on the small train and validation sets, an error occurs. Exceptions are thrown while the program preprocesses test.csv. However, we found that all entries in the click column are '0' (not 'false'). Can you solve this for us?😭

Problems encountered in the prediction process

When executing submit.py, the model parameters loaded from "checkpoints/ebnerd_large_x1_2ed787f6/DIN_ebnerd_large_x1_001_1860e41e.model" are inconsistent with the DIN model structure.
