
diffrec's Introduction

Diffusion Recommender Model

This is the PyTorch implementation of our paper at SIGIR 2023:

Diffusion Recommender Model

Wenjie Wang, Yiyan Xu, Fuli Feng, Xinyu Lin, Xiangnan He, Tat-Seng Chua

Environment

  • Anaconda 3
  • python 3.8.10
  • pytorch 1.12.0
  • numpy 1.22.3

Usage

Data

The experimental data are in the './datasets' folder, including Amazon-Book, Yelp, and MovieLens-1M. Note that the item embedding files of Amazon-Book for the clean and noisy settings are not included here due to file size limits; they are available on OneDrive. The item embeddings used in L-DiffRec are derived from a pre-trained LightGCN specific to each dataset.

Note that the results on ML-1M differ from those reported in CODIGEM owing to different data processing procedures: CODIGEM did not sort and split the training/testing sets by timestamp, whereas temporal splitting aligns better with real-world testing.
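
For reference, a temporal 7:1:2 split can be done roughly as follows. This is a minimal sketch, not the repo's actual preprocessing script; the file name and column layout are assumptions.

import numpy as np

# Hypothetical interaction log with rows of (user_id, item_id, timestamp).
interactions = np.load("interactions.npy")  # assumed file name, shape (N, 3)

# Sort all interactions globally by timestamp, then split 7:1:2.
interactions = interactions[np.argsort(interactions[:, 2])]
n = len(interactions)
train = interactions[: int(0.7 * n)]
valid = interactions[int(0.7 * n) : int(0.8 * n)]
test = interactions[int(0.8 * n) :]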

Training

To reproduce the results or fine-tune the hyperparameters, refer to the settings specified for each model in inference.py. Ensure that the hyperparameter 'noise_min' is set to a value lower than 'noise_max'.

DiffRec

cd ./DiffRec
python main.py --cuda --dataset=$1 --data_path=../datasets/$1/ --lr=$2 --weight_decay=$3 --batch_size=$4 --dims=$5 --emb_size=$6 --mean_type=$7 --steps=$8 --noise_scale=$9 --noise_min=${10} --noise_max=${11} --sampling_steps=${12} --reweight=${13} --log_name=${14} --round=${15} --gpu=${16}

or use run.sh

cd ./DiffRec
sh run.sh dataset lr weight_decay batch_size dims emb_size mean_type steps noise_scale noise_min noise_max sampling_steps reweight log_name round gpu_id

L-DiffRec

cd ./L-DiffRec
python main.py --cuda --dataset=$1 --data_path=../datasets/$1/ --emb_path=../datasets/ --lr1=$2 --lr2=$3 --wd1=$4 --wd2=$5 --batch_size=$6 --n_cate=$7 --in_dims=$8 --out_dims=$9 --lamda=${10} --mlp_dims=${11} --emb_size=${12} --mean_type=${13} --steps=${14} --noise_scale=${15} --noise_min=${16} --noise_max=${17} --sampling_steps=${18} --reweight=${19} --log_name=${20} --round=${21} --gpu=${22}

or use run.sh

cd ./L-DiffRec
sh run.sh dataset lr1 lr2 wd1 wd2 batch_size n_cate in_dims out_dims lamda mlp_dims emb_size mean_type steps noise_scale noise_min noise_max sampling_steps reweight log_name round gpu_id

T-DiffRec

cd ./T-DiffRec
python main.py --cuda --dataset=$1 --data_path=../datasets/$1/ --lr=$2 --weight_decay=$3 --batch_size=$4 --dims=$5 --emb_size=$6 --mean_type=$7 --steps=$8 --noise_scale=$9 --noise_min=${10} --noise_max=${11} --sampling_steps=${12} --reweight=${13} --w_min=${14} --w_max=${15} --log_name=${16} --round=${17} --gpu=${18}

or use run.sh

cd ./T-DiffRec
sh run.sh dataset lr weight_decay batch_size dims emb_size mean_type steps noise_scale noise_min noise_max sampling_steps reweight w_min w_max log_name round gpu_id

LT-DiffRec

cd ./L-DiffRec
python main.py --cuda --dataset=$1 --data_path=../datasets/$1/ --emb_path=../datasets/ --lr1=$2 --lr2=$3 --wd1=$4 --wd2=$5 --batch_size=$6 --n_cate=$7 --in_dims=$8 --out_dims=$9 --lamda=${10} --mlp_dims=${11} --emb_size=${12} --mean_type=${13} --steps=${14} --noise_scale=${15} --noise_min=${16} --noise_max=${17} --sampling_steps=${18} --reweight=${19} --w_min=${20} --w_max=${21} --log_name=${22} --round=${23} --gpu=${24}

or use run.sh

cd ./L-DiffRec
sh run.sh dataset lr1 lr2 wd1 wd2 batch_size n_cate in_dims out_dims lamda mlp_dims emb_size mean_type steps noise_scale noise_min noise_max sampling_steps reweight w_min w_max log_name round gpu_id

Inference

  1. Download the checkpoints released by us from OneDrive.
  2. Put the 'checkpoints' folder into the current folder.
  3. Run inference.py
python inference.py --dataset=$1 --gpu=$2

Examples

  1. Train DiffRec on Amazon-book under the clean setting
cd ./DiffRec
sh run.sh amazon-book_clean 5e-5 0 400 [1000] 10 x0 5 0.0001 0.0005 0.005 0 1 log 1 0
  2. Run inference with L-DiffRec on Yelp under the noisy setting
cd ./L-DiffRec
python inference.py --dataset=yelp_noisy --gpu=0

Citation

If you use our code, please kindly cite:

@inproceedings{wang2023diffrec,
title = {Diffusion Recommender Model},
author = {Wang, Wenjie and Xu, Yiyan and Feng, Fuli and Lin, Xinyu and He, Xiangnan and Chua, Tat-Seng},
booktitle = {Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {832--841},
publisher = {ACM},
year = {2023}
}

diffrec's People

Contributors

injadlu, ouxiang-li, wyuan1001, yiyanxu

diffrec's Issues

Dataset split

Hi,

I read in the paper that the sorted interactions are split into training, validation, and testing sets with a ratio of 7:1:2. But the validation set in this repository is clearly larger than the test set, more like 7:2:1. Is there a problem here?

Best.
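
A quick way to check the actual ratio is to count the rows in each split file. This is a minimal sketch; train_list.npy is referenced elsewhere in the issues, while the valid/test file names are assumptions about the repo's dataset layout.

import numpy as np

# Count interactions in each split (each file assumed to hold one pair per row).
for split in ("train_list.npy", "valid_list.npy", "test_list.npy"):
    data = np.load("datasets/ml-1m_clean/" + split, allow_pickle=True)
    print(split, len(data))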

Missing item_emb.npy in amazon-book_clean dataset

Excuse me, after I unrar amazon-book_clean.rar, I find that item_emb.npy is missing. Could you please upload the dataset again?

FileNotFoundError: [Errno 2] No such file or directory: '../datasets/amazon-book_clean/item_emb.npy'

How to get better results

I used the default hyperparameters: "!python main.py --cuda --dataset=ml-1m_clean --data_path=../datasets/ml-1m_clean/". The results are less than 0.1 and the loss is about 180.

Dataset loading failure

Hi, I tried to run the code with "sh run.sh amazon-book_clean 5e-5 0 400 [1000] 10 x0 5 0.0001 0.0005 0.005 0 1 log 1 0", but while loading the amazon-book dataset I got: ValueError: cannot reshape array of size 4566535 into shape (2283281,2).

ratio of ml-1m_clean

  1. Section 4.1.1, paragraph 2 of the paper states that it "splits the sorted interactions into training, validation, and testing sets with the ratio of 7:1:2", but in the downloaded ml-1m_clean dataset I found 403277, 110722, and 57532 records in train, valid, and test respectively, which is a 7:2:1 ratio.
  2. In DiffRec/L-DiffRec/main.py, should the third argument of the evaluate call on line 306 be mask_train rather than mask_tv?
  3. Looking at the ml-1m_clean settings in DiffRec/L-DiffRec/inference.py, I found that Recall and NDCG on the test set are clearly higher than on the validation set. Is this because splitting by timestamp (paper Section 4.1.1, paragraph 2) makes the training, validation, and test sets violate the i.i.d. assumption?

A Question about Implementation of Eq.4

Thanks for sharing your code. I have a question about the implementation of Eq. 4.

For the function betas_from_linear_variance in gaussian_diffusion.py, let the argument variance be $\gamma$ (the right-hand side of Eq. 4) and alpha_bar $= 1-\gamma$. The function thus aims to solve for $\beta$ given $\gamma$.

For Eq. 4: $1-\bar{\alpha}_{t} = 1-\alpha_1\alpha_2\cdots\alpha_t = 1-(1-\beta_1)(1-\beta_2)\cdots(1-\beta_t) = \gamma_t$

For $t=1$ in Eq. 4: $1-\bar{\alpha}_1 = 1-\alpha_1 = 1-(1-\beta_1) = \beta_1 = \gamma_1$ (third line of the function),

For $t=2$ in Eq. 4: $1-\bar{\alpha}_2 = 1-\alpha_1\alpha_2 = 1-(1-\beta_1)(1-\beta_2) = \gamma_2$,

thus $\beta_2 = 1-(1-\gamma_2)/(1-\beta_1) = 1-(1-\gamma_2)/(1-\gamma_1)$ (first execution of the for loop)

For $t=3$ in Eq. 4: $1-\bar{\alpha}_3 = 1-\alpha_1\alpha_2\alpha_3 = 1-(1-\beta_1)(1-\beta_2)(1-\beta_3) = \gamma_3$,

thus $\beta_3 = 1-(1-\gamma_3)/[(1-\beta_1)(1-\beta_2)]$, while the loop computes $1-(1-\gamma_3)/(1-\gamma_2)$ (second execution of the for loop)

However, $(1-\beta_1)(1-\beta_2) \neq 1-\gamma_2$; is a cumprod operation neglected?
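
For context, here is a plausible reconstruction of betas_from_linear_variance based on the recurrence described in this question; it is a sketch, not necessarily the repo's exact code.

import numpy as np

def betas_from_linear_variance(steps, variance, max_beta=0.999):
    # variance[t] plays the role of gamma_t = 1 - alpha_bar_t in Eq. 4;
    # variance is assumed to be a 1-D numpy array of length `steps`.
    alpha_bar = 1 - variance
    betas = [variance[0]]  # beta_1 = gamma_1
    for i in range(1, steps):
        # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
        betas.append(min(1 - alpha_bar[i] / alpha_bar[i - 1], max_beta))
    return np.array(betas)

Note that if alpha_bar here is read as the cumulative product $\bar{\alpha}_t$, the ratio of consecutive entries is exactly $\alpha_t$, and by induction $(1-\beta_1)(1-\beta_2) = 1-\gamma_2$, which would mean no explicit cumprod is needed.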

L-DiffRec betas out of range

user num: 108822, item num: 94949, data ready.
running k-means on cuda:0...
[k-means progress omitted: converged after 4 iterations, center_shift=0.000044]
category length: [9495, 85454]
Latent dims of each category: [[30], [270]]
Traceback (most recent call last):
  File "main.py", line 133, in <module>
    diffusion = gd.GaussianDiffusion(mean_type, args.noise_schedule,
  File "/media/wang/study/jhs/DiffRec-main/L-DiffRec/models/gaussian_diffusion.py", line 35, in __init__
    assert (self.betas > 0).all() and (self.betas <= 1).all(), "betas out of range"
AssertionError: betas out of range

May I ask the author: when I reproduce L-DiffRec with the default parameters, I get this error. I do not understand why; please explain.

How to understand the linear noise schedule (Eq. 4) in paper?

Notice that the author uses a new linear noise schedule instead of the linear or cosine schedules used in DDPM. The selection in the code is noise_schedule='linear-var', which corresponds to lines 303-309 of gaussian_diffusion.py, but I do not understand the correspondence between this code and Eq. 4 in the paper. I hope the author can help me.
Looking forward to your reply.
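
A plausible reading, assuming the 'linear-var' schedule interpolates the total variance $1-\bar{\alpha}_t$ of Eq. 4 linearly between noise_scale*noise_min and noise_scale*noise_max (a sketch under that assumption, not the repo's verified code):

import numpy as np

# Hypothetical values matching the CLI flags used in the Training section.
steps, noise_scale, noise_min, noise_max = 5, 0.0001, 0.0005, 0.005

# Assumed Eq. 4 reading: 1 - alpha_bar_t grows linearly with t.
variance = noise_scale * np.linspace(noise_min, noise_max, steps)
alpha_bar = 1 - variance
# Recover per-step betas from consecutive alpha_bar ratios,
# as in the sketch under the Eq. 4 issue above.
betas = np.append(variance[0], 1 - alpha_bar[1:] / alpha_bar[:-1])
print(betas)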

[Comparison of DiffRec and L-DiffRec] Which one is generally better?

As the title says, I am wondering whether L-DiffRec is generally better than DiffRec at a rather small scale.
In your paper, you have shown that L-DiffRec is better in the noisy environment. If you put L-DiffRec in Table 2, where would it rank among all the compared baselines? Would it generally surpass DiffRec?

Hyperparameters

Hi YiyanXu!

Thank you for your insightful work.

Can you share the set of hyperparameters that differs from the default values in the script, to reproduce the results on the ML-1M clean dataset?

How to generate train_list.npy

I do not see any code for generating the train_list.npy files for your datasets. Does this file record all user_id/item_id pairs with interaction records? Or should we only retain data that has been filtered by 5-core?
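
Judging from the reshape error reported in the "Dataset loading failure" issue above (array of size N reshaped to (N/2, 2)), train_list.npy plausibly holds one (user_id, item_id) pair per row. A hedged sketch of producing such a file; the pair values here are made up:

import numpy as np

# Made-up interaction pairs; in practice these come from your filtered log.
pairs = [(0, 12), (0, 57), (1, 3)]  # (user_id, item_id)
np.save("train_list.npy", np.array(pairs, dtype=np.int64))  # shape (num_interactions, 2)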

betas out of range in gaussian_diffusion.py", line 35

I am running "sh run.sh amazon-book_clean 5e-4 1e-4 0 0 400 2 [300] [] 0.05 [300] 10 x0 5 0.5 0.001 0.0005 0 1 log 1 0" for L-DiffRec, but it fails with negative betas that are out of range. I traced the code and found it uses "linear-var" as the noise_schedule. The printed beta values are: [5.00000000e-04, -6.25312656e-05, -6.25273557e-05, -6.25234463e-05, -6.25195374e-05]. Could you please help me check the problem?
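
Note that in this command noise_min (0.001) is larger than noise_max (0.0005), which violates the constraint stated in the Training section. Under the linear-var construction sketched in the Eq. 4 issue above (an assumed reading, not verified against the repo), a decreasing variance sequence makes every beta after the first negative, reproducing the printed values:

import numpy as np

# noise_scale * linspace(noise_min, noise_max, steps) with noise_min > noise_max.
variance = 0.5 * np.linspace(0.001, 0.0005, 5)  # decreasing sequence
alpha_bar = 1 - variance                        # increasing sequence
betas = [variance[0]] + [1 - alpha_bar[i] / alpha_bar[i - 1] for i in range(1, 5)]
print(betas)  # [0.0005, -6.253e-05, ...] -> triggers "betas out of range"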

How to generate item_emb.npy

I do not see any code for generating the item embeddings for your datasets. How were the item embeddings for the autoencoders created? Thanks.
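
For what it's worth, a minimal check of the released embedding file. The path comes from the FileNotFoundError quoted in an earlier issue; the shape comment is an assumption based on the Data section's note that the embeddings come from a pre-trained LightGCN.

import numpy as np

item_emb = np.load("../datasets/amazon-book_clean/item_emb.npy")
print(item_emb.shape)  # presumably (num_items, emb_dim) from the pre-trained LightGCN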
