rucaibox / recbole

A unified, comprehensive and efficient recommendation library

Home Page: https://recbole.io/

License: MIT License

Python 81.94% Shell 0.08% Jupyter Notebook 17.98%
recommender-systems collaborative-filtering knowledge-graph ctr-prediction deep-learning pytorch graph-neural-networks sequential-recommendation recommendations recommendation-system

recbole's Introduction

RecBole Logo


RecBole (伯乐)

"Only when there is a Bole in the world will there be thousand-li horses. Thousand-li horses are common, but a Bole is rare." — Han Yu, "On Horses" (马说)


HomePage | Docs | Datasets | Paper | Blogs | Models | 中文版

RecBole is a unified, comprehensive and efficient framework, built on Python and PyTorch, for reproducing and developing recommendation algorithms for research purposes. Our library includes 91 recommendation algorithms, covering four major categories:

  • General Recommendation
  • Sequential Recommendation
  • Context-aware Recommendation
  • Knowledge-based Recommendation

We design a unified and flexible data file format, and provide support for 43 benchmark recommendation datasets. A user can apply the provided scripts to process the original data, or simply download the datasets already processed by our team.
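For illustration, an atomic .inter file is a tab-separated file whose header carries a type suffix for each field; a minimal sample in the style of ml-100k (the rows shown are from the public MovieLens-100K data):

user_id:token	item_id:token	rating:float	timestamp:float
196	242	3	881250949
186	302	3	891717742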

RecBole v0.1 architecture
Figure: RecBole Overall Architecture

To support the study of recent advances in recommender systems, we have constructed an extended recommendation library, RecBole 2.0, consisting of 8 packages for up-to-date topics and architectures (e.g., debiasing, fairness and GNNs).

Features

  • General and extensible data structure. We design general and extensible data structures to unify the formatting and usage of various recommendation datasets.

  • Comprehensive benchmark models and datasets. We implement 78 commonly used recommendation algorithms and provide formatted copies of 28 recommendation datasets.

  • Efficient GPU-accelerated execution. We optimize the efficiency of our library with a number of improved techniques oriented to the GPU environment.

  • Extensive and standard evaluation protocols. We support a series of widely adopted evaluation protocols or settings for testing and comparing recommendation algorithms.

RecBole News

11/01/2023: We release RecBole v1.2.0.

11/06/2022: We release the optimal hyperparameters of our models and their tuning ranges.

10/05/2022: We release RecBole v1.1.1.

06/28/2022: We release RecBole 2.0 with 8 packages consisting of 65 newly implemented models.

02/25/2022: We release RecBole v1.0.1.

09/17/2021: We release RecBole v1.0.0.

03/22/2021: We release RecBole v0.2.1.

01/15/2021: We release RecBole v0.2.0.

12/10/2020: We release a series of Chinese blog posts introducing RecBole to beginners (continuously updated).

12/06/2020: We release RecBole v0.1.2.

11/29/2020: We conducted preliminary experiments to test the time and memory cost on three datasets of different sizes and provided the results for reference.

11/03/2020: We release the first version of RecBole v0.1.1.

Latest Update for SIGIR 2023 Submission

To better meet user requirements and contribute to the research community, we present a significant update of RecBole in the latest version, making it a more user-friendly and easy-to-use benchmark library for recommendation. We summarize these updates in "Towards a More User-Friendly and Easy-to-Use Benchmark Library for Recommender Systems" and submitted the paper to SIGIR 2023. The main contributions of this update are introduced below.

Our extensions are made in three major aspects, namely the models/datasets, the framework, and the configurations. Furthermore, we provide more comprehensive documentation and a well-organized FAQ for the usage of our library, which largely improves the user experience. More specifically, the highlights of this update are summarized as follows:

  1. We introduce more operations and settings to facilitate benchmarking in the recommendation domain.

  2. We improve the user friendliness of our library by providing more detailed documentation and well-organized frequently asked questions.

  3. We point out several development guidelines for the open-source library developers.

These extensions make it much easier to reproduce the benchmark results and stay up to date with recent advances in recommender systems. The detailed comparison between this update and previous versions is listed below.

| Aspect | RecBole 1.0 | RecBole 2.0 | This update |
|--------|-------------|-------------|-------------|
| Recommendation tasks | 4 categories | 3 topics and 5 packages | 4 categories |
| Models and datasets | 73 models and 28 datasets | 65 models and 8 new datasets | 91 models and 43 datasets |
| Data structure | Implemented Dataset and Dataloader | Task-oriented | Compatible data module inherited from PyTorch |
| Continuous features | Field embedding | Field embedding | Field embedding and discretization |
| GPU-accelerated execution | Single-GPU utilization | Single-GPU utilization | Multi-GPU and mixed precision training |
| Hyper-parameter tuning | Serial gradient search | Serial gradient search | Three search methods in both serial and parallel |
| Significance test | - | - | Available interface |
| Benchmark results | - | Partially public (GNN and CDR) | Benchmark configurations on 82 models |
| Friendly usage | Documentation | Documentation | Improved documentation and FAQ page |

Installation

RecBole works with the following operating systems:

  • Linux
  • Windows 10
  • macOS X

RecBole requires Python version 3.7 or later.

RecBole requires torch version 1.7.0 or later. If you want to use RecBole with a GPU, please ensure that your CUDA or cudatoolkit version is 9.2 or later. This requires NVIDIA driver version >= 396.26 (for Linux) or >= 397.44 (for Windows 10).

Install from conda

conda install -c aibox recbole

Install from pip

pip install recbole

Install from source

git clone https://github.com/RUCAIBox/RecBole.git && cd RecBole
pip install -e . --verbose

Quick-Start

With the source code, you can use the provided script for initial usage of our library:

python run_recbole.py

This script will run the BPR model on the ml-100k dataset.

Typically, this example takes less than one minute and produces output like the following:

INFO ml-100k
The number of users: 944
Average actions of users: 106.04453870625663
The number of items: 1683
Average actions of items: 59.45303210463734
The number of inters: 100000
The sparsity of the dataset: 93.70575143257098%
INFO Evaluation Settings:
Group by user_id
Ordering: {'strategy': 'shuffle'}
Splitting: {'strategy': 'by_ratio', 'ratios': [0.8, 0.1, 0.1]}
Negative Sampling: {'strategy': 'full', 'distribution': 'uniform'}
INFO BPRMF(
    (user_embedding): Embedding(944, 64)
    (item_embedding): Embedding(1683, 64)
    (loss): BPRLoss()
)
Trainable parameters: 168128
INFO epoch 0 training [time: 0.27s, train loss: 27.7231]
INFO epoch 0 evaluating [time: 0.12s, valid_score: 0.021900]
INFO valid result:
recall@10: 0.0073  mrr@10: 0.0219  ndcg@10: 0.0093  hit@10: 0.0795  precision@10: 0.0088
...
INFO epoch 63 training [time: 0.19s, train loss: 4.7660]
INFO epoch 63 evaluating [time: 0.08s, valid_score: 0.394500]
INFO valid result:
recall@10: 0.2156  mrr@10: 0.3945  ndcg@10: 0.2332  hit@10: 0.7593  precision@10: 0.1591
INFO Finished training, best eval result in epoch 52
INFO Loading model structure and parameters from saved/***.pth
INFO best valid result:
recall@10: 0.2169  mrr@10: 0.4005  ndcg@10: 0.235  hit@10: 0.7582  precision@10: 0.1598
INFO test result:
recall@10: 0.2368  mrr@10: 0.4519  ndcg@10: 0.2768  hit@10: 0.7614  precision@10: 0.1901

If you want to change parameters such as learning_rate or embedding_size, just add command-line parameters as needed:

python run_recbole.py --learning_rate=0.0001 --embedding_size=128

If you want to change the model, just run the script with the corresponding parameter:

python run_recbole.py --model=[model_name]
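The same runs can also be launched programmatically. Below is a minimal sketch using the run_recbole entry point (the config_dict override mechanism is the same one used elsewhere on this page):

from recbole.quick_start import run_recbole

# Equivalent to the commands above: train and evaluate BPR on ml-100k,
# overriding two hyperparameters through config_dict.
run_recbole(
    model='BPR',
    dataset='ml-100k',
    config_dict={'learning_rate': 0.0001, 'embedding_size': 128},
)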

Auto-tuning Hyperparameter

Open RecBole/hyper.test and list the hyperparameters to search over. Two ways of defining the search space are supported:

  • loguniform: the parameter is sampled log-uniformly; with the range -8, 0 below, values are drawn between e^{-8} (≈ 0.000335) and e^{0} (= 1).
  • choice: indicates that the parameter takes discrete values from the setting list.

Here is an example for hyper.test:

learning_rate loguniform -8, 0
embedding_size choice [64, 96, 128]
train_batch_size choice [512, 1024, 2048]
mlp_hidden_size choice ['[64, 64, 64]','[128, 128]']

Set the training command parameters as needed and run:

python run_hyper.py --model=[model_name] --dataset=[data_name] --config_files=xxxx.yaml --params_file=hyper.test
e.g.
python run_hyper.py --model=BPR --dataset=ml-100k --config_files=test.yaml --params_file=hyper.test

Note that --config_files=test.yaml is optional; if you don't have any custom config settings, this parameter can be omitted.

This process may take a long time before outputting the best hyperparameters and results:

running parameters:                                                                                                                    
{'embedding_size': 64, 'learning_rate': 0.005947474154838498, 'mlp_hidden_size': '[64,64,64]', 'train_batch_size': 512}                
  0%|                                                                                           | 0/18 [00:00<?, ?trial/s, best loss=?]

More information about parameter tuning can be found in our docs.

Time and Memory Costs

We conducted preliminary experiments to test the time and memory cost on three datasets of different sizes (small, medium and large). For detailed information, you can follow the corresponding links.

NOTE: Our test results only give the approximate time and memory cost of our implementations in the RecBole library (based on our own server). Any feedback or suggestions about the implementations and tests are welcome. We will keep improving our implementations and updating these test results.

RecBole Major Releases

| Releases | Date |
|----------|------|
| v1.2.0 | 11/01/2023 |
| v1.1.1 | 10/05/2022 |
| v1.0.0 | 09/17/2021 |
| v0.2.0 | 01/15/2021 |
| v0.1.1 | 11/03/2020 |

Open Source Contributions

As a one-stop framework spanning data processing, model development, algorithm training and scientific evaluation, RecBole comprises a total of 11 related GitHub projects, listed below.

  • RecBole
  • RecBole2.0
  • RecBole-DA
  • RecBole-MetaRec
  • RecBole-Debias
  • RecBole-FairRec
  • RecBole-CDR
  • RecBole-GNN
  • RecBole-TRM
  • RecBole-PJF
  • RecSysDatasets

Contributing

Please let us know if you encounter a bug or have any suggestions by filing an issue.

We welcome all contributions from bug fixes to new features and extensions.

We expect all contributions to be discussed first in the issue tracker and then submitted through PRs.

We thank @tszumowski, @rowedenny, @deklanw et al. for their insightful suggestions.

We thank @rowedenny, @deklanw et al. for their nice contributions through PRs.

Cite

If you find RecBole useful for your research or development, please cite the following papers: RecBole[1.0], RecBole[2.0] and RecBole[1.2.0].

@inproceedings{recbole[1.0],
  author    = {Wayne Xin Zhao and Shanlei Mu and Yupeng Hou and Zihan Lin and Yushuo Chen and Xingyu Pan and Kaiyuan Li and Yujie Lu and Hui Wang and Changxin Tian and Yingqian Min and Zhichao Feng and Xinyan Fan and Xu Chen and Pengfei Wang and Wendi Ji and Yaliang Li and Xiaoling Wang and Ji{-}Rong Wen},
  title     = {RecBole: Towards a Unified, Comprehensive and Efficient Framework for Recommendation Algorithms},
  booktitle = {{CIKM}},
  pages     = {4653--4664},
  publisher = {{ACM}},
  year      = {2021}
}
@inproceedings{recbole[2.0],
  author    = {Wayne Xin Zhao and Yupeng Hou and Xingyu Pan and Chen Yang and Zeyu Zhang and Zihan Lin and Jingsen Zhang and Shuqing Bian and Jiakai Tang and Wenqi Sun and Yushuo Chen and Lanling Xu and Gaowei Zhang and Zhen Tian and Changxin Tian and Shanlei Mu and Xinyan Fan and Xu Chen and Ji{-}Rong Wen},
  title     = {RecBole 2.0: Towards a More Up-to-Date Recommendation Library},
  booktitle = {{CIKM}},
  pages     = {4722--4726},
  publisher = {{ACM}},
  year      = {2022}
}
@inproceedings{recbole[1.2.0],
  author    = {Lanling Xu and Zhen Tian and Gaowei Zhang and Junjie Zhang and Lei Wang and Bowen Zheng and Yifan Li and Jiakai Tang and Zeyu Zhang and Yupeng Hou and Xingyu Pan and Wayne Xin Zhao and Xu Chen and Ji{-}Rong Wen},
  title     = {Towards a More User-Friendly and Easy-to-Use Benchmark Library for Recommender Systems},
  booktitle = {{SIGIR}},
  pages     = {2837--2847},
  publisher = {{ACM}},
  year      = {2023}
}

The Team

RecBole is developed by RUC, BUPT, ECNU, and maintained by RUC.

Here is the list of our lead developers in each development phase. They are the souls of RecBole and have made outstanding contributions.

| Time | Version | Lead Developers | Paper |
|------|---------|-----------------|-------|
| June 2020 ~ Nov. 2020 | v0.1.1 | Shanlei Mu (@ShanleiMu), Yupeng Hou (@hyp1231), Zihan Lin (@linzihan-backforward), Kaiyuan Li (@tsotfsk) | PDF |
| Nov. 2020 ~ Jul. 2022 | v0.1.2 ~ v1.0.1 | Yushuo Chen (@chenyushuo), Xingyu Pan (@2017pxy) | PDF |
| Jul. 2022 ~ Nov. 2023 | v1.1.0 ~ v1.1.1 | Lanling Xu (@Sherry-XLL), Zhen Tian (@chenyuwuxin), Gaowei Zhang (@Wicknight), Lei Wang (@Paitesanshi), Junjie Zhang (@leoleojie) | PDF |
| Nov. 2023 ~ now | v1.2.0 | Bowen Zheng (@zhengbw0324), Chen Ma (@Yilu114) | PDF |

License

RecBole uses the MIT License. All data and code in this project can only be used for academic purposes.

Acknowledgments

This project was supported by National Natural Science Foundation of China (No. 61832017).

recbole's People

Contributors

2017pxy, ahuiwang, aoidragon, believefxy, bishopliu, changxintian, chenglongma, chenyushuo, cieemio, cyli-tiger, dexterruc, eliverq, ethan-tz, flust, guan-jw, guijiql, hyp1231, leoleojie, linzihan-backforward, lyj1998, paitesanshi, richardhgl, shanleimu, sherry-xll, tangjiakai, taytroye, tsotfsk, wicknight, yibo-li-1, zhengbw0324


recbole's Issues

[🐛BUG] Out-of-memory for LightGCN on the gowalla dataset

Describe the bug
I am unable to run LightGCN on the gowalla dataset. It appears it is out-of-memory, by a lot:

MemoryError: Unable to allocate 511. GiB for an array with shape (107093, 1280970) and data type float32    

I imagine it may be user error on my part, but even with the following direct command and different parameters, I can't get it to fit:

$ python3 run_recbole.py --dataset=gowalla --model=LightGCN --metrics=Recall --valid_metric=Recall@100 --topk=100 --train_batch_size=4 --eval_batch_size=4

To Reproduce
Steps to reproduce the behavior:

  1. extra yaml file: None
  2. your code: None
  3. script for running: See screenshots

Expected behavior
Expected the ability to run gowalla dataset using LightGCN. Looking for a way to fit it into memory

Screenshots
Here is a trace of the full run:

Command:

$ python3 run_recbole.py --dataset=gowalla --model=LightGCN --metrics=Recall --valid_metric=Recall@100 --topk=100 --train_batch_size=4 --eval_batch_size=4

Output:

20 Nov 23:58    INFO General Hyper Parameters:
gpu_id=0          
use_gpu=True 
seed=2020                                                                                                                                                                                                          [76/1356]
state=INFO   
reproducibility=True
data_path=dataset/gowalla
                    
Training Hyper Parameters:
checkpoint_dir=saved              
epochs=300       
train_batch_size=4     
learner=adam              
learning_rate=0.001         
training_neg_sample_num=1   
eval_step=1                  
stopping_step=10         

Evaluation Hyper Parameters:
eval_setting=RO_RS,full
group_by_user=True                                                                
split_ratio=[0.8, 0.1, 0.1]
leave_one_num=2                                                      
real_time_process=True
metrics=Recall              
topk=100
valid_metric=Recall@100
eval_batch_size=4

Dataset Hyper Parameters:
field_separator=
seq_separator=
USER_ID_FIELD=user_id
ITEM_ID_FIELD=item_id
RATING_FIELD=rating
LABEL_FIELD=label
threshold=None
NEG_PREFIX=neg_
load_col={'inter': ['user_id', 'item_id']}
unload_col=None
additional_feat_suffix=None
max_user_inter_num=None
min_user_inter_num=0
max_item_inter_num=None
min_item_inter_num=0
lowest_val=None
highest_val=None                                                                                                                                                                                                 
equal_val=None                                
not_equal_val=None                                      
drop_filter_field=True                                  
fields_in_same_space=None         
fill_nan=True                                
preload_weight=None                                                                       
drop_preload_weight=True                                                                                
normalize_field=None                                                           
normalize_all=True                                                                                              
ITEM_LIST_LENGTH_FIELD=item_length                                
LIST_SUFFIX=_list                                                                                                       
MAX_ITEM_LIST_LENGTH=50                                      
POSITION_FIELD=position_id                                                                                        
HEAD_ENTITY_ID_FIELD=head_id            
TAIL_ENTITY_ID_FIELD=tail_id                                                                                         
RELATION_ID_FIELD=relation_id            
ENTITY_ID_FIELD=entity_id                                                                                                     
                                                 
                                                                                                                                                                                                                  
20 Nov 23:58    INFO gowalla                                    
The number of users: 107093                                              
Average actions of users: 37.17675456616741
The number of items: 1280970                                          
Average actions of items: 3.1080635050496928
The number of inters: 3981333
The sparsity of the dataset: 99.99709779249932%
Remain Fields: ['user_id', 'item_id']
20 Nov 23:58    INFO Build [ModelType.GENERAL] DataLoader for [train] with format [InputType.PAIRWISE]
20 Nov 23:58    INFO Evaluation Setting:
        Group by user_id
        Ordering: {'strategy': 'shuffle'}
        Splitting: {'strategy': 'by_ratio', 'ratios': [0.8, 0.1, 0.1]}
        Negative Sampling: {'strategy': 'by', 'distribution': 'uniform', 'by': 1}
20 Nov 23:58    INFO batch_size = [[4]], shuffle = [True]

20 Nov 23:58    INFO Build [ModelType.GENERAL] DataLoader for [evaluation] with format [InputType.POINTWISE]
20 Nov 23:58    INFO Evaluation Setting:
        Group by user_id
        Ordering: {'strategy': 'shuffle'}
        Splitting: {'strategy': 'by_ratio', 'ratios': [0.8, 0.1, 0.1]}
        Negative Sampling: {'strategy': 'full', 'distribution': 'uniform'}
20 Nov 23:58    INFO batch_size = [[4, 4]], shuffle = [False]

20 Nov 23:58    WARNING Batch size is changed to 1280970
20 Nov 23:58    WARNING Batch size is changed to 1280970
Traceback (most recent call last):
  File "run_recbole.py", line 25, in <module>
    run_recbole(model=args.model, dataset=args.dataset, config_file_list=config_file_list)
  File "/home/szumowskit1/workspace/RecBole/recbole/quick_start/quick_start.py", line 45, in run_recbole
    model = get_model(config['model'])(config, train_data).to(config['device'])
  File "/home/szumowskit1/workspace/RecBole/recbole/model/general_recommender/lightgcn.py", line 69, in __init__
    self.norm_adj_matrix = self.get_norm_adj_mat().to(self.device)
  File "/home/szumowskit1/workspace/RecBole/recbole/model/general_recommender/lightgcn.py", line 90, in get_norm_adj_mat
    A[:self.n_users, self.n_users:] = self.interaction_matrix
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/lil.py", line 333, in __setitem__
    IndexMixin.__setitem__(self, key, x)
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/_index.py", line 116, in __setitem__
    self._set_arrayXarray_sparse(i, j, x)
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/lil.py", line 319, in _set_arrayXarray_sparse
    x = np.asarray(x.toarray(), dtype=self.dtype)
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/coo.py", line 321, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/szumowskit1/.venv/recbole/lib/python3.7/site-packages/scipy/sparse/base.py", line 1185, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError: Unable to allocate 511. GiB for an array with shape (107093, 1280970) and data type float32

Colab Links
N/A

Desktop (please complete the following information):

  • OS: Linux
  • RecBole Version: 0.1.1
  • Python Version: 3.7.6
  • PyTorch Version: 1.7.0+cu101
  • cudatoolkit Version: 10.1
  • GPU: P100 (16GB)
  • Machine Specs: 16 CPU machine, 60GB RAM

RuntimeError: Trying to create tensor with negative dimension -25: [-25]

Error during training

  • For testing, I collected a simple log and converted the file into the format required by RecBole, but running it raises: RuntimeError: Trying to create tensor with negative dimension -25: [-25]
  • Also, is the timestamp unit seconds or milliseconds?

Here is my code:

  1. test.yaml
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
RATING_FIELD: rating
TIME_FIELD: timestamp
load_col:
    inter: [rating, user_id, item_id, timestamp]

min_user_inter_num: 10
min_item_inter_num: 10
lowest_val:
    rating: 2
eval_setting: RO_RS,full
split_ratio: [0.8,0.1,0.1]
  2. run.py
from logging import getLogger

from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.general_recommender import BPR
from recbole.trainer import Trainer
from recbole.utils import init_seed, init_logger

if __name__ == '__main__':
    # Initialize the configuration
    config = Config(model='BPR', dataset='test')
    config['data_path'] = './test_data/'
    config['dataset'] = 'test'
    # Initialize the random seed to ensure reproducibility
    init_seed(config['seed'], config['reproducibility'])
    # Initialize logging
    init_logger(config)
    logger = getLogger()
    # Write the configuration to the log
    logger.info(config)
    # Dataset creation and filtering
    dataset = create_dataset(config)
    logger.info(dataset)
    # Dataset splitting
    train_data, valid_data, test_data = data_preparation(config, dataset)

    # Model loading and initialization
    model = BPR(config, train_data).to(config['device'])
    logger.info(model)
    # Trainer loading and initialization
    trainer = Trainer(config, model)

    # Model training
    best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

    # Model evaluation
    test_result = trainer.evaluate(test_data)
    print(test_result)

Below is a full screenshot of the error.

Here is a sample of my dataset:

OS information:

  • OS: Windows
  • RecBole Version: not sure
  • Python Version 3.6.12
  • PyTorch Version 1.7.0
  • cudatoolkit Version 11.1

I'm new to this; thanks in advance for your help!

[🐛BUG] Minor Typo on Atomic Files Page

Describe the bug
I wasn't sure where or in which repository this HTML is generated. But there is a minor typo on https://recbole.io/atomic_files.html. Under ml-1m.item, it says user_id:token. I believe that should say item_id:token.

To Reproduce
Go to https://recbole.io/atomic_files.html

Expected behavior
See description

Screenshots

Colab Links
If applicable, add links to Colab or other Jupyter laboratory platforms that can reproduce the bug.

Desktop (please complete the following information):

  • OS: [e.g. Linux, macOS or Windows]
  • RecBole Version [e.g. 0.1.0]
  • Python Version [e.g. 3.79]
  • PyTorch Version [e.g. 1.60]
  • cudatoolkit Version [e.g. 9.2, none]

Will you support MIND dataset?

Is your feature request related to a problem? Please describe.
MIND (MIcrosoft News Dataset), published by Microsoft, is a recommendation dataset composed of user behaviors and news.

Is there any chance that RecBole will support this dataset in the future?

Looking forward to your replies.

[💡SUG] Implement DICE for disentanglement beyond MF

The paper for DGCF says,

> Causal approaches usually serve as additional methods upon backbone models. We use the most adopted backbone, Matrix Factorization (MF) [27] to compare different approaches. Meanwhile, we also incorporate the state-of-the-art collaborate filtering model, Graph Convolutional Networks (GCN) [16,47,18], to investigate whether algorithms generalize across different backbones.

The paper reports even better performance applying DICE to LightGCN than MF. But the model implemented here appears to only apply the disentanglement to MF.

EDIT:

Oops, wrong paper. I was looking at "Disentangling User Interest and Popularity Bias for Recommendation with Causal Embedding", a paper citing the former with common authors.

I suppose this issue should be: implement DICE such that it can be applied generally to MF, LightGCN, etc., unlike DGCF, which is applied only to MF.

[💡SUG] Remove Duplicated User and Item Interaction

Is your feature request related to a problem? Please describe.
In some cases, we would like to merge duplicated user-item interactions by keeping only the earliest one (according to the timestamp). The rationale is to test how well a method recommends novel items that a user has not consumed before.

Describe the solution you'd like

  1. This should be an optional function, since it may not be universally desirable for general recommender systems;
  2. Looking through the implementation of Dataset, I think adding a new function within data_processing is one possible solution. Here is one example:
def _remove_duplication(self):
    # Keep only the earliest interaction (by timestamp) for each user-item pair.
    self.inter_feat = self.inter_feat.sort_values(by=[self.time_field], ascending=True)
    self.inter_feat = self.inter_feat.drop_duplicates(subset=[self.uid_field, self.iid_field], keep='first')

Describe alternatives you've considered
None

Additional context
None

[💡SUG] Add callback function hook for end of training an epoch

Is your feature request related to a problem? Please describe.
I do not currently see a built-in way to assign a callback function to run after an epoch completes. This is helpful for pytorch-native capabilities as well as custom uses. For example, I use callbacks on epoch-complete for external distributed hyperparameter optimization routines.

Describe the solution you'd like
Add a callback function hook into the trainers that provides epoch index and the validation score for use externally. See context section for an example patch. I'm happy to PR it in, or feel free to take as you see appropriate.

Describe alternatives you've considered
None considered.

Additional context

I attached a patch file that adds an optional callback_fn. Note: This patch only shows it for the Trainer class, not the child classes. That is because this is for an example only. Also, if adopting this, it may be worth adding some try/except protection around the callback function so that any malformed callback behavior doesn't impact the main training process. Since that is a design decision, I didn't include it in the example.

trainer.py.patch.txt
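To make the proposal concrete, here is a minimal standalone sketch of the epoch-end callback pattern; this is not RecBole code, and the names run_epoch, validate and callback_fn are placeholders:

from typing import Callable, Optional

def fit(epochs: int,
        run_epoch: Callable[[int], float],
        validate: Callable[[int], float],
        callback_fn: Optional[Callable[[int, float], None]] = None) -> None:
    for epoch_idx in range(epochs):
        train_loss = run_epoch(epoch_idx)   # one pass over the training data
        valid_score = validate(epoch_idx)   # evaluate on the validation set
        if callback_fn is not None:
            try:
                # Expose epoch index and validation score to external tooling,
                # e.g. a distributed hyperparameter optimization routine.
                callback_fn(epoch_idx, valid_score)
            except Exception as exc:
                # Guard so a malformed callback cannot crash the training loop.
                print(f'callback_fn failed at epoch {epoch_idx}: {exc}')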

[💡SUG] Do you have training and evaluation speed reference benchmarks?

Is your feature request related to a problem? Please describe.
Could you post how long it typically takes to train and evaluate one epoch for each model? Even numbers for just one large dataset (e.g., MovieLens-1M) would be helpful for the community.

I notice #484 and #485. I understand from that PR there are no plans to keep a scoreboard.

However, it's difficult to decide whether an algorithm is worth benchmarking, because any given algorithm may take hours to run a single epoch on a GPU.

For example, in a private dataset comparable to MovieLens-10M, I'm seeing drastically different training times across the general recommenders using a P100 GPU, from a few seconds/epoch to several minutes/epoch.

Having preliminary train/evaluation times would help a user understand accuracy vs. speed tradeoffs. It would also help users and developers benchmark speeds against other open-source implementations.

Describe the solution you'd like
A preliminary list of training time per-epoch and evaluation time per-epoch using default configurations for each recommender, using MovieLens-1M dataset.

Describe alternatives you've considered
N/A

Additional context
N/A

[Question] Custom Dataset

Is there an example of generating a custom dataset?
Do I need to transform the raw data into atomic files and then into a dataframe? Or can I simply transform the raw files (let's assume txt) into a dataframe and feed it to a custom dataloader?

[QUESTION] Q about sequential models

Hi
I would like to experiment on several sequential models.
I have a set of user sessions consisting of clicks, and I want to predict the next click at each stage of a session.
For example, a session has: click1, click2, click3, click4.
I want to be able to predict from click1 to click2, and also given click1 and click 2 I want to predict click3 and so on.
Not just the last click.
In other models I examined, I had to build the training data by feeding each session click by click, meaning:
X1: click1
Y1: click2
X2: click1, click2
Y2: click3
X3: click1, click2,click3
Y3: click4

Should I do the same using your library or no need?
Many thanks!!
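For context, the prefix expansion described above can be sketched as follows (illustrative only; sequential dataloaders typically perform this kind of augmentation internally):

# Expand one session into (prefix, target) training pairs:
# [c1, c2, c3, c4] -> ([c1], c2), ([c1, c2], c3), ([c1, c2, c3], c4)
def augment_session(session):
    return [(session[:i], session[i]) for i in range(1, len(session))]

print(augment_session(['click1', 'click2', 'click3', 'click4']))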

[🐛BUG] Implausible metrics?

Trying out my implementation of SLIM with ElasticNet #621 I'm noticing some implausible numbers. Dataset is ml-100k with all defaults. Using default hyperparameters of my method defined in its yaml file (not yet well-chosen because these results are so off) https://github.com/RUCAIBox/RecBole/blob/41a06e59ab26482dbfac641caac99876c167168c/recbole/properties/model/SLIMElastic.yaml

Using this standard copy-pasted code

dataset_name = "ml-100k"

model = SLIMElastic

config = Config(model=model, dataset=dataset_name)
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

logger.info(config)

# dataset filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = model(config, train_data).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)

logger.info('best valid result: {}'.format(best_valid_result))
logger.info('test result: {}'.format(test_result))

Results:
INFO test result: {'recall@10': 0.8461, 'mrr@10': 0.5374, 'ndcg@10': 0.7102, 'hit@10': 1.0, 'precision@10': 0.6309}

Also, my HyperOpt log is highly suspicious

alpha:0.316482837679784, hide_item:False, l1_ratio:0.9890017268444972, positive_only:False
Valid result:
recall@10 : 0.8461    mrr@10 : 0.5368    ndcg@10 : 0.7099    hit@10 : 1.0000    precision@10 : 0.6309    
Test result:
recall@10 : 0.8461    mrr@10 : 0.5374    ndcg@10 : 0.7102    hit@10 : 1.0000    precision@10 : 0.6309

...

alpha:0.47984629320482386, hide_item:False, l1_ratio:0.9907136437218732, positive_only:True
Valid result:
recall@10 : 0.8461    mrr@10 : 0.5368    ndcg@10 : 0.7099    hit@10 : 1.0000    precision@10 : 0.6309    
Test result:
recall@10 : 0.8461    mrr@10 : 0.5374    ndcg@10 : 0.7102    hit@10 : 1.0000    precision@10 : 0.6309

...

alpha:0.9530393537754144, hide_item:True, l1_ratio:0.24064058250190196, positive_only:True
Valid result:
recall@10 : 0.6251    mrr@10 : 0.3611    ndcg@10 : 0.4954    hit@10 : 0.9650    precision@10 : 0.4709    
Test result:
recall@10 : 0.6535    mrr@10 : 0.4012    ndcg@10 : 0.5357    hit@10 : 0.9745    precision@10 : 0.5019    

Exact same results with different parameters?

I figure if there is a mistake in my implementation it would cause bad performance, not amazing performance.

Anyone know what could be causing this?

How to convert a click dataset to .inter?

I have the following columns in my dataset. Which of them would be useful for building .inter data for a sequential recommender system?

Columns: event_time, event_type, product_id, category_id, category_code, brand, price, user_id, user_session_id
thanks
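For a sequential model, a plausible mapping keeps the user, item and time columns; a hypothetical .inter header for these columns, using RecBole's typed-column convention (field choices here are illustrative, not prescriptive):

user_id:token	product_id:token	event_type:token	event_time:float

Remaining columns such as price:float could additionally be loaded as interaction features via load_col.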

[💡SUG] Add better simple non-neural baselines

The recent "A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research" https://arxiv.org/abs/1911.07698 found that relatively simple non-neural baselines which were properly tuned outperformed basically all the neural approaches they tested.

Looking at the results in the Appendix, the best-performing of those baselines seem to be:

All the baselines are also implemented in the repo for the aforementioned paper here https://github.com/MaurizioFD/RecSys2019_DeepLearning_Evaluation

I've already implemented EASE for RecBole here #609 (it is indeed easy). I may try doing the others as well.

And, btw, the best performing neural method in their tests was Mult-VAE, which I see is underway here #603

I assume that looking for papers which cite the papers of these methods is a good start for finding promising algorithms. Some examples,

  • Boosting Item-based Collaborative Filtering via Nearly Uncoupled Random Walks (2020) paper code
  • Personalized diffusions for top-n recommendation (2019) paper code video
  • Block-Aware Item Similarity Models for Top-N Recommendation (2020) paper

If I find more I'll probably add them here

AttributeError: module 'recbole.data.dataset' has no attribute 'GCSANDataset'

I am currently trying the GCSAN model.
It works fine when I use the default ml-100k dataset.
However, when I load preprocessed datasets such as diginetica from Google Drive, it just won't work.
The file structure is shown in the attached screenshot.
Code in run_recbole.py

import argparse

from recbole.quick_start import run_recbole


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', '-m', type=str, default='GCSAN', help='name of models')
    parser.add_argument('--dataset', '-d', type=str, default='diginetica', help='name of datasets')
    parser.add_argument('--config_files', type=str, default=None, help='config files')

    args, _ = parser.parse_known_args()

    config_file_list = args.config_files.strip().split(' ') if args.config_files else None
    run_recbole(model=args.model, dataset=args.dataset, config_file_list=config_file_list)

Full error code:

Traceback (most recent call last):
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\recbole\data\utils.py", line 35, in create_dataset
    return getattr(importlib.import_module('recbole.data.dataset'), config['model'] + 'Dataset')(config)
AttributeError: module 'recbole.data.dataset' has no attribute 'GCSANDataset'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\pandas\core\indexes\base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'user_id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:/volume/SR baseline/run_recbole.py", line 16, in <module>
    run_recbole(model=args.model, dataset=args.dataset, config_file_list=config_file_list)
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\recbole\quick_start\quick_start.py", line 38, in run_recbole
    dataset = create_dataset(config)
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\recbole\data\utils.py", line 40, in create_dataset
    return SequentialDataset(config)
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\recbole\data\dataset\sequential_dataset.py", line 38, in __init__
    super().__init__(config, saved_dataset=saved_dataset)
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\recbole\data\dataset\dataset.py", line 100, in __init__
    self._from_scratch()
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\recbole\data\dataset\dataset.py", line 113, in _from_scratch
    self._data_processing()
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\recbole\data\dataset\dataset.py", line 152, in _data_processing
    self._data_filtering()
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\recbole\data\dataset\dataset.py", line 172, in _data_filtering
    self._filter_nan_user_or_item()
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\recbole\data\dataset\dataset.py", line 607, in _filter_nan_user_or_item
    dropped_inter = self.inter_feat.index[self.inter_feat[field].isnull()]
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\pandas\core\frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\Wade\Anaconda3\envs\rosetta_GNN\lib\site-packages\pandas\core\indexes\base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'user_id'

Process finished with exit code 1

About algorithm release process

Algorithm release process:

  1. A developer or participant (called the implementer) from our team writes the code implementing an algorithm. In this process, he/she checks whether open source code is available. If so, the implementer mainly adapts that implementation to our framework; otherwise, the implementer mainly refers to the descriptions in the original paper.

  2. A second developer or participant (called the reviewer) is invited to review the code based on his/her understanding of the algorithm after reading the original paper. The above two steps may repeat several times if a bug is found.

  3. For each implemented algorithm, we assign a third participant (called the tester) to test its performance on two or four datasets (which you can obtain from our project). The tester's major duty is to check whether the performance is reasonable under automatic parameter tuning (supported by our library), e.g., whether it yields a very poor result. If such a case is found, we return the results to the implementer to check the code again.

By using such an implementer-reviewer-tester mechanism, we have tried our best to make our code accurately follow the original papers. However, for algorithms without open source code, we cannot recover all the implementation details from the original paper; we implemented these details mostly based on our understanding of it. In other words, we can't guarantee that all algorithms are implemented exactly as originally intended, e.g., the negative sampling scheme or optimization tricks. For algorithms originally implemented on a different platform or in a different language (e.g., TensorFlow or Java), there may also be implementation-level variations. If you find an issue with an algorithm implementation in our library, please kindly let us know.

Note that although we have tested all the implemented models on at least two datasets (most are tested on four), we will never maintain a scoreboard comparing the performance of different algorithms. In other words, we will not release test results of the implemented models. However, we have prepared default parameter settings for all models, so you can obtain their performance with very simple commands. For example, to run the BPR model on the ml-100k dataset, you can run the following script:

python run_recbole.py --dataset=ml-100k --model=BPR 

Note that the default parameters are not the optimal settings. That is why our library provides the automatic parameter tuning function. For an introduction to the model parameters and the usage of automatic parameter tuning, please refer to our API docs.

Thanks for your attention. We will keep the model implementations updated to remove or avoid any possible implementation issues.

[💡SUG] Do you support the ability to extract recommendations after training and evaluation?

Is your feature request related to a problem? Please describe.
After training and evaluation, I am struggling to figure out how to run the models on data after the fact. For example,

  • suppose I have a series of user IDs for which I'd like to get the top-10 rankings, or
  • suppose I have a user ID and an item ID, and would like to know the item's positional ranking for that user

Because the DataLoader class has several other elements built in (e.g., negative sampling config, used-ID information), I am unsure how to create a "fresh" DataLoader with the IDs I'd like to generate rankings for.

Describe the solution you'd like
An exposed interface/function to evaluate:

  • scores and/or rankings of items, given a user ID
  • score and position of an item, given an item ID and a user ID

Describe alternatives you've considered
I've tried to "override" the test_data DataLoader object in the quickstart with my own data, but ran into indexing issues and figured I'd ask first.

Additional context
N/A -- Thank you for open-sourcing this! Really cool!
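For reference, later RecBole versions expose helpers for exactly this use case; the following is a sketch, so check recbole.utils.case_study in your installed version, as names and signatures may differ:

from recbole.quick_start import load_data_and_model
from recbole.utils.case_study import full_sort_topk

# Restore config, model and dataloaders from a saved checkpoint
# (the checkpoint path here is hypothetical).
config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file='saved/BPR-xxx.pth',
)

# Map external user tokens to internal ids, then rank all items for them.
uid_series = dataset.token2id(dataset.uid_field, ['196'])
topk_score, topk_iid_list = full_sort_topk(
    uid_series, model, test_data, k=10, device=config['device']
)
external_item_list = dataset.id2token(dataset.iid_field, topk_iid_list.cpu())
print(topk_score, external_item_list)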

[💡SUG] Clarification of "general" vs "context-aware"

I'm a bit confused about the distinction between "general" and "context-aware".

Many of the models which are here listed as "general" can incorporate user/item features.

NeuMF paper

they can be customized to support a wide range of modelling of users and items, such as context-aware [28, 1], content-based [3], and neighbor-based [26]. Since this work focuses on the pure collaborative filtering setting, we use only the identity of a user and an item as the input feature, transforming it to a binarized sparse vector with one-hot encoding. Note that with such a generic feature representation for inputs, our method can be easily adjusted to address the cold-start problem by using content features to represent users and items.

GCMC paper

The item embedding v_i is calculated analogously with the same parameter matrix W. In the presence of user- and item-specific side information we use separate parameter matrices for user and item embeddings.

etc.

Are the contextual models just the ones which have predefined ways of embedding features? Suppose I have the Movielens dataset and I want to use just the IDs as user embeddings, but for item embeddings I want to use the IDs and the genre tags. Or, suppose I want to embed the movie synopses with Word2Vec. Can I do that with the "general" recommendation models?

Looking at how the ContextRecommender class embeds features, it basically seems to do a bunch of lookup embeddings (and leaves floats as they are). How, then, would I incorporate Word2Vec of a movie synopsis with the "context-aware" models?

Relatedly,

double_tower_embed_input_fields
Embed the whole feature columns in a double tower way.

What is a double tower way?

I think the docs have a typo here https://recbole.io/docs/user_guide/usage/running_different_models.html

General recommendation models usually needs to group data by user and perform negative sampling.

General recommendation models usually does not need to group data by user and perform negative sampling.

I'm assuming the latter one should say:

Context-aware recommendation models usually do not need to group data by user or perform negative sampling.

Why don't context-aware models group data by user or perform negative sampling?

[🐛BUG] Error loading a config file in PyCharm

Hi
I manage to run the model through command line, but when trying to get topK from here: #506
I must use PyCharm, so I altered the config code to look like this:

if __name__ == '__main__':
    # Load config, dataset and model (if you have load these things, you can skip these codes)
    config = Config(model='GRU4Rec', dataset='Ofek',config_file_list='OfekConfig.yaml')

I located the yaml file in the project directory, and yet I get: No such file or directory: 'O'
When I look into the Config code, I see that in configurator.py, in the loop
for file in file_list:
the iterator yields file='O' instead of the file name I passed.

Please advise. Thanks, Maya!
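For context, the 'O' in the error is the first character of 'OfekConfig.yaml': iterating over a bare string yields single characters, so config_file_list most likely needs to be a list. A sketch of the presumed fix:

from recbole.config import Config

# Pass the config file inside a list; a bare string would be iterated
# character by character, producing 'O' as the first "file name".
config = Config(model='GRU4Rec', dataset='Ofek',
                config_file_list=['OfekConfig.yaml'])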

SequentialDataset inter_feat has label leakage

I am developing a sequential model similar to LightGCN, so I followed the current LightGCN implementation to fetch the historical interactions from train_data.inter_feat, but the performance is way higher than I expected.

Then I carefully checked whether the test labels could possibly leak. Unfortunately, they do.

During leave_one_out in sequential_dataset, the whole dataset is copied directly and then indexed according to the item indices. In general this should be fine, but the key problem is that the inter_feat of the train_set, valid_set and test_set are identical. So when I directly fetch the inter_feat from the train set, the test labels have leaked.

In comparison, the implementation in the general dataset selects the corresponding indices and saves the partitioned inter_feat, so LightGCN does not suffer from this issue.

I am not sure whether this counts as a bug, but it does carry some risk. If a user is not aware of this subtle issue with the sequential dataset, it can produce a false improvement in performance.

[Question] How does a session start and end?

Hi
I would like to try a few sequence models.
When I create the inter atomic file, how can I mark the start and end of a session?
None of the features can explicitly say so.
Should each user id appear only once?
Thanks

[💡SUG] A new tutorial to customized model

Is your feature request related to a problem? Please describe.
Following the documentation, I am able to build a customized model of my own. I would then like to plug my model into the current quick-start tutorial to run through the pipeline.

However, I find this hard to do. Looking through the repo, the challenge is that get_model automatically selects a model class based on the model name, and customized models are not included. (Please correct me if my understanding is wrong.)

self.internal_config_dict['MODEL_TYPE'] = get_model(model).type

I also notice that when the model is created, get_model is called again later:
https://github.com/RUCAIBox/RecBole/blob/master/recbole/quick_start/quick_start.py#L45

In summary, to create a model we need to provide a config and a dataset; however, to successfully create a config, we must go through get_model, which only retrieves existing models. So at least when following the quick-start tutorial, we cannot train a customized model.

Describe the solution you'd like
For the function get_model: if no existing module named model_name can be found, additionally search a path that the user can specify. This path would provide the module for the customized model.

def get_model(model_name):
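To make the suggestion concrete, a hypothetical sketch of such a fallback (the custom_module parameter is not part of RecBole, and the built-in lookup is simplified to a single package):

import importlib

def get_model(model_name, custom_module=None):
    # Try a built-in RecBole package first; on failure, fall back to a
    # user-specified module path that hosts the customized model class.
    try:
        builtin = importlib.import_module('recbole.model.general_recommender')
        return getattr(builtin, model_name)
    except (ImportError, AttributeError):
        if custom_module is None:
            raise
        return getattr(importlib.import_module(custom_module), model_name)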

Describe alternatives you've considered
Would it be possible to provide a tutorial introducing the pipeline for a customized model/trainer/sampler?

[💡SUG] Adding an explanation of "batch_size" for the eval setting

Is your feature request related to a problem? Please describe.
When I set "eval setting" with uni100 that randomly selects 100 as the negative item and 1 as positive, I receive the warning that batch_size is changed. Dive into the code, I notice that in the current implementation, the batch_size for the eval is actually the max interaction number for all batches.
I assume the batch_size is equivalent to the number of users for the evaluation since we evaluate across all users, but in Recbole that depends on the variable step. In the init function, batch_size is also assigned to step.

More specifically, for the negative-sampling dataloader it reaches

if dl_format == InputType.POINTWISE:

I understand the motivation: for each batch, instead of sampling negatives per user, we can first sample a subset and then resample it to generate the negative items for each user in the batch. No doubt this saves time.
However when it comes to batch_size_adaptation
def _batch_size_adaptation(self):

The step size may change correspondingly; intuitively, this means we evaluate fewer users per batch.
(Going back to my case: since I set real_time_process to True, the step is actually unchanged.)

In short, I think this is a little ambiguous relative to the common understanding of "batch_size".

Describe the solution you'd like
I suggest adding an explanation in the documentation so that users are not confused by the meaning of "batch_size" here.

[💡SUG] Sparse Optimizer training

Is your feature request related to a problem? Please describe.
Some optimizers in PyTorch allow sparse updates, which fit recommendation settings especially well, since each batch only touches part of the embeddings. Applying sparse optimization may speed up training.

Describe the solution you'd like
Allow users to specify sparse optimizers.

Additional context
https://pytorch.org/docs/stable/optim.html?highlight=sparseadam#torch.optim.SparseAdam
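For illustration, a minimal sketch of sparse-embedding training with SparseAdam; note that the embedding must be created with sparse=True, and this is plain PyTorch rather than existing RecBole configuration:

import torch
import torch.nn as nn

# Only the embedding rows touched in the batch receive gradient updates.
embedding = nn.Embedding(num_embeddings=100000, embedding_dim=64, sparse=True)
optimizer = torch.optim.SparseAdam(embedding.parameters(), lr=1e-3)

batch_ids = torch.randint(0, 100000, (256,))
loss = embedding(batch_ids).pow(2).sum()  # dummy loss for demonstration
optimizer.zero_grad()
loss.backward()    # produces a sparse gradient for embedding.weight
optimizer.step()   # applies a sparse, row-wise update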

[💡SUG] Implement more negative sampling loss functions

LightFM, in addition to BPR, implements the WARP and k-OS WARP losses, and finds that WARP outperforms BPR: https://making.lyst.com/lightfm/docs/examples/warp_loss.html (performance comparison, citations to relevant papers, etc. in this article).

The LightGCN papers says,

We are aware of other advanced negative sampling strategies which might improve the LightGCN training, such as the hard negative sampling [31] and adversarial sampling [9]. We leave this extension in the future since it is not the focus of this work.

Citation [31] is "Improving Pairwise Learning for Item Recommendation from Implicit Feedback" by Rendle, wherein he proposes the Adaptive Oversampling loss and demonstrates its improvement over BPR. (This paper is also cited in the LightFM docs above as similar to WARP.) At the end of the paper, he suggests his method might be faster than WARP (but doesn't compare).

These losses could of course be applied to many models, but my suggestion is that applying them to matrix factorization would be a good start. The "BPR" model could be renamed to "MF" and different loss functions could be passed as hyperparameters.
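To make the pluggable-loss idea concrete, here is a sketch of two pairwise losses that an MF model could accept as a hyperparameter (illustrative, not RecBole's API; WARP itself would additionally need rank-dependent weighting of sampled negatives):

import torch
import torch.nn.functional as F

def bpr_loss(pos_score, neg_score):
    # BPR: maximize the log-probability that positives outscore negatives.
    return -F.logsigmoid(pos_score - neg_score).mean()

def hinge_loss(pos_score, neg_score, margin=1.0):
    # Margin ranking loss: the building block that WARP reweights by rank.
    return torch.clamp(margin - pos_score + neg_score, min=0.0).mean()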

[💡SUG] Add progress bar to training an epoch

Is your feature request related to a problem? Please describe.
When training on large datasets, it's difficult to know how far along training is and whether or not I should stop training to tweak parameters to speed up the training process.

Describe the solution you'd like
Add tqdm progress bar to training progress. See context section. I have a suggested patch to trainer.py that adds this. Happy to PR it in, or feel free to take as you see appropriate.

Describe alternatives you've considered
One alternative is a manual print, but tqdm is the typical standard for this.

Additional context

trainer.py.patch.txt

I attached a patch file that adds an optional show_progress argument in a similar manner to what's used in DataLoader already. Note: This patch only shows it for the Trainer class, not the child classes. That is because this is for an example only.

The main change is this:

-        for batch_idx, interaction in enumerate(train_data):
+        iter_data = (
+            tqdm(
+                enumerate(train_data),
+                total=len(train_data),
+                desc=f"Train {epoch_idx:>5}",
+            )
+            if show_progress
+            else enumerate(train_data)
+        )
+        for batch_idx, interaction in iter_data:

Here is an example output while training a larger dataset. Scroll right to see full output.

Trainable parameters: 735267
Train     0:  20%|███████████████████████████▌                                                                                                           | 6480/31756 [10:47<42:09,  9.99it/s]

Model Performance

Which models have been tested? I suggest listing some test results.

Stuck iterating the DataLoader when using the Douban dataset

I downloaded the Douban dataset, which looks roughly like this:

user_id:token	item_id:token	rating:float	timestamp:float	likes_num:float
0	0	3	1431446400	2404
1	0	2	1429804800	1231
2	0	2	1429977600	1052
3	0	4	1429718400	1045
4	0	2	1429632000	723

Then I used the code from the documentation:

config_dict = {
    "load_col": {
        "inter": ["user_id", "item_id", "rating", "timestamp", "likes_num"],
    }
}
run_recbole(model='LightGCN', dataset='douban', config_dict=config_dict)

Debugging with breakpoints shows the following in the _train_epoch method of trainer.py:

for batch_idx, interaction in enumerate(train_data):
    print(batch_idx)

batch_idx is never printed.

With the ml-100k dataset the iteration works fine. Am I using it incorrectly, or is something else wrong?

I'm new to this and don't quite understand; any guidance is appreciated!

[🐛BUG] Waiting time for 1st epoch

Hi
My inter file has almost 8M lines.
How much time is it reasonable to wait at this stage (see attached screenshot)?

I did check on a smaller file and it worked.
How can I configure it to use more than one GPU?
Should I increase the log level in some way?

thanks!

[💡SUG] Get a prediction

Hi
How can I get a single prediction from a sequential model?
I have the *.pth file, how can I use it?

[💡SUG] Suggested hyperparameter ranges

Since most papers come with suggested hyperparameter options, I think it would be useful to have those ranges listed somewhere so people don't need to dig into the papers for them. One possibility would be premade model.hyper Hyperopt files with the typical ranges. This would make it easier to try out different baselines, etc.

Model cannot get access to interaction features other than item and time.

Describe the bug
I am developing a sequential model that needs interaction features such as rating and comment, but I find that the tensor the model gets from the corresponding field is not correct.

So I dove into the sequential dataloader and found an issue:

  1. In the function next_batch, whether real_time or not, both paths rely on the function augmentation
  2. In the function augmentation, based on my understanding, for a feature field other than item and time, the current implementation only records the feature of the target item, referring to https://github.com/RUCAIBox/RecBole/blob/master/recbole/data/dataloader/sequential_dataloader.py#L138
    It then records the item index and time index, but ignores the other feature fields of the history items, referring to
    https://github.com/RUCAIBox/RecBole/blob/master/recbole/data/dataloader/sequential_dataloader.py#L144

So that explains my case that when my model tries to get access to the rating of history items, the length of the corresponding tensor is 1, rather than being equal to the length of the tensor of max_seq_length

I think it could be a bug because basically, we will not care about the interaction features for the item to be predicted. Please be free to correct me if I am wrong.

Expected behavior
The sequential model should be able to access the other interaction features of the history items.

[🐛BUG] module 'recbole.data.dataset' has no attribute 'GRU4RecDataset' , KeyError: 'session_id'

Hi,
I tried to run GRU4Rec; this is the full error:


Traceback (most recent call last):
  File "C:\Users\Administrator\RecBole\recbole\data\utils.py", line 35, in create_dataset
    return getattr(importlib.import_module('recbole.data.dataset'), config['model'] + 'Dataset')(config)
AttributeError: module 'recbole.data.dataset' has no attribute 'GRU4RecDataset'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'session_id'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_recbole.py", line 25, in <module>
    run_recbole(model=args.model, dataset=args.dataset, config_file_list=config_file_list)
  File "C:\Users\Administrator\RecBole\recbole\quick_start\quick_start.py", line 38, in run_recbole
    dataset = create_dataset(config)
  File "C:\Users\Administrator\RecBole\recbole\data\utils.py", line 40, in create_dataset
    return SequentialDataset(config)
  File "C:\Users\Administrator\RecBole\recbole\data\dataset\sequential_dataset.py", line 38, in __init__
    super().__init__(config, saved_dataset=saved_dataset)
  File "C:\Users\Administrator\RecBole\recbole\data\dataset\dataset.py", line 100, in __init__
    self._from_scratch()
  File "C:\Users\Administrator\RecBole\recbole\data\dataset\dataset.py", line 113, in _from_scratch
    self._data_processing()
  File "C:\Users\Administrator\RecBole\recbole\data\dataset\dataset.py", line 152, in _data_processing
    self._data_filtering()
  File "C:\Users\Administrator\RecBole\recbole\data\dataset\dataset.py", line 172, in _data_filtering
    self._filter_nan_user_or_item()
  File "C:\Users\Administrator\RecBole\recbole\data\dataset\dataset.py", line 607, in _filter_nan_user_or_item
    dropped_inter = self.inter_feat.index[self.inter_feat[field].isnull()]
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'session_id'

This is my config file:

# Atomic File Format
field_separator: "\t"
seq_separator: " "

# Common Features
USER_ID_FIELD: session_id 
ITEM_ID_FIELD: item_id
TIME_FIELD: timestamp
seq_len: ~

# Label for Point-wise DataLoader
LABEL_FIELD: label
threshold: ~

# NegSample Prefix for Pair-wise DataLoader
NEG_PREFIX: neg_

# Selectively Loading
load_col:
    inter: [session_id, item_id, timestamp, PatientLocationID, GenderID, AgeGroup, JobGroup]
    # the others
unload_col: ~
additional_feat_suffix: ~

# Filtering
rm_dup_inter: ~
max_user_inter_num: ~
min_user_inter_num: 0
max_item_inter_num: ~
min_item_inter_num: 0
lowest_val: ~
highest_val: ~
equal_val: ~
not_equal_val: ~
drop_filter_field : True

# Preprocessing
fields_in_same_space: ~
fill_nan: True
preload_weight: ~
drop_preload_weight: True
normalize_field: ~
normalize_all: True

# Sequential Model Needed
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 50
POSITION_FIELD: position_id


# Benchmark .inter
benchmark_filename: ~

# general
gpu_id: 0
use_gpu: True
seed: 2020
state: INFO
reproducibility: True
data_path: 'dataset/Ofek'
checkpoint_dir: 'saved'

# training settings
epochs: 300
train_batch_size: 2048
learner: adam
learning_rate: 0.001
training_neg_sample_num: 1
eval_step: 1
stopping_step: 10

# evaluation settings
eval_setting: RO_RS,full
group_by_user: True
split_ratio: [0.8,0.1,0.1]
leave_one_num: 2
real_time_process: True
metrics: ["Recall", "MRR","NDCG","Hit","Precision"]
topk: [5]
valid_metric: MRR@5
eval_batch_size: 4096

and this is the header of my inter file:

JobGroup:token item_id:token PatientLocationID:token GenderID:token AgeGroup:token timestamp:float session_id:token

Where is my mistake?
Thanks!!

[💡SUG] Hyperparameter tuning without creating config files?

In other parts of the API it's possible to configure without creating a file, like with config_dict for run_recbole.

But the hyper-tuning API

    def __init__(self, objective_function, space=None, params_file=None, fixed_config_file_list=None,
                 algo='exhaustive', max_evals=100):

seems to only accept configuration from files, via params_file?

Is it not currently possible to do this programmatically as with config_dict, e.g.,

lightgcn_grid = {
    'embedding_size': [64, 128],
    'n_layers': [2, 3, 4],
    'reg_weight': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
}

[🐛BUG] Sparse dropout disabled during evaluation?

PyTorch currently doesn't have sparse dropout (see pytorch/pytorch#35798).

I see that it's being emulated here

class SparseDropout(nn.Module):
and
def sparse_dropout(x, rate, noise_shape):

Regular PyTorch Dropout is disabled when model.eval() is called, because training is set to False on the module:

self.model.eval()

Are these emulations of sparse dropout being disabled somewhere during evaluation?
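To make the question concrete: a module-based emulation is only turned off by model.eval() if its forward gates on self.training. A hedged sketch of that pattern (not RecBole's actual code):

import torch
import torch.nn as nn

class SparseDropout(nn.Module):
    """Sketch of a sparse-dropout emulation that respects eval mode."""

    def __init__(self, p=0.5):
        super().__init__()
        self.kprob = 1.0 - p  # keep probability

    def forward(self, x):
        # This check is what nn.Dropout relies on; without it the emulation
        # would keep dropping values even after model.eval().
        if not self.training:
            return x
        mask = (torch.rand(x._values().size()) + self.kprob).floor().bool()
        indices = x._indices()[:, mask]
        values = x._values()[mask] * (1.0 / self.kprob)
        return torch.sparse_coo_tensor(indices, values, x.shape)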

Dataset.save throws an error: TypeError: Object of type FeatureType is not JSON serializable

Describe the bug
Calling Dataset.save raises a TypeError because the FeatureType enum in the dataset's basic info cannot be JSON-serialized.

To Reproduce
from recbole.config import Config
from recbole.data.dataset import Dataset
from recbole.utils import init_seed

model = 'BPR'
dataset = 'ml-100k'
config_file_list = []
saved = True
param_dict = {
    "use_gpu": False,
    "topk": "[10]",
    "valid_metric": "MRR@10"
}

# configuration initialization
config = Config(model=model, dataset=dataset, config_file_list=config_file_list, config_dict=param_dict)
init_seed(config['seed'], config['reproducibility'])

dataset = Dataset(config)
dataset.save("D:/data/test/save_test")

Running the script above produces the following traceback:

Traceback (most recent call last):
  File "D:/source/recbole-0.1.2/recbole/example/dataset_eg.py", line 22, in <module>
    dataset.save("D:/data/test/save_test")
  File "D:\source\recbole-0.1.2\recbole\data\dataset\dataset.py", line 1366, in save
    json.dump(basic_info, file)
  File "C:\Users\whdu\.conda\envs\pytorch\lib\json\__init__.py", line 179, in dump
    for chunk in iterable:
  File "C:\Users\whdu\.conda\envs\pytorch\lib\json\encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "C:\Users\whdu\.conda\envs\pytorch\lib\json\encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "C:\Users\whdu\.conda\envs\pytorch\lib\json\encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "C:\Users\whdu\.conda\envs\pytorch\lib\json\encoder.py", line 438, in _iterencode
    o = _default(o)
  File "C:\Users\whdu\.conda\envs\pytorch\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type FeatureType is not JSON serializable
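The likely root cause, judging from the traceback: FeatureType is a Python Enum, which the stdlib json encoder refuses to serialize. A minimal workaround sketch, not the library's actual fix, would dump enum members by value:

import json
from enum import Enum

class FeatureTypeDemo(Enum):
    # Stand-in for recbole.utils.FeatureType, just for illustration.
    TOKEN = 'token'

class EnumEncoder(json.JSONEncoder):
    def default(self, o):
        # Fall back to the enum member's underlying value, which json accepts.
        if isinstance(o, Enum):
            return o.value
        return super().default(o)

print(json.dumps({'field_type': FeatureTypeDemo.TOKEN}, cls=EnumEncoder))
# -> {"field_type": "token"}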

Desktop (please complete the following information):

  • OS: Windows
  • RecBole Version 0.1.2
  • Python Version 3.7
  • PyTorch Version 1.7.1
  • cudatoolkit Version none
