hkust-knowcomp / fmg

KDD17_FMG

recommender-system factorization-machines heterogeneous-information-networks

fmg's Introduction

FMG

The code for the KDD17 paper "Meta-Graph Based Recommendation Fusion over Heterogeneous Information Networks" and its extended journal version "Learning with Heterogeneous Side Information Fusion for Recommender Systems".

Readers are welcome to fork this repository to reproduce the experiments and follow our work. Please kindly cite our papers:

@inproceedings{zhao2017meta,
  title={Meta-Graph Based Recommendation Fusion over Heterogeneous Information Networks},
  author={Zhao, Huan and Yao, Quanming and Li, Jianda and Song, Yangqiu and Lee, Dik Lun},
  booktitle={KDD},
  pages={635--644},
  year={2017}
}

@techreport{zhao2018learning,
  title={Learning with Heterogeneous Side Information Fusion for Recommender Systems},
  author={Zhao, Huan and Yao, Quanming and Song, Yangqiu and Kwok, James and Lee, Dik Lun},
  institution={arXiv preprint arXiv:1801.02411},
  year={2018}
}

We have released the related datasets: yelp-200k, amazon-200k, yelp-50k, and amazon-50k. If you run into any problems, you can create an issue. Note that the Amazon dataset is provided by Prof. Julian McAuley, so if you use this dataset in your paper, please cite the authors' papers as instructed on the website http://jmcauley.ucsd.edu/data/amazon/

Instructions

For ease of use, quick instructions are given below for readers to reproduce the whole process on the yelp-50k dataset. Note that the programs were tested on Linux (CentOS release 6.9) with Python 2.7 and NumPy 1.14.0 from Anaconda 4.3.6.

Prerequisites

  1. Unzip the file FMG_released_data.zip and create a directory data in this project's directory.
  2. Move yelp-50k and amazon-50k into the data directory, then create the directories sim_res/path_count and mf_features/path_count inside data/yelp-50k/exp_split/1/.
  3. Create a directory log in the project by mkdir log.
  4. Create a directory fm_res in the project by mkdir fm_res (a consolidated setup sketch follows this list).
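A consolidated setup sketch of the above steps, assuming the paths as given (the mkdir commands above work equally well; this is just Python doing the same thing):

import os

# Directories required before running the pipeline (steps 2-4 above).
dirs = [
    "data/yelp-50k/exp_split/1/sim_res/path_count",
    "data/yelp-50k/exp_split/1/mf_features/path_count",
    "log",
    "fm_res",
]
for d in dirs:
    if not os.path.isdir(d):
        os.makedirs(d)  # creates intermediate directories as needed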

Meta-graph Similarity Matrices Computation

To generate the similarity matrices on the yelp-50k dataset, run

python 200k_commu_mat_computation.py yelp-50k all 1

The arguments are explained as follows:

yelp-50k: specifies the dataset.
all: runs for all pre-defined meta-graphs.
1: runs for split dataset 1, i.e., exp_split/1.

One dependency is the bottleneck library; you may install it with pip install bottleneck. Note that the script calls bn.argpartsort, which Bottleneck 1.0 removed in favor of argpartition, so a pre-1.0 release is needed (see the first issue below for a workaround on newer versions).
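For intuition about what this script computes: the similarity (commuting) matrices of the paper are path counts along a meta-graph, obtained by multiplying adjacency matrices along the path, as the UBU and UBUB entries in the script's log suggest. A toy illustrative sketch (not the repo's exact code):

import numpy as np
from scipy.sparse import csr_matrix

# Toy user-business adjacency matrix: 3 users, 2 businesses.
U_B = csr_matrix(np.array([[1, 0],
                           [1, 1],
                           [0, 1]]))

UBU = U_B * U_B.T   # user-user path counts along the meta-path U-B-U
UBUB = UBU * U_B    # user-business path counts along U-B-U-B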

Meta-graph Latent Features Generation

To generate the latent features by MF based on the similarity matrices, run

python mf_features_generator.py yelp-50k all 1

The arguments are the same as above.
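Per the paper, each meta-graph similarity matrix is factorized into low-rank user and item factors, whose rows serve as the latent features. The repo uses its own gradient-based MF in mf.py; the following is only an illustrative stand-in using a truncated SVD:

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

S = sparse_random(100, 80, density=0.05, format='csr')  # a toy similarity matrix
U, sigma, Vt = svds(S, k=10)           # rank-10 factorization: S ~ U * diag(sigma) * Vt
user_features = U * np.sqrt(sigma)     # one row of latent features per user
item_features = Vt.T * np.sqrt(sigma)  # one row of latent features per item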

Note that, to improve computational efficiency, some modules are implemented in C and called from Python (see the load_lib method in mf.py). Thus, to successfully run python mf_features_generator.py, you need to compile two C source files. The following commands were tested on CentOS; readers may take them as references.

gcc -fPIC --shared setVal.c -o setVal.so
gcc -fPIC --shared partXY.c -o partXY.so

After compiling, you will have two files in the project directory: setVal.so and partXY.so.
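The loading itself happens in the load_lib method of mf.py; a minimal sketch of how such a shared object is typically loaded via ctypes follows (the function name and path handling here are illustrative assumptions, not the repo's exact code):

import ctypes
import os

def load_shared_lib(filename):
    # Load a compiled shared object (e.g. setVal.so) from the project directory.
    path = os.path.join(os.getcwd(), filename)
    return ctypes.cdll.LoadLibrary(path)

set_val = load_shared_lib('setVal.so')
part_xy = load_shared_lib('partXY.so')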

FMG

After obtaining the latent features, readers can run the FMG model as follows:

python run_exp.py config/yelp-50k.yaml -reg 0.5

One may read the comments in the files in the config directory for more information.
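At its core, FMG trains a factorization machine (with group lasso regularization, presumably what the -reg flag controls) over the concatenated meta-graph latent features. For intuition only, a minimal sketch of the standard second-order FM prediction (not the repo's implementation):

import numpy as np

def fm_predict(x, w0, w, V):
    # Second-order FM: bias + linear term + factorized pairwise interactions,
    # computed in O(k*d) as 0.5 * sum_f ((V^T x)_f^2 - ((V*V)^T (x*x))_f).
    pairwise = 0.5 * np.sum(V.T.dot(x) ** 2 - (V ** 2).T.dot(x ** 2))
    return w0 + w.dot(x) + pairwise

d, k = 20, 5                  # feature dimension and factor rank
x = np.random.rand(d)         # concatenated user/item meta-graph features
w0, w, V = 0.1, np.random.rand(d), np.random.rand(d, k)
print(fm_predict(x, w0, w, V))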

Misc

If you have any questions about this project, please open an issue, so that the answers can also help other people interested in this project. I will reply to your issues as soon as possible.

fmg's People

Contributors

hzhaoaf, quanmingyao


fmg's Issues

AttributeError: 'module' object has no attribute 'argpartsort'

get 39244 review from data/yelp-50k/exp_split/1/rids.txt
get 10 aspect from data/yelp-50k/exp_split/1/aids.txt
UBU((13664, 13664)), density=0.95760 cost 6.83 seconds
UBUB((13664, 8165)), density=0.81269 cost 11.37 seconds
get top 500 items, total 6832000 entries, cost 16.31 seconds
save 6832000 triplets in data/yelp-50k/exp_split/1/sim_res/path_count/URPARUB_top500.res
finish saving 6832000 URPARUB entries in data/yelp-50k/exp_split/1/sim_res/path_count/URPARUB_top500.res, cost 27.77 seconds
cal commut mat for URNARUB, filenames: data/yelp-50k/exp_split/1/uids.txt, data/yelp-50k/exp_split/1/bids.txt, data/yelp-50k/exp_split/1/uid_pos_bid.txt
get 13664 user from data/yelp-50k/exp_split/1/uids.txt
get 8165 biz from data/yelp-50k/exp_split/1/bids.txt
get 39244 review from data/yelp-50k/exp_split/1/rids.txt
get 10 aspect from data/yelp-50k/exp_split/1/aids.txt
UBU((13664, 13664)), density=0.18198 cost 1.43 seconds
UBUB((13664, 8165)), density=0.27085 cost 3.08 seconds
get top 500 items, total 2960500 entries, cost 12.40 seconds
save 2960500 triplets in data/yelp-50k/exp_split/1/sim_res/path_count/URNARUB_top500.res
finish saving 2960500 URNARUB entries in data/yelp-50k/exp_split/1/sim_res/path_count/URNARUB_top500.res, cost 16.72 seconds
cal commut mat for UUB, filenames: data/yelp-50k/exp_split/1/uids.txt, data/yelp-50k/exp_split/1/bids.txt, data/yelp-50k/exp_split/1/uid_pos_bid.txt
get 13664 user from data/yelp-50k/exp_split/1/uids.txt
get 8165 biz from data/yelp-50k/exp_split/1/bids.txt
get 39244 review from data/yelp-50k/exp_split/1/rids.txt
get 10 aspect from data/yelp-50k/exp_split/1/aids.txt
UBU((13664, 13664)), density=0.00014 cost 0.00 seconds
UBUB((13664, 8165)), density=0.00118 cost 0.00 seconds
get top 500 items, total 129658 entries, cost 8.13 seconds
save 129658 triplets in data/yelp-50k/exp_split/1/sim_res/path_count/UUB_top500.res
finish saving 129658 UUB entries in data/yelp-50k/exp_split/1/sim_res/path_count/UUB_top500.res, cost 8.34 seconds
to dense RA(39244, 10) cost 0.00 seconds
Traceback (most recent call last):
File "200k_commu_mat_computation.py", line 476, in
cal_yelp_all(split_num, dt)
File "200k_commu_mat_computation.py", line 459, in cal_yelp_all
cal_rar(path_str)
File "200k_commu_mat_computation.py", line 338, in cal_rar
RAR_csr = cal_rar_block(RA, len(rid2ind), ind2rid, step=20000)
File "200k_commu_mat_computation.py", line 387, in cal_rar_block
top100_inds = bn.argpartsort(-dot_res, tmp_topK, axis=1)[:,:tmp_topK]#10000 * 100,100 indices of the top K weights, column indices in dot_res
AttributeError: 'module' object has no attribute 'argpartsort'
mldl@ub1604:~/ub16_prj/FMG$
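A likely cause: Bottleneck 1.0 removed argpartsort in favor of argpartition (mirroring numpy.argpartition). One workaround, assuming the call site in the traceback above, is to switch to NumPy's argpartition, which gives the same guarantee that the first tmp_topK columns index the top-K values:

import numpy as np

dot_res = np.random.rand(5, 1000)  # stand-in for the similarity block
tmp_topK = 100

# After partitioning at kth = tmp_topK - 1, the first tmp_topK column indices
# point at the tmp_topK smallest entries of -dot_res, i.e. the largest
# entries of dot_res, in arbitrary order.
top100_inds = np.argpartition(-dot_res, tmp_topK - 1, axis=1)[:, :tmp_topK]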

TypeError: 'numpy.float64' object cannot be interpreted as an index

ub16hp@UB16HP:~/ub16_prj/FMG$ gcc -fPIC --shared setVal.c -o setVal.so
ub16hp@UB16HP:~/ub16_prj/FMG$ gcc -fPIC --shared partXY.c -o partXY.so
ub16hp@UB16HP:~/ub16_prj/FMG$ python mf_features_generator.py yelp-50k all 1
data: data/yelp-50k/exp_split/1/, path_str: all
finish load data from data/yelp-50k/exp_split/1/sim_res/path_count/UPBCatB_top500.res, cost 32.46 seconds, users: 13663, items=8133
start generate mf features, (K, eps, reg, iters) = (10, 10, 10, 500)
Traceback (most recent call last):
File "mf_features_generator.py", line 165, in
run_all_yelp()
File "mf_features_generator.py", line 91, in run_all_yelp
run(path_str)
File "mf_features_generator.py", line 49, in run
U,V = mf.run()
File "/home/ub16hp/ub16_prj/FMG/mf.py", line 122, in run
X = cm((self.data[:,2], (self.data[:,0], self.data[:,1]))) #index starting from 0
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 51, in init
other = self.class(coo_matrix(arg1, shape=shape))
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 150, in init
self._shape = check_shape((M, N))
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/sputils.py", line 281, in check_shape
new_shape = tuple(operator.index(arg) for arg in args)
File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/sputils.py", line 281, in
new_shape = tuple(operator.index(arg) for arg in args)
TypeError: 'numpy.float64' object cannot be interpreted as an index
ub16hp@UB16HP:~/ub16_prj/FMG$
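Newer SciPy releases require sparse-matrix indices (and the shape inferred from them) to be integers, so triplets loaded as float64 trigger this error. A minimal sketch of the usual fix, casting the index columns before building the matrix (the (row, col, value) triplet layout is assumed from the call in mf.py):

import numpy as np
from scipy.sparse import csr_matrix

# Toy (row, col, value) triplets loaded as a float array, as np.loadtxt would return.
data = np.array([[0.0, 1.0, 3.5],
                 [2.0, 0.0, 4.0]])

rows = data[:, 0].astype(np.int64)    # indices must be integers
cols = data[:, 1].astype(np.int64)
vals = data[:, 2]
X = csr_matrix((vals, (rows, cols)))  # shape is now inferred from integer indices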

issues about rating value

Dr. Zhao, sorry to disturb you. When I read your paper, I saw the sentence "FMG ignore the rating values, so it remains unknown whether it can further decrease RMSE if we adopt a similar approach to incorporate rating values into HIN". But in your code and data, the file "ratings.txt" shows the rating values, so I am confused about them. Can you explain?

Questions about mf_features

I came across your article https://mp.weixin.qq.com/s/6XMJJQQKolv1AS3em1ICZg online and found it very good, so I downloaded the code to read it.

Where do the files under mf_features/path_count come from? They seem to be related to the meta-graphs, yet your code uses them directly. How can I generate them locally myself? Thanks.

There seem to be quite a few code-quality issues worth improving:
(1) The data folder must be placed at a specific path.
(2) mkdir log
(3) mkdir fm_res
(4) In the run_regsvd function in run_exp.py, the line fm_ak_gl = MF(config, data_loader) references MF, which is undefined.

Input parameters of mf.py

Hi Dr. Zhao, I ran the latest code recently and was eventually stopped by an input error in the file 'mf.py'.

There is a class 'MF_BGD' in 'mf.py', and its '__init__' method needs a parameter 'data'; when this parameter is 'None', the function reads a data file called 'data/ml-1m-rating.txt'. I found no such file in your data package, so would you mind showing a format example of the file here? I don't know what form to transform my data into, or what kind of data I should use.

Also, I would appreciate it if you could explain the details of all the parameters of 'MF_BGD', such as 'train_data' and 'test_data'.

Thank you for your kind help! I am looking forward to your reply.

A question about RMSE accuracy

Hello, when using the Amazon-200k and Yelp-200k datasets you provided, I found some confusing results, and I am not sure whether the problem lies in my code or in the method of your paper. I used SVD++ with global_bias, user_bias, and item_bias, where global_bias is the average rating of the training set, and found that this method reaches an RMSE of 1.1992 on Yelp-200k and 1.1512 on Amazon-200k. As an especially simple baseline, on Yelp-200k I computed the mean of the ratings in ratings_train_1.txt, added a small random bias, and tested on ratings_test_1.txt, obtaining an RMSE of 1.266.
May I ask whether there is a problem with the datasets?

How to process the original data into similarity matrix?

May I ask how to process the original data into the similarity matrices? It seems that the file "mf_features_generator.py" is used to perform matrix factorization on each similarity matrix corresponding to each meta-graph. But how is the similarity obtained in the first place? That is, how is Section 2.1 of the paper implemented?
Looking forward to your reply, thanks!

An issue about code:cal_rar_block()

Hi~ I don't understand what this function does in the code file '200k_commu_mat_computation.py'.
Could you give me a tip or explain it? Thank you very much!

The original data of amazon-200K

Hi, the amazon-200k dataset is preprocessed and only has ID numbers. Is it possible to provide the original metadata of the products (i.e., the metadata of each row in bid.txt)? Thanks very much.

Questions about processing the amazon-200k data

Hello, when using the amazon-200k data, I have two questions:
(1) The file 200k_commu_mat_computation.py seems to perform the computation only for the Yelp data; there is no corresponding code for the Amazon data.
(2) After I manually modified part of 200k_commu_mat_computation.py and processed the amazon-200k data, the "URPARUB" path-str failed with MemoryError: Unable to allocate 22.2 GiB for an array with shape (2981433917,) and data type int64. Did you also use this file to process the Amazon data?

What does "neg" and "pos" mean in the dataset?

Hi Dr. Zhao, I've downloaded the datasets and run the code successfully. But I'm quite confused about what "neg" and "pos" mean in the files "uid_neg_bid.txt" and "uid_pos_bid.txt", and also in the files "uid_rid_neg_aid.txt" and "uid_rid_neg_aid_weight.txt". I guess "neg" means lower ratings and "pos" means higher ratings, but after comparing these files with ratings.txt, I found uid_neg_bid.txt contains ratings from 1 to 4 and uid_pos_bid.txt contains ratings from 1 to 5. It seems that my assumption is wrong.
Could you kindly explain what "pos" and "neg" mean in these files? Thank you for your help! Looking forward to your reply.

What does this dataset mean?

Sorry to disturb you, sir, but I do not know what this dataset means, such as aids.txt, bid_cat.txt, and bid_city.txt from yelp-50k\exp_split\1: what does each row's field mean? Likewise, for sim_res\path_count, such as UNBUB_top500.res, what does each row's field mean, and the third column does not represent a rating, right?

An issue about datasets

Merry Christmas, Dr. Zhao; sorry to disturb you. I have unzipped the amazon/yelp datasets and found uid_rid_pos_aid_weight.txt in them. Are they the original files in each dataset? If not, how did you calculate the weight for each user in each dataset?

No module named yaml

Hello~ When I run the following command "python run_exp.py config/yelp-50k.yaml", an error occurred:
Traceback (most recent call last):
File "run_exp.py", line 11, in
import yaml
ImportError: No module named yaml

Do you have any idea why this would happen?
My Python version is 2.7.10, and my OS is macOS 10.14.2. The previous steps went fine.
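The yaml module comes from the third-party PyYAML package, which is not in the Python standard library, so a likely fix is simply pip install pyyaml, run with the same Python 2.7 interpreter used for the project.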

What does "positive" mean in the datasets?

I have learned that the dataset has uid_pos_bid. Does "pos" mean positive, and what does it mean to say a uid and bid pair is positive? Is it that a user bought some item and gave it a high rating?

ValueError: cannot reshape array of size 49639 into shape (148917,1)

Traceback (most recent call last):
File "/home/zyl/Projects/Heterodata/FMG/mf_features_generator.py", line 165, in
run_all_yelp()
File "/home/zyl/Projects/Heterodata/FMG/mf_features_generator.py", line 95, in run_all_yelp
run(path_str)
File "/home/zyl/Projects/Heterodata/FMG/mf_features_generator.py", line 49, in run
U,V = mf.run()
^^^^^^^^
File "/home/zyl/Projects/Heterodata/FMG/mf.py", line 135, in run
obs = omega.copy().data.astype(np.float64).reshape(self.train_num, 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot reshape array of size 49639 into shape (148917,1)
When I run" python mf_features_generator.py yelp-50k all 1", this issue appears. Can you help me out? Thank you.

About Data

I don't understand the meaning of the '.txt' files in the data folder, such as 'uid_pos_bid.txt'. At the same time, I don't know how to obtain them from the original data.

an issue about dataset

Dr. Zhao, sorry to disturb you again. Does your original dataset come from the Yelp challenge? Does it have timestamps?
