
hkuds / lightgcl

[ICLR'2023] "LightGCL: Simple Yet Effective Graph Contrastive Learning for Recommendation"

Home Page: https://arxiv.org/abs/2302.08191

Python 100.00%
collaborative-filtering graph-contrastive-learning graph-neural-networks recommendation self-supervised-learning

lightgcl's Introduction

Hi there 👋

✨ Welcome to the Data Intelligence Lab @ HKU! ✨

🚀 Our lab is passionately dedicated to exploring the forefront of data science & AI 👨‍💻

lightgcl's People

Contributors

hkuds, rick-cai

lightgcl's Issues

InfoNCE

Hello! I'm a beginner who hasn't read many papers, so my question may be naive, but I still hope you can resolve my confusion. In most papers I've seen, InfoNCE uses only positive samples, while negative samples appear only in the BPR loss of the recommendation task. In this paper's code, however, the InfoNCE loss uses both positive and negative samples. I tried changing iid to pos so that only positive samples are used; performance dropped slightly but was still good. What is the difference between the two approaches, and which one is better?
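For concreteness, here is a minimal sketch of the two loss variants being compared (not the repository's code; z1 and z2 stand for the two views' embeddings of the same batch of users):

import torch
import torch.nn.functional as F

def cl_loss_with_negatives(z1, z2, temp=0.2):
    # InfoNCE with in-batch negatives: pull matching rows of the two views
    # together, push each row away from all other rows of the other view
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    pos = (z1 * z2).sum(1) / temp                   # (B,) positive-pair scores
    neg = torch.logsumexp(z1 @ z2.T / temp, dim=1)  # (B,) log-sum over all pairs
    return (neg - pos).mean()

def cl_loss_positive_only(z1, z2, temp=0.2):
    # alignment-only variant, roughly what using only positive samples amounts to:
    # maximize agreement of matching rows, with no explicit negative term
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    return (-(z1 * z2).sum(1) / temp).mean()

The negative term additionally spreads embeddings apart (uniformity); a positive-only alignment term can still work when another loss such as BPR prevents representation collapse, which may be why the performance only drops slightly.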

Some questions about the correctness of the code

Dear authors,
I am interested in the simple yet effective approach you propose. In February, I noticed this paper and downloaded its code from https://anonymous.4open.science/r/LightGCL/. Recently I have wanted to make some improvements based on this work, but I noticed that the code of the initial version may be incorrect. For the InfoNCE loss, the code implementation is as follows:

u_mask = (torch.rand(len(uids))>0.5).float().cuda(self.device)
gnn_u = nn.functional.normalize(self.Z_u_list[l][uids],p=2,dim=1)
hyper_u = nn.functional.normalize(self.G_u_list[l][uids],p=2,dim=1)
hyper_u = self.Ws[l-1](hyper_u)
pos_score = torch.exp((gnn_u*hyper_u).sum(1)/self.temp)
neg_score = torch.exp(gnn_u @ hyper_u.T/self.temp).sum(1)
loss_s_u = ((-1 * torch.log(pos_score/(neg_score+1e-8) + 1e-8))*u_mask).sum()

In my opinion, the neg_score should be "torch.exp(gnn_u @ self.G_u_list[l].T/self.temp).sum(1)" instead of the code above.
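Written out, the difference between the released code and this proposal (variable names follow the snippet above; a sketch, not a verified patch):

# in-batch negatives, as released: the denominator only covers users
# sampled into the current batch
neg_score = torch.exp(gnn_u @ hyper_u.T / self.temp).sum(1)

# proposed all-user negatives: contrast against the full normalized user
# table, so the denominator no longer depends on the batch composition
hyper_all = nn.functional.normalize(self.G_u_list[l], p=2, dim=1)
neg_score = torch.exp(gnn_u @ hyper_all.T / self.temp).sum(1)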
I am quite confused about whether the code is correct. If it is not, how could it achieve such state-of-the-art performance?
Looking forward to your reply.

The actual train/validation/test split is not consistent with the paper.

In Section 4.1.1 DATASETS AND EVALUATION PROTOCOLS: "we split the datasets into training, validation and testing sets with a ratio of 7:2:1."

However, in the code, I find that the raw data only contains two files, trnMat.pkl and tstMat.pkl. Moreover, the implementation directly uses the testing set as the validation set during training. Am I missing something in my reading of the code?

ไธๅฅฝๆ„ๆ€็œ‹้”™ไบ†

# sample pos and neg
pos = []
neg = []
iids = set()
for i in range(len(batch_users)):
    u = batch_users[i]
    u_interact = train_csr[u].toarray()[0]
    positive_items = np.random.permutation(np.where(u_interact==1)[0])
    negative_items = np.random.permutation(np.where(u_interact==0)[0])
    item_num = min(max_samp,len(positive_items))
    positive_items = positive_items[:item_num]
    negative_items = negative_items[:item_num]
    pos.append(torch.LongTensor(positive_items).cuda(torch.device(device)))
    neg.append(torch.LongTensor(negative_items).cuda(torch.device(device)))
    iids = iids.union(set(positive_items))
    iids = iids.union(set(negative_items))

In lines 132-137 of main.py, the sampling method the model uses is pseudo-random: within every training epoch, the users are visited in a fixed order rather than a random one. This can easily lead to overfitting.
I modified the model's code to replace this pseudo-random sampling (following the sampling methods of the RecBole framework and of LightGCN; a minimal sketch of per-epoch shuffling is given below). As a result, on the Yelp dataset recall@20 improved noticeably to 0.0938, while ndcg@20 dropped to 0.0508. In a side-by-side comparison, however, LightGCL then performed even worse than LightGCN, lower by about 0.02.
I suspect the authors also used this "pseudo-random" sampling when running the comparison experiments for the other models, which would make the baseline metrics reported in the paper far lower than normal.
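A minimal sketch of per-epoch shuffling (hypothetical helper, not the repository's code):

import numpy as np

def iterate_user_batches(n_users, batch_size, rng):
    # draw a fresh random user order at the start of every epoch,
    # then cut it into batches
    order = rng.permutation(n_users)
    for start in range(0, n_users, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(2023)
for epoch in range(10):
    for batch_users in iterate_user_batches(29601, 256, rng):
        pass  # sample positives/negatives for batch_users as in the snippet above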

P.S. One more point: after changing the sampling method, with all other parameters unchanged, the model suffered from exploding gradients. I believe the cause is the weight W that is added in the contrastive learning without normalization. I therefore tried both removing W and increasing temp to avoid the gradient explosion, but the results were still poor.

About dataset splits.

Hi. I am a very interested reader of your paper. I have a question about dataset splits and would be grateful if you answer me when you have time.

In the paper, each dataset has the following interactions:

Yelp: 1,517,326
Gowalla: 1,172,425
ML-10M: 9,988,816
Amazon-book: 2,240,156
Tmall: 2,357,450

and all datasets are split into training, validation, and testing with a ratio of 7:2:1, which means the training, validation, and testing sets should have the following interactions:

Yelp: 1,062,128 / 303,465 / 151,733 (70%, 20%, 10%)
Gowalla: 820,697 / 234,485 / 117,243 (70%, 20%, 10%)
ML-10M: 6,992,171 / 1,997,763 / 998,882 (70%, 20%, 10%)
Amazon-book: 1,568,109 / 448,031 / 224,016 (70%, 20%, 10%)
Tmall: 1,650,215 / 471,490 / 235,745 (70%, 20%, 10%)

But I found that the uploaded dataset has the following interactions: (trnMat.pkl and tstMat.pkl)

Yelp: 1069128 / - / 305466 (70%, -, 20%)
Gowalla: 1172425 / - / 130270 (100%, -, 10%)
ML-10M: 6999171 / - / 1999761 (70%, -, 20%)
Amazon-book: 2240156 / - / 640045 (100%, -, 30%)
Tmall: 2357450 / - / 261939 (100%, -, 10%)

which is far different from the paper. Am I missing something important?
I look forward to hearing back from you.
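For reference, the expected 7:2:1 counts above can be recomputed with a few lines (a sketch; counts may differ by one or so depending on rounding):

totals = {'Yelp': 1_517_326, 'Gowalla': 1_172_425, 'ML-10M': 9_988_816,
          'Amazon-book': 2_240_156, 'Tmall': 2_357_450}
for name, n in totals.items():
    # expected train / validation / test interactions under a 7:2:1 split
    print(name, round(n * 0.7), round(n * 0.2), round(n * 0.1))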

Is the SVD decomposition in the paper meaningful?

็ป่ฟ‡ๆถˆ่žๅฎž้ชŒๅ‘็Žฐๅˆ ๅŽปSVDๅˆ†่งฃ๏ผŒไธŽ่‡ช่บซ่Š‚็‚นๅฏนๆฏ”็ป“ๆžœไนŸๆ˜ฏไธ€ๆ ท็š„๏ผŒๅนถไธ”--lambda1็š„ๅ€ผไธบ1e-7่ฟ™ไนˆๅฐ๏ผŒcl_lossๆ˜ฏๅฆ็œŸ็š„ๆœ‰ไฝœ็”จๅ‘ข

SimGCL is underfitted in the paper

Hi authors, I am the author of SimGCL and have been following your group's work. I have recently been planning to reproduce LightGCL.

However, I found that in the LightGCL paper, SimGCL appears to be severely underfitted or run with wrongly chosen hyperparameters. I re-tested SimGCL on the yelp and gowalla datasets provided in this repo. Keeping the general setting consistent with your paper, I picked, somewhat arbitrarily from experience, the combination lambda_cl = 0.2, epsilon = 0.1, tau = 0.2 (the latter two values are near-optimal for SimGCL on most datasets). After only the second pass over all training samples, the results already exceeded those of LightGCL, and far exceeded the SimGCL results reported in your paper. My results are as follows:

Yelp: SimGCL, 2nd epoch: Recall@20: 0.0962 NDCG@20: 0.0833
Yelp: SimGCL, converged (9th epoch): Recall@20: 0.1048 NDCG@20: 0.0903
Yelp: SimGCL as reported in the LightGCL paper: Recall@20: 0.0718 NDCG@20: 0.0615
Yelp: LightGCL as reported in the paper: Recall@20: 0.0793 NDCG@20: 0.0668

Gowalla: SimGCL, 2nd epoch: Recall@20: 0.1739 NDCG@20: 0.1060
Gowalla: SimGCL, converged (10th epoch): Recall@20: 0.1893 NDCG@20: 0.1145
Gowalla: SimGCL as reported in the LightGCL paper: Recall@20: 0.1357 NDCG@20: 0.0818
Gowalla: LightGCL as reported in the paper: Recall@20: 0.1578 NDCG@20: 0.0935

The paper states, "To ensure a fair comparison, we tune the hyperparameters of all the baselines within the ranges suggested in the original papers." However, I found that your experiments may have actually used lambda_cl = 0.01 and tau = 0.1. lambda_cl = 0.01 was already shown in SimGCL's hyperparameter sensitivity study to be a poor choice on all three of its datasets. The SimGCL paper also states, "In SimGCL and SGL, we empirically let the temperature τ = 0.2, and this value is also reported as the best in the original paper of SGL." tau is in fact a rather sensitive parameter; in my experience, moving it from 0.2 to 0.1 causes large fluctuations in the results. It seems the LightGCL experiments did not consult the SimGCL paper.

All the SimGCL results above were obtained with SELFRec. If you are interested, we could compare whether our SimGCL implementations differ in some way that caused the problems in the paper.

------------------------- UPDATE -------------------------

I tried the combination lambda_cl = 0.2, epsilon = 0.1, tau = 0.2 on your group's SSLRec as well. On the yelp dataset, after two epochs it reached

Recall@20: 0.0929 NDCG@20: 0.0791

I stopped the run before it finished, but even two epochs are already better than the SimGCL and LightGCL results reported in the paper.

็„ถๅŽๆˆ‘ไนŸไฝฟ็”จไบ†SSLRec้‡Œ้ข้ป˜่ฎค็š„SimGCLๅ‚ๆ•ฐ lambda_cl = 0.01, epsilon = 0.2, tau=0.1, ไธ‰ๆฌก่ฟญไปฃ่ฎฐๅฝ•ๅฆ‚ไธ‹๏ผš

{'optimizer': {'name': 'adam', 'lr': 0.001, 'weight_decay': 0}, 'train': {'epoch': 100, 'batch_size': 256, 'save_model': False, 'loss': 'pairwise', 'test_step': 1}, 'test': {'metrics': ['recall', 'ndcg'], 'k': [10, 20], 'batch_size': 256}, 'data': {'type': 'general_cf', 'name': 'yelp', 'user_num': 29601, 'item_num': 24734}, 'model': {'name': 'simgcl', 'keep_rate': 1.0, 'layer_num': 2, 'reg_weight': 1e-06, 'cl_weight': 0.01, 'temperature': 0.1, 'embedding_size': 32, 'eps': 0.2}, 'tune': {'enable': False}, 'device': 'cuda'}
Training Recommender: 100%|██████████| 4177/4177 [03:48<00:00, 18.26it/s]
[Epoch 0 / 100] bpr_loss: 0.2147 reg_loss: 0.0304 cl_loss: 0.1139
[recall@10: 0.0439 recall@20: 0.0728 ] [ndcg@10: 0.0533 ndcg@20: 0.0621 ]
Training Recommender: 100%|██████████| 4177/4177 [03:48<00:00, 18.29it/s]
[Epoch 1 / 100] bpr_loss: 0.1037 reg_loss: 0.0536 cl_loss: 0.1022
[recall@10: 0.0478 recall@20: 0.0810 ] [ndcg@10: 0.0582 ndcg@20: 0.0686 ]
Training Recommender: 100%|██████████| 4177/4177 [03:48<00:00, 18.28it/s]
[Epoch 2 / 100] bpr_loss: 0.0929 reg_loss: 0.0623 cl_loss: 0.0932
[recall@10: 0.0499 recall@20: 0.0840 ] [ndcg@10: 0.0605 ndcg@20: 0.0711 ]

After the second epoch this run, too, already surpasses LightGCL's reported results. It also shows that this parameter set is indeed worse than the one recommended in the SimGCL paper.

The data argument in parser1

In parser1, default='yelp'. But when I change yelp to the gowalla or ml10m dataset, it still runs on the yelp dataset. Could the authors explain why this happens, and how to switch to running on other datasets?
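For what it's worth, here is a hedged sketch of the usual argparse pattern involved (names are assumptions, not the repository's exact code): the dataset only changes if the loader actually reads the parsed value, so a second parser definition or a hard-coded path can silently shadow the flag:

import argparse

parser1 = argparse.ArgumentParser()
parser1.add_argument('--data', default='yelp', type=str, help='dataset name')
args = parser1.parse_args()

# the data path must be built from args.data, not from a literal 'yelp'
path = 'data/' + args.data + '/'
print('loading', path + 'trnMat.pkl')

Passing the flag on the command line (python main.py --data gowalla) is also safer than editing the default in place.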

About the time complexity

Very nice work! However, the time complexity I computed for the graph convolution of LightGCL is O[2ELd + 2IJLd], which does not align with Table 2 in the paper. I think the reconstructed graph is fully connected (a dense matrix) and cannot use sparse matrix multiplication for acceleration. Can you help me figure this out? Thanks!
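One detail that may resolve this (a sketch of the standard low-rank trick, not necessarily the authors' exact implementation): the rank-q factors returned by torch.svd_lowrank can be kept separate, so propagating embeddings through the reconstructed graph never materializes the dense I×J matrix and costs O((I+J)qd) per layer instead of O(IJd):

import torch

I, J, d, q = 2000, 1500, 64, 5
E_v = torch.randn(J, d)                                        # item embeddings
U, S, V = torch.randn(I, q), torch.rand(q), torch.randn(J, q)  # SVD-style factors

dense_out = ((U * S) @ V.T) @ E_v    # materializes the (I, J) dense graph
lowrank_out = (U * S) @ (V.T @ E_v)  # same result by reassociation

print(torch.allclose(dense_out, lowrank_out, atol=1e-3))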

The performance in the paper is inconsistent with what the code produces.

On Yelp, the performance reported in the paper is low (Recall@20: 0.0793, Ndcg@20: 0.0668, Recall@40: 0.1292, Ndcg@40: 0.0852), while the actual run is high (Recall@20: 0.1005596274687573, Ndcg@20: 0.08650433827615736, Recall@40: 0.1597782188737718, Ndcg@40: 0.10812958940033758).

The details of the actual run are as follows:

Test of epoch 96 : Recall@20: 0.10051406435437939 Ndcg@20: 0.08627444271978929 Recall@40: 0.15986645365267868 Ndcg@40: 0.10792282233659943
100%|██████████| 262/262 [00:24<00:00, 10.64it/s]
Epoch: 97 Loss: 2.5118775304037197 Loss_r: 0.3031725097476071 Loss_s: 2.2050413157193716
100%|██████████| 262/262 [00:24<00:00, 10.68it/s]
Epoch: 98 Loss: 2.511937270637687 Loss_r: 0.30322049376163773 Loss_s: 2.205053930974189
100%|██████████| 262/262 [00:24<00:00, 10.59it/s]
Epoch: 99 Loss: 2.5120028903466145 Loss_r: 0.30331670953572254 Loss_s: 2.205021769945858
100%|██████████| 116/116 [00:08<00:00, 13.75it/s]
-------------------------------------------
Test of epoch 99 : Recall@20: 0.1005596274687573 Ndcg@20: 0.08650433827615736 Recall@40: 0.1597782188737718 Ndcg@40: 0.10812958940033758
-------------------------------------------
Final test: Recall@20: 0.1005596274687573 Ndcg@20: 0.08650433827615736 Recall@40: 0.1597782188737718 Ndcg@40: 0.10812958940033758

On ML-10M, the performance reported in the paper is high (Recall@20: 0.2613, Ndcg@20: 0.3106, Recall@40: 0.3799, Ndcg@40: 0.3387), while the actual run is low (Recall@20: 0.22966711970088424, Ndcg@20: 0.28407235346796683, Recall@40: 0.31642916993719605, Ndcg@40: 0.30047428117834374).

The details of the actual run are as follows:

Epoch: 98 Loss: 2.513269269026951 Loss_r: 0.3047634800356546 Loss_s: 2.18953038922104
100%|██████████| 1709/1709 [07:04<00:00, 4.03it/s]
Epoch: 99 Loss: 2.5132692800480556 Loss_r: 0.3047331753462082 Loss_s: 2.1895613562928227
100%|██████████| 273/273 [00:23<00:00, 11.86it/s]
-------------------------------------------
Test of epoch 99 : Recall@20: 0.22966711970088424 Ndcg@20: 0.28407235346796683 Recall@40: 0.31642916993719605 Ndcg@40: 0.30047428117834374
-------------------------------------------
Final test: Recall@20: 0.22966711970088424 Ndcg@20: 0.28407235346796683 Recall@40: 0.31642916993719605 Ndcg@40: 0.30047428117834374

BUG in svd_u,s,svd_v = torch.svd_lowrank(adj, q=svd_q)

While running the code, the following bug appeared:

torch._C._LinAlgError: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling cusolverDnXgeqrf( handle, params, m, n, CUDA_R_32F, reinterpret_cast<void*>(A), lda, CUDA_R_32F, reinterpret_cast<void*>(tau), CUDA_R_32F, reinterpret_cast<void*>(bufferOnDevice), workspaceInBytesOnDevice, reinterpret_cast<void*>(bufferOnHost), workspaceInBytesOnHost, info). This error may appear if the input matrix contains NaN.

But we have no idea how to fix it.
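In case it helps: as the message hints, this cusolver failure is usually triggered by NaN/Inf entries in the input, and the CPU path uses a different backend. A hedged workaround sketch (not the repository's code):

import torch

def safe_svd_lowrank(adj, q):
    # verify the input is finite before calling the GPU solver
    vals = adj.values() if adj.is_sparse else adj
    assert torch.isfinite(vals).all(), 'adjacency contains NaN/Inf'
    try:
        return torch.svd_lowrank(adj, q=q)
    except RuntimeError:
        # fall back to the CPU solver, then move the factors back
        u, s, v = torch.svd_lowrank(adj.cpu(), q=q)
        return u.to(adj.device), s.to(adj.device), v.to(adj.device)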

About the reproducibility and validation set

Hi. I am a reader who read your paper with great interest. After running your code myself, I have two questions. I would be grateful if you could answer them when you have time.

  1. When I run the command "python main.py --data yelp", the performance is higher than that reported in your paper. I am wondering why the results differ.
    [Performance table of my experiments]
                    Recall@20  Recall@40  NDCG@20  NDCG@40
Yelp     reported     0.0793     0.1292    0.0668   0.0778
         reproduced   0.1001     0.1587    0.0868   0.1080
Gowalla  reported     0.1578     0.2245    0.0935   0.1108
         reproduced   0.2124     0.2993    0.1236   0.1464
  2. In the paper, the validation set is 20% of the whole data, but it is not used in the code. Also, early stopping is not used in the code, so I am wondering whether the results reported in the paper were obtained without early stopping.

Thank you for your time.
I look forward to hearing back from you.

A question about the per-layer embedding computation in local graph dependency modeling

Hi authors! For the graph convolution, the formulas given in the paper are z(u)_{i,l} = σ(p(A_{i,:}) E(v)_{l-1}) and e(u)_{i,l} = z(u)_{i,l} + e(u)_{i,l-1}.
But your code implements:

aggregate

self.E_u_list[layer] = self.Z_u_list[layer]
self.E_i_list[layer] = self.Z_i_list[layer]
Is there a discrepancy here? Should self.E_u_list[layer-1] be added?
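For reference, the change the question suggests, matching the paper's residual formula (written against the attribute names quoted above; a sketch, not a verified patch):

self.E_u_list[layer] = self.Z_u_list[layer] + self.E_u_list[layer - 1]
self.E_i_list[layer] = self.Z_i_list[layer] + self.E_i_list[layer - 1]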

The performance in the paper is lower than what the code produces.

The performance in the paper is lower than what the code produces. For example, on gowalla the actual run gives Recall@20: 0.2103, Ndcg@20: 0.1223, Recall@40: 0.2991, Ndcg@40: 0.1453, but the paper reports lower numbers (Recall@20: 0.1578, Ndcg@20: 0.0935, Recall@40: 0.2245, Ndcg@40: 0.1108). Has the paper not been updated?

Consistency of experimental results

The original LightGCN paper also reports recall@20 and ndcg@20 on the gowalla dataset with 2 convolution layers. With the same number of layers, why is the LightGCN result shown in your paper so far from that in the original LightGCN paper?

How to visualize Embedding distributions

Thank you for your wonderful work. I am a newbie to recommendation. While reading your article "LightGCL: Simple Yet Effective Graph Contrastive Learning for Recommendation", I saw the visualization below. How can I reproduce it (on top of your existing code in this repository)?
[figure: embedding distribution visualization]
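For anyone with the same question, one common recipe for this kind of figure (a sketch under assumptions: model.E_u holds the final user embeddings, which is an assumed attribute name; this is not the authors' plotting script) is to project the embeddings to 2D with t-SNE and map them onto the unit circle before plotting:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = model.E_u.detach().cpu().numpy()        # final user embeddings (assumed name)
xy = TSNE(n_components=2).fit_transform(emb)  # project to 2D
xy /= np.linalg.norm(xy, axis=1, keepdims=True) + 1e-12  # map onto the unit circle

plt.figure(figsize=(4, 4))
plt.scatter(xy[:, 0], xy[:, 1], s=2, alpha=0.3)
plt.gca().set_aspect('equal')
plt.show()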

About the hyperparameter settings of the baseline HCCF on different datasets

Hi, I would like to ask how you set the hyperparameters of the baseline HCCF on each dataset in the current experiments. (I am currently running it on the Gowalla dataset, but the results are drastically worse.)

neg_score becomes Inf, causing a NaN loss

On custom data, neg_score becomes Inf, which turns the loss into NaN. The main buggy code is below:

LightGCL/model.py

Lines 78 to 80 in 5590453

neg_score = torch.log(torch.exp(G_u_norm[uids] @ E_u_norm.T / self.temp).sum(1) + 1e-8).mean()
neg_score += torch.log(torch.exp(G_i_norm[iids] @ E_i_norm.T / self.temp).sum(1) + 1e-8).mean()
pos_score = (torch.clamp((G_u_norm[uids] * E_u_norm[uids]).sum(1) / self.temp,-5.0,5.0)).mean() + (torch.clamp((G_i_norm[iids] * E_i_norm[iids]).sum(1) / self.temp,-5.0,5.0)).mean()
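A numerically stable rewrite sketch for these two lines: torch.logsumexp computes log(Σ exp(·)) without ever materializing exp(·), so it cannot overflow to Inf the way exp(...).sum(1) can:

neg_score = torch.logsumexp(G_u_norm[uids] @ E_u_norm.T / self.temp, dim=1).mean()
neg_score += torch.logsumexp(G_i_norm[iids] @ E_i_norm.T / self.temp, dim=1).mean()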

ๆ•ฐๆฎ้›†ๆž„ๅปบ

When constructing a dataset, how can user or item attributes be incorporated? Thanks!

Could you share the code and parameter settings of the baseline GCA?

When reproducing the baselines, GCA's performance differs from that reported in the paper, so I would like to ask whether the authors could share the GCA code used in the paper.

About matrix normalization

Hi, while reproducing your code I came across something puzzling: neither the original adjacency matrix nor the SVD-reconstructed matrix is normalized, yet in the paper the convolution over the original adjacency matrix is normalized. Why is normalization omitted in the implementation? Furthermore, message passing on an unnormalized graph can be problematic, so why does the paper not normalize the SVD-reconstructed matrix before message passing? One more small question: after SVD reconstruction the matrix contains negative values, but the code does not handle them. Will the negative values affect the results?
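On the first point, the normalization in question is the standard symmetric one; a minimal sketch of applying it to a dense bipartite interaction matrix (illustrative only, not the repository's code):

import torch

def sym_normalize(adj):
    # p(A) = D_u^{-1/2} A D_i^{-1/2} for a dense (n_users, n_items) matrix
    d_u = adj.sum(1).clamp(min=1.0).pow(-0.5)
    d_i = adj.sum(0).clamp(min=1.0).pow(-0.5)
    return d_u.unsqueeze(1) * adj * d_i.unsqueeze(0)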
