jyonn / gnrs

news representation learning

Python 100.00%

gnrs's Introduction

GreenRec: Green AI Benchmarking for News Recommendation

⭐ GreenRec is implemented under the Legommenders framework. All the experiments can be reproduced with Legommenders. This repository will no longer be maintained, thanks!

Environment

pip install -r requirements.txt

Data Processing

Please specify the path to the data in the Python script before running it:

cd process/mind
python processor.py

Configuration

Data

Please refer to config_v2/data/mind.yaml for the data configuration.

Model

We support the following models on both MIND small and large datasets:

	NAML	LSTUR	NRMS	DCN	DIN	BST
ID-based	ID-NAML	ID-LSTUR	ID-NRMS	DCN	DIN	BST
text-based	NAML	LSTUR	NRMS	text-DCN	text-DIN	text-BST
PLMNR	PLMNR-NAML	PLMNR-LSTUR	PLMNR-NRMS	PLMNR-DCN	PLMNR-DIN	PLMNR-BST
BERT	BERT-NAML	BERT-LSTUR	BERT-NRMS	BERT-DCN	BERT-DIN	BERT-BST
MFT	MFT-NAML	MFT-LSTUR	MFT-NRMS	MFT-DCN	MFT-DIN	MFT-BST

Training and Testing

python worker.py \
    --config config/data/mind.yaml \
    --model config/model/nrms.yaml \
    --exp config/exp/tt-nrms.yaml \
    --embed config/embed/null.yaml \
    --version small-v2

gnrs's Issues

TypeError: model.operator.attention_operator.AttentionOperatorConfig() got multiple values for keyword argument 'hidden_size'

I followed the README exactly, but after running worker.py with the same configs, I got the TypeError shown in the traceback below. Could you please tell me how to fix it? @Jyonn Thanks.

Traceback (most recent call last):
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 395, in <module>
    worker = Worker(config=configuration)
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 54, in __init__
    self.config_manager = ConfigManager(
  File "/Users/chuanqijiao/GNRS-master/loader/config_manager.py", line 219, in __init__
    self.recommender = self.recommender_class(
  File "/Users/chuanqijiao/GNRS-master/model/recommenders/base_neg_recommender.py", line 26, in __init__
    super().__init__(**kwargs)
  File "/Users/chuanqijiao/GNRS-master/model/recommenders/base_recommender.py", line 76, in __init__
    self.user_config = self.user_encoder_class.config_class(
TypeError: model.operator.attention_operator.AttentionOperatorConfig() got multiple values for keyword argument 'hidden_size'
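For context, Python raises this exact wording ("multiple values for keyword argument") when a call passes a parameter explicitly by keyword and the same key also arrives through an unpacked `**dict`. The sketch below reproduces that generic failure mode and one way around it. `OperatorConfig`, `yaml_config`, `build_broken`, and `build_fixed` are hypothetical stand-ins, not the repository's code, and whether the collision in base_recommender.py comes from the model YAML is an assumption.

```python
class OperatorConfig:
    """Hypothetical stand-in for an operator config class (not the repo's code)."""

    def __init__(self, hidden_size, num_heads=8):
        self.hidden_size = hidden_size
        self.num_heads = num_heads


# Hypothetical settings, e.g. as parsed from a model YAML file.
yaml_config = {"hidden_size": 256, "num_heads": 4}


def build_broken(config):
    # hidden_size is given explicitly AND is also a key in **config,
    # so the call fails before __init__ ever runs.
    return OperatorConfig(hidden_size=64, **config)


def build_fixed(config):
    # Merge first so each keyword appears once; here the explicit value wins.
    merged = {**config, "hidden_size": 64}
    return OperatorConfig(**merged)


try:
    build_broken(yaml_config)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

fixed = build_fixed(yaml_config)
```

If this is indeed the cause, a quick check is whether `hidden_size` appears both in the user-encoder section of the model config and among the arguments the recommender passes explicitly. Which of the two values should win is a design decision the sketch does not settle.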

How to load bert embeddings

I tried to load BERT embeddings of the news texts with 'bert-token.yaml', using 'dcn.yaml' as the recommender model. After preprocessing the data with bert_processor.py, I realized that it only tokenizes the text. When data.npy is loaded in embedding_loader.py, printing the embedding shows only token IDs and no BERT embeddings. How can I extract the BERT embeddings and load them into the model?

Printing the embedding variable gives:

{'nid': array([0, 1, 2, ..., 65235, 65236, 65237], dtype=object),
 'cat': array([list([9580]), list([2740]), list([2739]), ..., list([2739])], dtype=object),
 'title': array([list([1996, 9639, 3035, 3870, 1010, 3159, 2798, 1010, 1998, 3159, 5170, 8415, 2011]), ..., list([3901, 1997, 4916, 2237, 5998, 2007, 3571, 2044, 9288])], dtype=object),
 'abs': array([list([4497, 1996, 14960, 2015, 1010, 17764, 1010, 1998, 2062, 2008, 1996, 15426, 2064, 1005, 1056, 2444, 2302, 1012]), list([2122, 9428, 19741, 14243, 2024, 3173, 2017, 2067, 1998, 4363, 2017, 2013, 8328, 4667, 2008, 18162, 7579, 6638, 2005, 2204, 1012]), ..., list([])], dtype=object)}

The error:

Traceback (most recent call last):
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 395, in <module>
    worker = Worker(config=configuration)
  File "/Users/chuanqijiao/GNRS-master/worker.py", line 54, in __init__
    self.config_manager = ConfigManager(
  File "/Users/chuanqijiao/GNRS-master/loader/config_manager.py", line 196, in __init__
    self.embedding_manager.load_pretrained_embedding(**Obj.raw(embedding_info))
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_manager.py", line 66, in load_pretrained_embedding
    self.pretrained[vocab_name] = EmbeddingInfo(**kwargs).load()
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_loader.py", line 39, in load
    self.embedding = getter(self.path)
  File "/Users/chuanqijiao/GNRS-master/loader/embedding/embedding_loader.py", line 21, in get_numpy_embedding
    return torch.tensor(embedding, dtype=torch.float32)
TypeError: can't convert np.ndarray of type numpy.object. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
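The error is consistent with the printed value: the file holds variable-length lists of token IDs stored as object arrays, while the loader expects a rectangular numeric matrix (one embedding vector per row). As a rough illustration of the shape problem only, the sketch below pads ragged token lists to a rectangular layout. `pad_token_lists` is a hypothetical helper, not part of this repository, and using 0 as the padding ID is an assumption. The result is still token IDs, not BERT embeddings; producing embeddings would require running the tokens through a BERT model, which is not shown here.

```python
def pad_token_lists(token_lists, pad_id=0, max_len=None):
    """Pad (and truncate) ragged token-ID lists to a rectangular layout.

    Returns the padded rows and the number of real tokens kept in each row.
    Hypothetical helper for illustration; pad_id=0 is an assumption.
    """
    if max_len is None:
        max_len = max((len(tokens) for tokens in token_lists), default=0)

    padded, lengths = [], []
    for tokens in token_lists:
        kept = list(tokens[:max_len])          # truncate rows that are too long
        lengths.append(len(kept))
        padded.append(kept + [pad_id] * (max_len - len(kept)))
    return padded, lengths


# Made-up rows shaped like the 'title' field printed above, including an empty one.
titles = [[1996, 9639, 3035], [3901, 1997], []]
padded, lengths = pad_token_lists(titles)
```

Every row in `padded` has the same length, so it can be converted to an integer tensor, whereas the original object array cannot be converted at all.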

Besides, the configs are a little confusing to me. If I want to load BERT embeddings without using image features, can I use the following configs?
mind.yaml-->dcn/din/bst/pnn.yaml-->tt.yaml-->bert-token.yaml

How to get original text data in the training process

I noticed that the data in a batch is represented as Unitok objects, which are the result of tokenization with Unitok (after processor.py). I'm wondering whether these tokenized results can be mapped back to the original text data. For example, if the nid token 234 maps to N25648 in the original dataset, can the original title then be looked up by the index N25648? Is there a way to do that?
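One way to picture the lookup being asked about is a two-step mapping: token index to original news ID, then news ID to title from the raw MIND news.tsv (tab-separated, with the news ID in the first column and the title in the fourth). The sketch below does not use the Unitok API. It assumes the nid vocabulary can be obtained as an ordered list in which position equals token index, and `load_titles`, `title_for_token`, and the sample rows are hypothetical.

```python
import csv
import io


def load_titles(news_tsv):
    """Map news ID -> title from a MIND-style news.tsv file object."""
    titles = {}
    # QUOTE_NONE keeps quotation marks inside titles from confusing the parser.
    reader = csv.reader(news_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        titles[row[0]] = row[3]
    return titles


def title_for_token(token, nid_vocab, titles):
    """Resolve a token index to (original news ID, original title)."""
    nid = nid_vocab[token]
    return nid, titles[nid]


# Assumed vocabulary layout: position in the list == token index (made-up IDs).
nid_vocab = ["N55528", "N19639", "N25648"]

# Made-up rows in the news.tsv column order:
# id, category, subcategory, title, abstract, url, title entities, abstract entities
sample_news = io.StringIO(
    "N55528\tlifestyle\tlifestyleroyals\tTitle A\tAbstract A\turl-a\t[]\t[]\n"
    "N19639\thealth\tweightloss\tTitle B\tAbstract B\turl-b\t[]\t[]\n"
    "N25648\tnews\tnewsworld\tTitle \"C\"\tAbstract C\turl-c\t[]\t[]\n"
)

titles = load_titles(sample_news)
resolved = title_for_token(2, nid_vocab, titles)
```

With a real dataset, `sample_news` would be replaced by the opened news.tsv file, and `nid_vocab` by whatever vocabulary the tokenization step saved, provided it preserves the index order assumed here.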