Comments (12)
from inttower.
thank you again!
Hi, when I run train_taobao_IntTower.py on the dataset downloaded from Alibaba ads, I get NaN values after the first training step. I also tried clip_grad_value_, but it did not help.
In addition, when I remove total_loss.backward(), there are no NaNs in the forward pass, so I think the dataset is correct.
Did you run into the same problem with train_taobao_IntTower.py, or could you give some advice on it?
Thank you very much!
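One thing worth checking first: clip_grad_value_ / clip_grad_norm_ only bound *finite* gradients; they cannot repair gradients that are already NaN/Inf, so clipping alone won't cure a NaN that originates in the forward pass or the loss (torch.autograd.set_detect_anomaly(True) can point at the offending op). A minimal sketch, with a toy model standing in for IntTower and a hypothetical grads_are_finite helper:

```python
import torch

def grads_are_finite(model, max_norm=1.0):
    # Clip in place, then report whether the pre-clip gradient norm was finite.
    # Clipping bounds *finite* gradients; it cannot repair NaN/Inf ones,
    # which would explain why clip_grad_value_ alone did not fix the problem.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    return bool(torch.isfinite(total_norm))

# Toy stand-in for IntTower with a deliberately overflowing loss.
model = torch.nn.Linear(4, 1)
loss = (model(torch.randn(8, 4)) * 1e20).pow(2).mean()  # overflows to inf in fp32
loss.backward()
ok = grads_are_finite(model)  # False: the gradients are already non-finite
```

If this check fails on the first batch, the NaN is being produced before the clipping step, e.g. inside the contrastive loss.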
Maybe you can adjust the contrastive loss, or delete it; it is not stable. Second, reduce the learning rate.
Please let me know if there is any result.
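If deleting the contrastive term recovers training, a middle ground is keeping a down-weighted, numerically hardened version of it. The sketch below is a generic InfoNCE-style loss, not the repo's exact implementation: letting cross_entropy compute the log-softmax over normalized, temperature-scaled logits avoids the exp/log overflow that a hand-rolled -log(exp(pos)/Σexp) can hit, which is a common NaN source.

```python
import torch
import torch.nn.functional as F

def info_nce(user_emb, item_emb, temperature=0.1):
    # Generic InfoNCE sketch (an assumption, not IntTower's exact loss).
    u = F.normalize(user_emb, dim=-1)  # normalizing keeps the logits bounded
    v = F.normalize(item_emb, dim=-1)
    logits = u @ v.t() / temperature   # [B, B]; the diagonal holds positive pairs
    labels = torch.arange(u.size(0))
    # cross_entropy applies log-softmax internally, which is the stable path
    return F.cross_entropy(logits, labels)

u = torch.randn(16, 32) * 1e4          # large activations; the loss stays finite
v = torch.randn(16, 32) * 1e4
loss = info_nce(u, v)
```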
> Maybe you can adjust the contrastive loss, or delete it; it is not stable. Second, reduce the learning rate. Please let me know if there is any result.
Thank you for your advice.
After deleting the contrastive loss, there are no NaNs during training.
But the performance is poorer than the result reported in the paper:
test LogLoss 0.2265
test AUC 0.6684
Could you give some advice on reproducing the results of the paper?
You could try adding more user and item heads and a larger FE-Block field dim; try this code:

```python
model = IntTower(user_feature_columns, item_feature_columns, field_dim=64, task='binary', dnn_dropout=dropout,
                 device=device, user_head=16, item_head=16, user_filed_size=9, item_filed_size=6)
```

Please let me know your result.
I tried this code:

```python
model = IntTower(user_feature_columns, item_feature_columns, field_dim=64, task='binary', dnn_dropout=dropout,
                 device=device, user_head=16, item_head=16, user_filed_size=9, item_filed_size=6)
```

and got the following result:
test LogLoss 0.3355
test AUC 0.6387
The performance is even worse.
This is interesting; maybe it's overfitting. Did you try reducing the learning rate? Could you report your batch size and hardware (e.g. GPU) information? And does this problem exist on the Amazon and MovieLens datasets?
I'll try it later; I will probably have my results for you on Monday.
> This is interesting; maybe it's overfitting. Did you try reducing the learning rate? Could you report your batch size and hardware (e.g. GPU) information? And does this problem exist on the Amazon and MovieLens datasets?
I didn't modify any parameters for training; I simply used the script https://github.com/archersama/IntTower/blob/main/train_taobao_IntTower.py on a GPU (V100).
I deleted the contrastive loss because of the NaN problem.
This problem doesn't exist on the Amazon and MovieLens datasets.
> I'll try it later; I will probably have my results for you on Monday.
Thank you very much. Please let me know your result
You can try the code below to train the IntTower model, but we think the user_hist feature may cause data leakage, so we don't recommend using it. We will update the Taobao results in the new version of our paper. IntTower still outperforms DSSM even without this feature.
```python
import numpy as np
import pandas as pd
import torch
import torchvision
import random
import time
from torch.utils.tensorboard import SummaryWriter
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from preprocessing.inputs import SparseFeat, DenseFeat, VarLenSparseFeat
from model.dssm import DSSM
from model.wdm import WideDeep
from model.autoint import AutoInt
from model.dcn import DCN
from model.IntTower import IntTower
from deepctr_torch.callbacks import EarlyStopping, ModelCheckpoint


def optimiz_memory(raw_data):
    # Downcast int/float columns to save memory.
    optimized_g2 = raw_data.copy()
    g2_int = raw_data.select_dtypes(include=['int'])
    converted_int = g2_int.apply(pd.to_numeric, downcast='unsigned')
    optimized_g2[converted_int.columns] = converted_int
    g2_float = raw_data.select_dtypes(include=['float'])
    converted_float = g2_float.apply(pd.to_numeric, downcast='float')
    optimized_g2[converted_float.columns] = converted_float
    return optimized_g2


def optimiz_memory_profile(raw_data):
    # Downcast ints and convert low-cardinality object columns to 'category'.
    optimized_gl = raw_data.copy()
    gl_int = raw_data.select_dtypes(include=['int'])
    converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')
    optimized_gl[converted_int.columns] = converted_int
    gl_obj = raw_data.select_dtypes(include=['object']).copy()
    converted_obj = pd.DataFrame()
    for col in gl_obj.columns:
        num_unique_values = len(gl_obj[col].unique())
        num_total_values = len(gl_obj[col])
        if num_unique_values / num_total_values < 0.5:
            converted_obj.loc[:, col] = gl_obj[col].astype('category')
        else:
            converted_obj.loc[:, col] = gl_obj[col]
    optimized_gl[converted_obj.columns] = converted_obj
    return optimized_gl


def data_process(profile_path, ad_path, user_path):
    profile_data = pd.read_csv(profile_path)
    ad_data = pd.read_csv(ad_path)
    user_data = pd.read_csv(user_path)
    profile_data = optimiz_memory_profile(profile_data)
    ad_data = optimiz_memory(ad_data)
    user_data = optimiz_memory(user_data)
    profile_data.rename(columns={'user': 'userid'}, inplace=True)
    user_data.rename(columns={'new_user_class_level ': 'new_user_class_level'}, inplace=True)
    df1 = profile_data.merge(user_data, on="userid")
    data = df1.merge(ad_data, on="adgroup_id")
    data['brand'] = data['brand'].fillna('-1').astype('int32')
    # data['age_level'] = data['age_level'].fillna('-1')
    # data['cms_segid'] = data['cms_segid'].fillna('-1')
    # data['cms_group_id'] = data['cms_group_id'].fillna('-1')
    # data['final_gender_code'] = data['final_gender_code'].fillna('-1')
    data['pvalue_level'] = data['pvalue_level'].fillna('-1').astype('int32')
    # data['shopping_level'] = data['shopping_level'].fillna('-1')
    # data['occupation'] = data['occupation'].fillna('-1')
    data['new_user_class_level'] = data['new_user_class_level'].fillna('-1').astype('int32')
    data = data.sort_values(by='time_stamp', ascending=True)
    return data


def get_user_feature(data):
    # Build the pipe-separated click history (user_hist) for each user.
    data_group = data[data['clk'] == 1]
    data_group = data_group[['userid', 'adgroup_id']].groupby('userid').agg(list).reset_index()
    data_group['user_hist'] = data_group['adgroup_id'].apply(lambda x: '|'.join([str(i) for i in x]))
    data = pd.merge(data_group.drop('adgroup_id', axis=1), data, on='userid')
    data_group = data[['userid', 'clk']].groupby('userid').agg('mean').reset_index()
    data_group.rename(columns={'overall': 'user_mean_rating'}, inplace=True)
    data = pd.merge(data_group, data, on='userid')
    return data


def get_var_feature(data, col):
    key2index = {}

    def split(x):
        key_ans = x.split('|')
        for key in key_ans:
            if key not in key2index:
                # Notice: input value 0 is a special "padding",
                # so we do not use 0 to encode valid features for sequence input
                key2index[key] = len(key2index) + 1
        return list(map(lambda x: key2index[x], key_ans))

    var_feature = list(map(split, data[col].values))
    var_feature_length = np.array(list(map(len, var_feature)))
    max_len = max(var_feature_length)
    var_feature = pad_sequences(var_feature, maxlen=max_len, padding='post')
    return key2index, var_feature, max_len


def get_test_var_feature(data, col, key2index, max_len):
    print("user_hist_list: \n")

    def split(x):
        key_ans = x.split('|')
        for key in key_ans:
            if key not in key2index:
                # Notice: input value 0 is a special "padding",
                # so we do not use 0 to encode valid features for sequence input
                key2index[key] = len(key2index) + 1
        return list(map(lambda x: key2index[x], key_ans))

    test_hist = list(map(split, data[col].values))
    test_hist = pad_sequences(test_hist, maxlen=max_len, padding='post')
    return test_hist


def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True


if __name__ == "__main__":
    embedding_dim = 32
    epoch = 1
    batch_size = 4096
    dropout = 0.5
    seed = 1023
    lr = 0.0001

    setup_seed(seed)
    profile_path = './data/taobao/raw_sample.csv'
    ad_path = './data/taobao/ad_feature.csv'
    user_path = './data/taobao/user_profile.csv'
    data = data_process(profile_path, ad_path, user_path)
    data = get_user_feature(data)

    sparse_features = ['userid', 'adgroup_id', 'pid', 'cms_segid', 'cms_group_id',
                       'final_gender_code', 'shopping_level', 'occupation', 'cate_id', 'campaign_id',
                       'customer', 'age_level', 'brand', 'pvalue_level', 'new_user_class_level']
    dense_features = ['price']
    user_sparse_features, user_dense_features = ['userid', 'cms_segid', 'cms_group_id', 'final_gender_code',
                                                 'age_level', 'pvalue_level', 'shopping_level', 'occupation',
                                                 'new_user_class_level'], []
    item_sparse_features, item_dense_features = ['adgroup_id', 'cate_id', 'campaign_id', 'customer',
                                                 'brand', 'pid'], ['price']
    target = ['clk_y']
    item_num = len(data['adgroup_id'].value_counts()) + 5

    # 1. Label-encode the sparse features and scale the dense features
    for feat in sparse_features:
        lbe = LabelEncoder()
        lbe.fit(data[feat])
        data[feat] = lbe.transform(data[feat])
    mms = MinMaxScaler(feature_range=(0, 1))
    mms.fit(data[dense_features])
    data[dense_features] = mms.transform(data[dense_features])
    train, test = train_test_split(data, test_size=0.2)

    # 2. Preprocess the sequence feature
    user_key2index, train_user_hist, user_maxlen = get_var_feature(train, 'user_hist')
    user_varlen_feature_columns = [VarLenSparseFeat(SparseFeat('user_hist', vocabulary_size=item_num, embedding_dim=32),
                                                    maxlen=user_maxlen, combiner='mean', length_name=None)]
    user_feature_columns = [SparseFeat(feat, data[feat].nunique(), embedding_dim=embedding_dim)
                            for feat in user_sparse_features] + \
                           [DenseFeat(feat, 1) for feat in user_dense_features]
    item_feature_columns = [SparseFeat(feat, data[feat].nunique(), embedding_dim=embedding_dim)
                            for feat in item_sparse_features] + \
                           [DenseFeat(feat, 1) for feat in item_dense_features]
    user_feature_columns += user_varlen_feature_columns

    train_model_input = {name: train[name] for name in sparse_features + dense_features}
    train_model_input["user_hist"] = train_user_hist

    # 4. Define the model, then train, predict and evaluate
    device = 'cpu'
    use_cuda = True
    if use_cuda and torch.cuda.is_available():
        print('cuda ready...')
        device = 'cuda'
    es = EarlyStopping(monitor='val_auc', min_delta=0, verbose=1,
                       patience=3, mode='max', baseline=None)
    mdckpt = ModelCheckpoint(filepath='fe_model_2.ckpt', monitor='val_auc',
                             mode='max', verbose=1, save_best_only=True, save_weights_only=True)
    model = IntTower(user_feature_columns, item_feature_columns, field_dim=16, task='binary', dnn_dropout=dropout,
                     device=device, user_head=4, item_head=4, user_filed_size=10, item_filed_size=6)
    model.compile("adam", "binary_crossentropy", metrics=['auc', 'accuracy', 'logloss'], lr=lr)
    model.fit(train_model_input, train[target].values, batch_size=batch_size,
              epochs=epoch, verbose=2, validation_split=0.2, callbacks=[es, mdckpt])

    # 5. Preprocess the test data and evaluate the best checkpoint
    model.load_state_dict(torch.load('fe_model_2.ckpt'))
    # BatchNormalization and Dropout are disabled in eval mode
    model.eval()
    test_user_hist = get_test_var_feature(test, 'user_hist', user_key2index, user_maxlen)
    test_model_input = {name: test[name] for name in sparse_features + dense_features}
    test_model_input["user_hist"] = test_user_hist
    pred_ts = model.predict(test_model_input, batch_size=500)
    print("test LogLoss", round(log_loss(test[target].values, pred_ts), 4))
    print("test AUC", round(roc_auc_score(test[target].values, pred_ts), 4))
```
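On the user_hist leakage mentioned above: the script builds user_hist from every click in the full dataset and then calls train_test_split, which splits randomly, so clicks that end up in the test set can leak into the training history. A hedged sketch of a chronological split instead (time_split is a hypothetical helper, not part of the repo):

```python
import pandas as pd

def time_split(data, ts_col='time_stamp', test_frac=0.2):
    # Split chronologically so the click history only ever contains past events;
    # a random split lets future (test-period) clicks leak into user_hist.
    data = data.sort_values(ts_col)
    cut = int(len(data) * (1 - test_frac))
    return data.iloc[:cut], data.iloc[cut:]

# Tiny demo frame standing in for the merged Taobao data.
df = pd.DataFrame({'time_stamp': [3, 1, 2, 5, 4], 'clk': [0, 1, 0, 1, 0]})
train, test = time_split(df)  # train holds the 4 earliest rows, test the latest
```

With this split, get_user_feature would be applied to the train portion only, and the test-time history looked up from it.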
Related Issues (8)
- How to deploy in real recommender systems HOT 3
- Could you share the serving code? HOT 2
- is CIR contrastive loss removed? HOT 4
- Computing similarity via matrix multiplication HOT 1
- Fusing multiple objectives at serving time HOT 1
- CUDA out of memory HOT 3
- Question about serving the model HOT 2