
Comments (12)

archersama commented on July 22, 2024

HqWu-HITCS commented on July 22, 2024

Thank you again!

HqWu-HITCS commented on July 22, 2024

Hi, when I run train_taobao_IntTower.py on the dataset downloaded from Alibaba ads, I hit a NaN problem after the first training step. I also tried clip_grad_value, but it did not help.
In addition, when I remove total_loss.backward, there is no NaN in the forward pass, so I think the dataset is correct.
Did you meet the same problem when running train_taobao_IntTower.py, or could you give some advice on this problem?
Thank you very much!
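A minimal sketch of how one might localize such a NaN with stock PyTorch utilities; the tiny model, loss, and data below are placeholders, not IntTower's API:

```python
import torch

# Flag the first backward op that produces NaN/Inf (slow; debug only).
torch.autograd.set_detect_anomaly(True)

model = torch.nn.Linear(8, 1)  # placeholder for the real two-tower model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

x, y = torch.randn(32, 8), torch.rand(32, 1)
loss = torch.nn.functional.binary_cross_entropy_with_logits(model(x), y)
loss.backward()

# Norm-based clipping is often more effective than clip_grad_value_.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Check gradients for non-finite values before stepping.
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f"non-finite grad in {name}")
opt.step()
```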

archersama commented on July 22, 2024

Maybe you can adjust the contrastive loss, or delete it; it's not stable. Second, reduce the learning rate.
Please let me know if there is any result.
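If one wants to keep the contrastive term rather than delete it, a common stabilization is to compute an InfoNCE-style loss through cross_entropy (which uses log_softmax internally) with normalized embeddings and a temperature, avoiding the overflow a hand-rolled exp/sum can produce. A sketch under the assumption that the contrastive loss operates on user/item embedding pairs; the exact form in IntTower may differ:

```python
import torch
import torch.nn.functional as F

def info_nce(user_emb, item_emb, temperature=0.1):
    """In-batch InfoNCE: the i-th user's positive is the i-th item."""
    user_emb = F.normalize(user_emb, dim=-1)  # unit norm bounds the logits
    item_emb = F.normalize(item_emb, dim=-1)
    logits = user_emb @ item_emb.t() / temperature  # [B, B] similarity matrix
    labels = torch.arange(user_emb.size(0), device=user_emb.device)
    # cross_entropy applies log_softmax internally -> numerically stable
    return F.cross_entropy(logits, labels)
```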

HqWu-HITCS commented on July 22, 2024

> Maybe you can adjust the contrastive loss, or delete it; it's not stable. Second, reduce the learning rate. Please let me know if there is any result.

Thank you for your advice.
After deleting the contrastive loss, there is no NaN during training.
But the performance is poorer than the result reported in the paper:
test LogLoss 0.2265
test AUC 0.6684
Could you give some advice on reproducing the results of the paper?
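A middle ground between keeping and deleting the term is to down-weight it and ramp it in only after the model has stabilized; a hedged sketch, where bce_loss, contrast_loss, and the weight are placeholders for whatever train_taobao_IntTower.py actually computes:

```python
# Linear warm-up for the contrastive weight: off at step 0,
# full (hypothetical) weight after warmup_steps.
warmup_steps = 1000
max_weight = 0.1  # hypothetical final weight for the contrastive term

def contrastive_weight(step):
    return max_weight * min(1.0, step / warmup_steps)

# inside the training loop (placeholder names):
# total_loss = bce_loss + contrastive_weight(global_step) * contrast_loss
```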

archersama commented on July 22, 2024

You could try increasing user_head and item_head, and the FE-Block field dim; try this code:

```python
model = IntTower(user_feature_columns, item_feature_columns, field_dim=64, task='binary', dnn_dropout=dropout,
                 device=device, user_head=16, item_head=16, user_filed_size=9, item_filed_size=6)
```

Please let me know your result.

HqWu-HITCS commented on July 22, 2024

I tried this code:

```python
model = IntTower(user_feature_columns, item_feature_columns, field_dim=64, task='binary', dnn_dropout=dropout,
                 device=device, user_head=16, item_head=16, user_filed_size=9, item_filed_size=6)
```

and got the following result:
test LogLoss 0.3355
test AUC 0.6387
The performance is even poorer.

archersama commented on July 22, 2024

This is interesting; maybe it's overfitting. Did you try reducing the learning rate? Could you report your batch size and hardware, e.g. GPU information? And does this problem exist on the Amazon and MovieLens datasets?

archersama commented on July 22, 2024

I'll try it later; maybe on Monday I will give you my results.

HqWu-HITCS commented on July 22, 2024

> This is interesting; maybe it's overfitting. Did you try reducing the learning rate? Could you report your batch size and hardware, e.g. GPU information? And does this problem exist on the Amazon and MovieLens datasets?

I didn't modify any parameters for training; I simply ran the script https://github.com/archersama/IntTower/blob/main/train_taobao_IntTower.py on a GPU (V100).
I deleted the contrastive loss because of the NaN problem.
This problem doesn't exist on the Amazon and MovieLens datasets.
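Since the NaN appears only on the Taobao data, it may be worth sanity-checking the raw features before training; the heavily long-tailed price column is one plausible suspect. A quick check using the column names from the script below, assuming `data` is the merged frame produced by data_process():

```python
import numpy as np

# Inspect the dense feature(s) for NaN/Inf and extreme ranges.
for col in ['price']:
    vals = data[col].astype('float64')
    print(col, 'NaN:', vals.isna().sum(), 'Inf:', np.isinf(vals).sum(),
          'min:', vals.min(), 'max:', vals.max())

# A log transform before MinMaxScaler tames extreme prices
# (an assumption about the fix, not something the original script does).
data['price'] = np.log1p(data['price'].clip(lower=0))
```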

HqWu-HITCS commented on July 22, 2024

> I'll try it later; maybe on Monday I will give you my results.

Thank you very much. Please let me know your result.

archersama commented on July 22, 2024

You can try the code below to train the IntTower model, but we think the user_hist feature may cause data leakage, so we don't suggest using this feature. We will update the Taobao results in the new version of our paper. IntTower still outperforms DSSM even without this feature.

```python
import numpy as np
import pandas as pd
import torch
import random
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from preprocessing.inputs import SparseFeat, DenseFeat, VarLenSparseFeat
from model.IntTower import IntTower
from deepctr_torch.callbacks import EarlyStopping, ModelCheckpoint


def optimiz_memory(raw_data):
    # Downcast int/float columns to smaller dtypes to reduce memory usage.
    optimized_g2 = raw_data.copy()

    g2_int = raw_data.select_dtypes(include=['int'])
    converted_int = g2_int.apply(pd.to_numeric, downcast='unsigned')
    optimized_g2[converted_int.columns] = converted_int

    g2_float = raw_data.select_dtypes(include=['float'])
    converted_float = g2_float.apply(pd.to_numeric, downcast='float')
    optimized_g2[converted_float.columns] = converted_float
    return optimized_g2


def optimiz_memory_profile(raw_data):
    # Same idea, plus converting low-cardinality object columns to category.
    optimized_gl = raw_data.copy()

    gl_int = raw_data.select_dtypes(include=['int'])
    converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')
    optimized_gl[converted_int.columns] = converted_int

    gl_obj = raw_data.select_dtypes(include=['object']).copy()
    converted_obj = pd.DataFrame()
    for col in gl_obj.columns:
        num_unique_values = len(gl_obj[col].unique())
        num_total_values = len(gl_obj[col])
        if num_unique_values / num_total_values < 0.5:
            converted_obj.loc[:, col] = gl_obj[col].astype('category')
        else:
            converted_obj.loc[:, col] = gl_obj[col]
    optimized_gl[converted_obj.columns] = converted_obj
    return optimized_gl


def data_process(profile_path, ad_path, user_path):
    profile_data = pd.read_csv(profile_path)
    ad_data = pd.read_csv(ad_path)
    user_data = pd.read_csv(user_path)
    profile_data = optimiz_memory_profile(profile_data)
    ad_data = optimiz_memory(ad_data)
    user_data = optimiz_memory(user_data)
    profile_data.rename(columns={'user': 'userid'}, inplace=True)
    user_data.rename(columns={'new_user_class_level ': 'new_user_class_level'}, inplace=True)
    df1 = profile_data.merge(user_data, on="userid")
    data = df1.merge(ad_data, on="adgroup_id")
    data['brand'] = data['brand'].fillna('-1').astype('int32')
    data['pvalue_level'] = data['pvalue_level'].fillna('-1').astype('int32')
    data['new_user_class_level'] = data['new_user_class_level'].fillna('-1').astype('int32')
    data = data.sort_values(by='time_stamp', ascending=True)
    return data


def get_user_feature(data):
    # Collect each user's clicked ads into a '|'-joined history string.
    data_group = data[data['clk'] == 1]
    data_group = data_group[['userid', 'adgroup_id']].groupby('userid').agg(list).reset_index()
    data_group['user_hist'] = data_group['adgroup_id'].apply(lambda x: '|'.join([str(i) for i in x]))
    data = pd.merge(data_group.drop('adgroup_id', axis=1), data, on='userid')
    # Mean click rate per user; the merge below duplicates 'clk' into
    # clk_x / clk_y, which is why the label used later is 'clk_y'.
    data_group = data[['userid', 'clk']].groupby('userid').agg('mean').reset_index()
    # 'overall' is a leftover from the Amazon script; this rename is a no-op here.
    data_group.rename(columns={'overall': 'user_mean_rating'}, inplace=True)
    data = pd.merge(data_group, data, on='userid')
    return data


def get_var_feature(data, col):
    key2index = {}

    def split(x):
        key_ans = x.split('|')
        for key in key_ans:
            if key not in key2index:
                # Notice: input value 0 is a special "padding", so we do not
                # use 0 to encode valid features for sequence input.
                key2index[key] = len(key2index) + 1
        return list(map(lambda k: key2index[k], key_ans))

    var_feature = list(map(split, data[col].values))
    var_feature_length = np.array(list(map(len, var_feature)))
    max_len = max(var_feature_length)
    var_feature = pad_sequences(var_feature, maxlen=max_len, padding='post')
    return key2index, var_feature, max_len


def get_test_var_feature(data, col, key2index, max_len):
    print("user_hist_list: \n")

    def split(x):
        key_ans = x.split('|')
        for key in key_ans:
            if key not in key2index:
                # Notice: input value 0 is a special "padding", so we do not
                # use 0 to encode valid features for sequence input.
                key2index[key] = len(key2index) + 1
        return list(map(lambda k: key2index[k], key_ans))

    test_hist = list(map(split, data[col].values))
    test_hist = pad_sequences(test_hist, maxlen=max_len, padding='post')
    return test_hist


def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True


if __name__ == "__main__":
    embedding_dim = 32
    epoch = 1
    batch_size = 4096
    dropout = 0.5
    seed = 1023
    lr = 0.0001

    setup_seed(seed)

    profile_path = './data/taobao/raw_sample.csv'
    ad_path = './data/taobao/ad_feature.csv'
    user_path = './data/taobao/user_profile.csv'

    data = data_process(profile_path, ad_path, user_path)
    data = get_user_feature(data)

    sparse_features = ['userid', 'adgroup_id', 'pid', 'cms_segid', 'cms_group_id',
                       'final_gender_code', 'shopping_level', 'occupation', 'cate_id', 'campaign_id',
                       'customer', 'age_level', 'brand', 'pvalue_level', 'new_user_class_level']
    dense_features = ['price']

    user_sparse_features, user_dense_features = ['userid', 'cms_segid', 'cms_group_id', 'final_gender_code',
                                                 'age_level', 'pvalue_level', 'shopping_level', 'occupation',
                                                 'new_user_class_level'], []
    item_sparse_features, item_dense_features = ['adgroup_id', 'cate_id', 'campaign_id', 'customer',
                                                 'brand', 'pid'], ['price']
    target = ['clk_y']
    item_num = len(data['adgroup_id'].value_counts()) + 5

    # 1. Label-encode the sparse features and min-max scale the dense features.
    for feat in sparse_features:
        lbe = LabelEncoder()
        lbe.fit(data[feat])
        data[feat] = lbe.transform(data[feat])
    mms = MinMaxScaler(feature_range=(0, 1))
    mms.fit(data[dense_features])
    data[dense_features] = mms.transform(data[dense_features])

    train, test = train_test_split(data, test_size=0.2)

    # 2. Preprocess the variable-length sequence feature.
    user_key2index, train_user_hist, user_maxlen = get_var_feature(train, 'user_hist')
    user_varlen_feature_columns = [VarLenSparseFeat(SparseFeat('user_hist', vocabulary_size=item_num, embedding_dim=32),
                                                    maxlen=user_maxlen, combiner='mean', length_name=None)]

    user_feature_columns = [SparseFeat(feat, data[feat].nunique(), embedding_dim=embedding_dim)
                            for feat in user_sparse_features] + \
                           [DenseFeat(feat, 1) for feat in user_dense_features]
    item_feature_columns = [SparseFeat(feat, data[feat].nunique(), embedding_dim=embedding_dim)
                            for feat in item_sparse_features] + \
                           [DenseFeat(feat, 1) for feat in item_dense_features]

    user_feature_columns += user_varlen_feature_columns

    train_model_input = {name: train[name] for name in sparse_features + dense_features}
    train_model_input["user_hist"] = train_user_hist

    # 3. Define the model, then train.
    device = 'cpu'
    use_cuda = True
    if use_cuda and torch.cuda.is_available():
        print('cuda ready...')
        device = 'cuda'

    es = EarlyStopping(monitor='val_auc', min_delta=0, verbose=1,
                       patience=3, mode='max', baseline=None)
    mdckpt = ModelCheckpoint(filepath='fe_model_2.ckpt', monitor='val_auc',
                             mode='max', verbose=1, save_best_only=True, save_weights_only=True)
    model = IntTower(user_feature_columns, item_feature_columns, field_dim=16, task='binary', dnn_dropout=dropout,
                     device=device, user_head=4, item_head=4, user_filed_size=10, item_filed_size=6)

    model.compile("adam", "binary_crossentropy", metrics=['auc', 'accuracy', 'logloss'], lr=lr)

    model.fit(train_model_input, train[target].values, batch_size=batch_size,
              epochs=epoch, verbose=2, validation_split=0.2, callbacks=[es, mdckpt])

    # 4. Preprocess the test data and evaluate the best checkpoint.
    model.load_state_dict(torch.load('fe_model_2.ckpt'))
    # BatchNormalization and Dropout are disabled in eval mode.
    model.eval()

    test_user_hist = get_test_var_feature(test, 'user_hist', user_key2index, user_maxlen)
    test_model_input = {name: test[name] for name in sparse_features + dense_features}
    test_model_input["user_hist"] = test_user_hist

    pred_ts = model.predict(test_model_input, batch_size=500)

    print("test LogLoss", round(log_loss(test[target].values, pred_ts), 4))
    print("test AUC", round(roc_auc_score(test[target].values, pred_ts), 4))
```
