
Comments (12)

archersama commented on July 22, 2024

HqWu-HITCS commented on July 22, 2024

Thank you again!

HqWu-HITCS commented on July 22, 2024

Hi, when I run train_taobao_IntTower.py on the dataset downloaded from Alibaba ads, I hit a NaN problem after the first training step. I also tried clip_grad_value, but it did not help.
In addition, when I remove total_loss.backward, there is no NaN in the forward pass, so I think the dataset is correct.
Did you meet the same problem when running train_taobao_IntTower.py, or could you give some advice on this problem?
Thank you very much!
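A minimal sketch of how one might localize such a NaN with stock PyTorch utilities; the tiny model, loss, and data below are placeholders, not IntTower's API:

```python
import torch

# Flag the first backward op that produces NaN/Inf (slow; debug only).
torch.autograd.set_detect_anomaly(True)

model = torch.nn.Linear(8, 1)  # placeholder for the real two-tower model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

x, y = torch.randn(32, 8), torch.rand(32, 1)
loss = torch.nn.functional.binary_cross_entropy_with_logits(model(x), y)
loss.backward()

# Norm-based clipping is often more effective than clip_grad_value_.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Check gradients for non-finite values before stepping.
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f"non-finite grad in {name}")
opt.step()
```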

archersama commented on July 22, 2024

Maybe you can adjust the contrastive loss, or delete it; it's not stable. Second, reduce the learning rate.
Please let me know if there is any result.
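If one wants to keep the contrastive term rather than delete it, a common stabilization is to compute an InfoNCE-style loss through cross_entropy (which uses log_softmax internally) with normalized embeddings and a temperature, avoiding the overflow a hand-rolled exp/sum can produce. A sketch under the assumption that the contrastive loss operates on user/item embedding pairs; the exact form in IntTower may differ:

```python
import torch
import torch.nn.functional as F

def info_nce(user_emb, item_emb, temperature=0.1):
    """In-batch InfoNCE: the i-th user's positive is the i-th item."""
    user_emb = F.normalize(user_emb, dim=-1)  # unit norm bounds the logits
    item_emb = F.normalize(item_emb, dim=-1)
    logits = user_emb @ item_emb.t() / temperature  # [B, B] similarity matrix
    labels = torch.arange(user_emb.size(0), device=user_emb.device)
    # cross_entropy applies log_softmax internally -> numerically stable
    return F.cross_entropy(logits, labels)
```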

HqWu-HITCS commented on July 22, 2024

> Maybe you can adjust the contrastive loss, or delete it; it's not stable. Second, reduce the learning rate. Please let me know if there is any result.

Thank you for your advice.
After deleting the contrastive loss, there is no NaN during training.
But the performance is poorer than the result reported in the paper:
test LogLoss 0.2265
test AUC 0.6684
Could you give some advice on reproducing the results of the paper?
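A middle ground between keeping and deleting the term is to down-weight it and ramp it in only after the model has stabilized; a hedged sketch, where bce_loss, contrast_loss, and the weight are placeholders for whatever train_taobao_IntTower.py actually computes:

```python
# Linear warm-up for the contrastive weight: off at step 0,
# full (hypothetical) weight after warmup_steps.
warmup_steps = 1000
max_weight = 0.1  # hypothetical final weight for the contrastive term

def contrastive_weight(step):
    return max_weight * min(1.0, step / warmup_steps)

# inside the training loop (placeholder names):
# total_loss = bce_loss + contrastive_weight(global_step) * contrast_loss
```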

archersama commented on July 22, 2024

You could try increasing user_head and item_head, and the FE-Block field dim; try this code:

```python
model = IntTower(user_feature_columns, item_feature_columns, field_dim=64, task='binary', dnn_dropout=dropout,
                 device=device, user_head=16, item_head=16, user_filed_size=9, item_filed_size=6)
```

Please let me know your result.

HqWu-HITCS commented on July 22, 2024

I tried this code:

```python
model = IntTower(user_feature_columns, item_feature_columns, field_dim=64, task='binary', dnn_dropout=dropout,
                 device=device, user_head=16, item_head=16, user_filed_size=9, item_filed_size=6)
```

and got the following result:
test LogLoss 0.3355
test AUC 0.6387
The performance is even poorer.

archersama commented on July 22, 2024

This is interesting; maybe it's overfitting. Did you try reducing the learning rate? Could you report your batch size and hardware, e.g. GPU information? And does this problem exist on the Amazon and MovieLens datasets?

archersama commented on July 22, 2024

I'll try it later; maybe on Monday I will give you my results.

HqWu-HITCS commented on July 22, 2024

> This is interesting; maybe it's overfitting. Did you try reducing the learning rate? Could you report your batch size and hardware, e.g. GPU information? And does this problem exist on the Amazon and MovieLens datasets?

I didn't modify any parameters for training; I simply ran the script https://github.com/archersama/IntTower/blob/main/train_taobao_IntTower.py on a GPU (V100).
I deleted the contrastive loss because of the NaN problem.
This problem doesn't exist on the Amazon and MovieLens datasets.
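Since the NaN appears only on the Taobao data, it may be worth sanity-checking the raw features before training; the heavily long-tailed price column is one plausible suspect. A quick check using the column names from the script below, assuming `data` is the merged frame produced by data_process():

```python
import numpy as np

# Inspect the dense feature(s) for NaN/Inf and extreme ranges.
for col in ['price']:
    vals = data[col].astype('float64')
    print(col, 'NaN:', vals.isna().sum(), 'Inf:', np.isinf(vals).sum(),
          'min:', vals.min(), 'max:', vals.max())

# A log transform before MinMaxScaler tames extreme prices
# (an assumption about the fix, not something the original script does).
data['price'] = np.log1p(data['price'].clip(lower=0))
```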

HqWu-HITCS commented on July 22, 2024

> I'll try it later; maybe on Monday I will give you my results.

Thank you very much. Please let me know your result.

archersama commented on July 22, 2024

You can try the code below to train the IntTower model, but we think the user_hist feature may cause data leakage, so we don't suggest using this feature. We will update the Taobao results in the new version of our paper. IntTower still outperforms DSSM even without this feature.

```python
import numpy as np
import pandas as pd
import torch
import random
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from preprocessing.inputs import SparseFeat, DenseFeat, VarLenSparseFeat
from model.IntTower import IntTower
from deepctr_torch.callbacks import EarlyStopping, ModelCheckpoint


def optimiz_memory(raw_data):
    # Downcast int/float columns to smaller dtypes to reduce memory usage.
    optimized_g2 = raw_data.copy()

    g2_int = raw_data.select_dtypes(include=['int'])
    converted_int = g2_int.apply(pd.to_numeric, downcast='unsigned')
    optimized_g2[converted_int.columns] = converted_int

    g2_float = raw_data.select_dtypes(include=['float'])
    converted_float = g2_float.apply(pd.to_numeric, downcast='float')
    optimized_g2[converted_float.columns] = converted_float
    return optimized_g2


def optimiz_memory_profile(raw_data):
    # Same idea, plus converting low-cardinality object columns to category.
    optimized_gl = raw_data.copy()

    gl_int = raw_data.select_dtypes(include=['int'])
    converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')
    optimized_gl[converted_int.columns] = converted_int

    gl_obj = raw_data.select_dtypes(include=['object']).copy()
    converted_obj = pd.DataFrame()
    for col in gl_obj.columns:
        num_unique_values = len(gl_obj[col].unique())
        num_total_values = len(gl_obj[col])
        if num_unique_values / num_total_values < 0.5:
            converted_obj.loc[:, col] = gl_obj[col].astype('category')
        else:
            converted_obj.loc[:, col] = gl_obj[col]
    optimized_gl[converted_obj.columns] = converted_obj
    return optimized_gl


def data_process(profile_path, ad_path, user_path):
    profile_data = pd.read_csv(profile_path)
    ad_data = pd.read_csv(ad_path)
    user_data = pd.read_csv(user_path)
    profile_data = optimiz_memory_profile(profile_data)
    ad_data = optimiz_memory(ad_data)
    user_data = optimiz_memory(user_data)
    profile_data.rename(columns={'user': 'userid'}, inplace=True)
    user_data.rename(columns={'new_user_class_level ': 'new_user_class_level'}, inplace=True)
    df1 = profile_data.merge(user_data, on="userid")
    data = df1.merge(ad_data, on="adgroup_id")
    data['brand'] = data['brand'].fillna('-1').astype('int32')
    data['pvalue_level'] = data['pvalue_level'].fillna('-1').astype('int32')
    data['new_user_class_level'] = data['new_user_class_level'].fillna('-1').astype('int32')
    data = data.sort_values(by='time_stamp', ascending=True)
    return data


def get_user_feature(data):
    # Collect each user's clicked ads into a '|'-joined history string.
    data_group = data[data['clk'] == 1]
    data_group = data_group[['userid', 'adgroup_id']].groupby('userid').agg(list).reset_index()
    data_group['user_hist'] = data_group['adgroup_id'].apply(lambda x: '|'.join([str(i) for i in x]))
    data = pd.merge(data_group.drop('adgroup_id', axis=1), data, on='userid')
    # Mean click rate per user; the merge below duplicates 'clk' into
    # clk_x / clk_y, which is why the label used later is 'clk_y'.
    data_group = data[['userid', 'clk']].groupby('userid').agg('mean').reset_index()
    # 'overall' is a leftover from the Amazon script; this rename is a no-op here.
    data_group.rename(columns={'overall': 'user_mean_rating'}, inplace=True)
    data = pd.merge(data_group, data, on='userid')
    return data


def get_var_feature(data, col):
    key2index = {}

    def split(x):
        key_ans = x.split('|')
        for key in key_ans:
            if key not in key2index:
                # Notice: input value 0 is a special "padding", so we do not
                # use 0 to encode valid features for sequence input.
                key2index[key] = len(key2index) + 1
        return list(map(lambda k: key2index[k], key_ans))

    var_feature = list(map(split, data[col].values))
    var_feature_length = np.array(list(map(len, var_feature)))
    max_len = max(var_feature_length)
    var_feature = pad_sequences(var_feature, maxlen=max_len, padding='post')
    return key2index, var_feature, max_len


def get_test_var_feature(data, col, key2index, max_len):
    print("user_hist_list: \n")

    def split(x):
        key_ans = x.split('|')
        for key in key_ans:
            if key not in key2index:
                # Notice: input value 0 is a special "padding", so we do not
                # use 0 to encode valid features for sequence input.
                key2index[key] = len(key2index) + 1
        return list(map(lambda k: key2index[k], key_ans))

    test_hist = list(map(split, data[col].values))
    test_hist = pad_sequences(test_hist, maxlen=max_len, padding='post')
    return test_hist


def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True


if __name__ == "__main__":
    embedding_dim = 32
    epoch = 1
    batch_size = 4096
    dropout = 0.5
    seed = 1023
    lr = 0.0001

    setup_seed(seed)

    profile_path = './data/taobao/raw_sample.csv'
    ad_path = './data/taobao/ad_feature.csv'
    user_path = './data/taobao/user_profile.csv'

    data = data_process(profile_path, ad_path, user_path)
    data = get_user_feature(data)

    sparse_features = ['userid', 'adgroup_id', 'pid', 'cms_segid', 'cms_group_id',
                       'final_gender_code', 'shopping_level', 'occupation', 'cate_id', 'campaign_id',
                       'customer', 'age_level', 'brand', 'pvalue_level', 'new_user_class_level']
    dense_features = ['price']

    user_sparse_features, user_dense_features = ['userid', 'cms_segid', 'cms_group_id', 'final_gender_code',
                                                 'age_level', 'pvalue_level', 'shopping_level', 'occupation',
                                                 'new_user_class_level'], []
    item_sparse_features, item_dense_features = ['adgroup_id', 'cate_id', 'campaign_id', 'customer',
                                                 'brand', 'pid'], ['price']
    target = ['clk_y']
    item_num = len(data['adgroup_id'].value_counts()) + 5

    # 1. Label-encode the sparse features and min-max scale the dense features.
    for feat in sparse_features:
        lbe = LabelEncoder()
        lbe.fit(data[feat])
        data[feat] = lbe.transform(data[feat])
    mms = MinMaxScaler(feature_range=(0, 1))
    mms.fit(data[dense_features])
    data[dense_features] = mms.transform(data[dense_features])

    train, test = train_test_split(data, test_size=0.2)

    # 2. Preprocess the variable-length sequence feature.
    user_key2index, train_user_hist, user_maxlen = get_var_feature(train, 'user_hist')
    user_varlen_feature_columns = [VarLenSparseFeat(SparseFeat('user_hist', vocabulary_size=item_num, embedding_dim=32),
                                                    maxlen=user_maxlen, combiner='mean', length_name=None)]

    user_feature_columns = [SparseFeat(feat, data[feat].nunique(), embedding_dim=embedding_dim)
                            for feat in user_sparse_features] + \
                           [DenseFeat(feat, 1) for feat in user_dense_features]
    item_feature_columns = [SparseFeat(feat, data[feat].nunique(), embedding_dim=embedding_dim)
                            for feat in item_sparse_features] + \
                           [DenseFeat(feat, 1) for feat in item_dense_features]

    user_feature_columns += user_varlen_feature_columns

    train_model_input = {name: train[name] for name in sparse_features + dense_features}
    train_model_input["user_hist"] = train_user_hist

    # 3. Define the model, then train.
    device = 'cpu'
    use_cuda = True
    if use_cuda and torch.cuda.is_available():
        print('cuda ready...')
        device = 'cuda'

    es = EarlyStopping(monitor='val_auc', min_delta=0, verbose=1,
                       patience=3, mode='max', baseline=None)
    mdckpt = ModelCheckpoint(filepath='fe_model_2.ckpt', monitor='val_auc',
                             mode='max', verbose=1, save_best_only=True, save_weights_only=True)
    model = IntTower(user_feature_columns, item_feature_columns, field_dim=16, task='binary', dnn_dropout=dropout,
                     device=device, user_head=4, item_head=4, user_filed_size=10, item_filed_size=6)

    model.compile("adam", "binary_crossentropy", metrics=['auc', 'accuracy', 'logloss'], lr=lr)

    model.fit(train_model_input, train[target].values, batch_size=batch_size,
              epochs=epoch, verbose=2, validation_split=0.2, callbacks=[es, mdckpt])

    # 4. Preprocess the test data and evaluate the best checkpoint.
    model.load_state_dict(torch.load('fe_model_2.ckpt'))
    # BatchNormalization and Dropout are disabled in eval mode.
    model.eval()

    test_user_hist = get_test_var_feature(test, 'user_hist', user_key2index, user_maxlen)
    test_model_input = {name: test[name] for name in sparse_features + dense_features}
    test_model_input["user_hist"] = test_user_hist

    pred_ts = model.predict(test_model_input, batch_size=500)

    print("test LogLoss", round(log_loss(test[target].values, pred_ts), 4))
    print("test AUC", round(roc_auc_score(test[target].values, pred_ts), 4))
```
