
Apply fit_transform() to each of the training folds, and transform to each of the validation folds in TabularPredictor (autogluon issue, open)

alberto-jj commented on May 27, 2024
Apply fit_transform() to each of the training folds, and transform to each of the validation folds in TabularPredictor


Comments (3)

Innixma commented on May 27, 2024

Thanks for creating an issue @alberto-jj!

Could you provide synthetic data that enables a minimally reproducible example? (or maybe the Adult dataset used here?)

Currently AutoGluon does not apply global preprocessing on a per-fold basis; instead it is applied once across all of the training data, which could indeed lead to issues in your scenario. The way to do this without heavily refactoring the AutoGluon code is to inject your preprocessing into the AbstractModel._preprocess method (example ref), which operates on a per-fold basis, as you desire.

This is a bit hacky; a proper solution would need more thought, and would be informed by being able to run a minimal reproducible example of your script above.
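To make the per-fold contract of that hook concrete, here is a minimal, self-contained sketch of the fit-on-train / transform-on-val pattern a model-level `_preprocess` follows: the transformer's statistics are learned only when the method is called on the fold's training data, and the same fitted state is reused for validation data. The class name and the `is_train` flag here are illustrative stand-ins, not AutoGluon's actual API; a real custom model would subclass `AbstractModel` and call `super()._preprocess(X, **kwargs)` first.

```python
import numpy as np

class PerFoldStandardizer:
    """Illustrative stand-in for a model whose _preprocess fits its
    transformer on the training fold only and reuses the fitted
    statistics for the validation/test data (no leakage across folds)."""

    def __init__(self):
        self._mean = None
        self._std = None

    def _preprocess(self, X, is_train=False, **kwargs):
        X = np.asarray(X, dtype=float)
        if is_train:
            # Fit stage: learn fold-local statistics from the train split only.
            self._mean = X.mean(axis=0)
            self._std = X.std(axis=0)
            self._std[self._std == 0] = 1.0  # guard against constant columns
        # Transform stage: apply the statistics learned on the training fold.
        return (X - self._mean) / self._std
```

A reComBat fit/transform pair would slot into the same two branches in place of the mean/std logic.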


Innixma commented on May 27, 2024

Thanks @alberto-jj for the code and data! I will put it on my todo list and get back to you when I've given it a try.


alberto-jj commented on May 27, 2024

Thanks a lot @Innixma for such a quick response.
Please find attached the code with my current flow (with fit_transform re-scaling for the train and test splits).
You can also find an example dataset here: https://drive.google.com/file/d/1IF7jQrPzid4oSLast7apa7sAc7zHgs2k/view?usp=sharing

# pip install --no-deps recombat  # --no-deps because reComBat depends on the old "sklearn" PyPI package, not "scikit-learn"

from sklearn.model_selection import train_test_split
from reComBat import reComBat
from autogluon.tabular import TabularPredictor

df = wide_short

# Split train and test sets, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    df, df["group"], test_size=0.3, random_state=1997, stratify=df["group"])

# Define the re-scaling model (reComBat) to be applied before TabularPredictor.fit()
model = reComBat(parametric=True,
                 model='elastic_net',
                 config={'alpha': 1e-5},
                 n_jobs=7,
                 verbose=True)

# Adjustment covariates from X_train for the desired re-scaling
covars_train = X_train.iloc[:, 1:5].copy()
covars_train.rename(columns={'age': 'age_numerical'}, inplace=True)
covars_train.center = covars_train.center.astype(str)
covars_train.group = covars_train.group.astype(str)
covars_train.gender = covars_train.gender.astype(str)

# Fit the re-scaling (reComBat) model on the train data only
model.fit(data=X_train.iloc[:, 5:], batches=covars_train.center,
          X=covars_train.drop(['center'], axis=1))

# Re-scale the train data using the fitted model
transformed_train_data = model.transform(data=X_train.iloc[:, 5:],
                                         batches=covars_train.center,
                                         X=covars_train.drop(['center'], axis=1))

# Adjustment covariates from X_test for the desired re-scaling
covars_test = X_test.iloc[:, 1:5].copy()
covars_test.rename(columns={'age': 'age_numerical'}, inplace=True)
covars_test.center = covars_test.center.astype(str)
covars_test.group = covars_test.group.astype(str)
covars_test.gender = covars_test.gender.astype(str)

# Re-scale the test data using the model fitted on the train data (transform only, no re-fit)
transformed_test_data = model.transform(data=X_test.iloc[:, 5:],
                                        batches=covars_test.center,
                                        X=covars_test.drop(['center'], axis=1))

X_train_harmonized = X_train.iloc[:, :5].join(transformed_train_data)
X_test_harmonized = X_test.iloc[:, :5].join(transformed_test_data)

label = 'group'

# Identifier and covariate columns to drop before fitting AutoML
columns_to_drop = ['participant_id', 'age', 'gender', 'center']

train = X_train_harmonized.drop(columns=columns_to_drop)
time_limit = 3600 * 4  # avoid shadowing the stdlib `time` module

save_path = 'automl/results/harmonized_featurest'
predictor = TabularPredictor(label=label, path=save_path,
                             eval_metric='balanced_accuracy',
                             sample_weight='auto_weight')
predictor = predictor.fit(train, presets='best_quality',
                          auto_stack=True, time_limit=time_limit)

# Evaluate the best model on the re-scaled hold-out test set
test_data = X_test_harmonized.drop(columns=columns_to_drop)
y_pred = predictor.predict(test_data.drop(columns=[label]))
predictor.evaluate(test_data)


It would indeed be very desirable for me to use this type of feature preprocessing (re-scaling) across training/validation folds; it seems that not correcting for those effects can lead to data leakage (at least as shown for multi-site study designs with neuroimaging features: https://www.nature.com/articles/s41597-023-02421-7).
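For the scikit-learn side of this flow, the standard leakage-free pattern is to wrap the transformer and estimator in a Pipeline and let cross-validation re-fit the transformer on each training fold while only transforming the corresponding validation fold. A minimal sketch with StandardScaler as a stand-in (a reComBat step would occupy the same slot if wrapped as a fit/transform transformer; the synthetic data here is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data in place of the real feature matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is re-fit on each training fold and only applied
# (transform, never fit_transform) to that fold's validation split.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring='balanced_accuracy')
```

Calling `fit_transform` on the full dataset before splitting, by contrast, bakes validation-fold statistics into the transform and is exactly the leakage described above.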

It would also be nice to keep the default preprocessing/scaling that AutoGluon applies after my desired re-scaling, since I assume your preprocessing helps avoid issues in some models caused by differences in scale across features.

Once again, thanks a lot for the help!

edit: added the pip install trick to get reComBat working around the old PyPI sklearn vs. new scikit-learn packaging issue

