
Apply fit_transform() to each of the training folds, and transform to each of the validation folds in TabularPredictor (autogluon issue, open)

alberto-jj commented on May 27, 2024
Apply fit_transform() to each of the training folds, and transform to each of the validation folds in TabularPredictor


Comments (3)

Innixma commented on May 27, 2024

Thanks for creating an issue @alberto-jj!

Could you provide synthetic data that enables a minimally reproducible example? (or maybe the Adult dataset used here?)

Currently AutoGluon does not apply global preprocessing on a per-fold basis; instead it is applied once across all of the training data, which could indeed lead to issues in your scenario. The way to do this without heavily refactoring the AutoGluon code is to inject your preprocessing into the AbstractModel._preprocess method (example ref), which operates on a per-fold basis, as you desire.

This is a bit hacky; a proper solution would need more thought, and would be informed by being able to run a minimal reproducible example of your script above.
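To make the per-fold contract of that hook concrete, here is a minimal, self-contained sketch of the fit-on-train / transform-on-val pattern a model-level `_preprocess` follows: the transformer's statistics are learned only when the method is called on the fold's training data, and the same fitted state is reused for validation data. The class name and the `is_train` flag here are illustrative stand-ins, not AutoGluon's actual API; a real custom model would subclass `AbstractModel` and call `super()._preprocess(X, **kwargs)` first.

```python
import numpy as np

class PerFoldStandardizer:
    """Illustrative stand-in for a model whose _preprocess fits its
    transformer on the training fold only and reuses the fitted
    statistics for the validation/test data (no leakage across folds)."""

    def __init__(self):
        self._mean = None
        self._std = None

    def _preprocess(self, X, is_train=False, **kwargs):
        X = np.asarray(X, dtype=float)
        if is_train:
            # Fit stage: learn fold-local statistics from the train split only.
            self._mean = X.mean(axis=0)
            self._std = X.std(axis=0)
            self._std[self._std == 0] = 1.0  # guard against constant columns
        # Transform stage: apply the statistics learned on the training fold.
        return (X - self._mean) / self._std
```

A reComBat fit/transform pair would slot into the same two branches in place of the mean/std logic.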


Innixma commented on May 27, 2024

Thanks @alberto-jj for the code and data! I will put it on my todo list and get back to you when I've given it a try.


alberto-jj commented on May 27, 2024

Thanks a lot @Innixma for such a quick response.
Please find attached the code with my current flow (with fit_transform re-scaling for the train and test splits).
You can also find an example dataset here: https://drive.google.com/file/d/1IF7jQrPzid4oSLast7apa7sAc7zHgs2k/view?usp=sharing

# pip install --no-deps recombat  # --no-deps because reComBat depends on the old "sklearn" PyPI package, not "scikit-learn"

from sklearn.model_selection import train_test_split
from reComBat import reComBat
from autogluon.tabular import TabularPredictor

df = wide_short

# Split train and test sets, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    df, df["group"], test_size=0.3, random_state=1997, stratify=df["group"])

# Define the re-scaling model (reComBat) to be applied before TabularPredictor.fit()
model = reComBat(parametric=True,
                 model='elastic_net',
                 config={'alpha': 1e-5},
                 n_jobs=7,
                 verbose=True)

# Adjustment covariates from X_train for the desired re-scaling
covars_train = X_train.iloc[:, 1:5].copy()
covars_train.rename(columns={'age': 'age_numerical'}, inplace=True)
covars_train.center = covars_train.center.astype(str)
covars_train.group = covars_train.group.astype(str)
covars_train.gender = covars_train.gender.astype(str)

# Fit the re-scaling (reComBat) model on the train data only
model.fit(data=X_train.iloc[:, 5:], batches=covars_train.center,
          X=covars_train.drop(['center'], axis=1))

# Re-scale the train data using the fitted model
transformed_train_data = model.transform(data=X_train.iloc[:, 5:],
                                         batches=covars_train.center,
                                         X=covars_train.drop(['center'], axis=1))

# Adjustment covariates from X_test for the desired re-scaling
covars_test = X_test.iloc[:, 1:5].copy()
covars_test.rename(columns={'age': 'age_numerical'}, inplace=True)
covars_test.center = covars_test.center.astype(str)
covars_test.group = covars_test.group.astype(str)
covars_test.gender = covars_test.gender.astype(str)

# Re-scale the test data using the model fitted on the train data (transform only, no re-fit)
transformed_test_data = model.transform(data=X_test.iloc[:, 5:],
                                        batches=covars_test.center,
                                        X=covars_test.drop(['center'], axis=1))

X_train_harmonized = X_train.iloc[:, :5].join(transformed_train_data)
X_test_harmonized = X_test.iloc[:, :5].join(transformed_test_data)

label = 'group'

# Identifier and covariate columns to drop before fitting AutoML
columns_to_drop = ['participant_id', 'age', 'gender', 'center']

train = X_train_harmonized.drop(columns=columns_to_drop)
time_limit = 3600 * 4  # avoid shadowing the stdlib `time` module

save_path = 'automl/results/harmonized_featurest'
predictor = TabularPredictor(label=label, path=save_path,
                             eval_metric='balanced_accuracy',
                             sample_weight='auto_weight')
predictor = predictor.fit(train, presets='best_quality',
                          auto_stack=True, time_limit=time_limit)

# Evaluate the best model on the re-scaled hold-out test set
test_data = X_test_harmonized.drop(columns=columns_to_drop)
y_pred = predictor.predict(test_data.drop(columns=[label]))
predictor.evaluate(test_data)


It would indeed be very desirable for me to use this type of feature preprocessing (re-scaling) across training/validation folds; it seems that not correcting for those effects can lead to data leakage (at least as shown for multi-site study designs with neuroimaging features: https://www.nature.com/articles/s41597-023-02421-7).
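For the scikit-learn side of this flow, the standard leakage-free pattern is to wrap the transformer and estimator in a Pipeline and let cross-validation re-fit the transformer on each training fold while only transforming the corresponding validation fold. A minimal sketch with StandardScaler as a stand-in (a reComBat step would occupy the same slot if wrapped as a fit/transform transformer; the synthetic data here is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data in place of the real feature matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is re-fit on each training fold and only applied
# (transform, never fit_transform) to that fold's validation split.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring='balanced_accuracy')
```

Calling `fit_transform` on the full dataset before splitting, by contrast, bakes validation-fold statistics into the transform and is exactly the leakage described above.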

It would also be nice to keep the default preprocessing/scaling that AutoGluon applies after my desired re-scaling, since I assume your preprocessing helps avoid issues in some models caused by differences in scale across features.

Once again, thanks a lot for the help!

edit: added the pip install trick to get reComBat working around the old PyPI sklearn vs. new scikit-learn packaging issue

