Comments (3)
Thanks for creating an issue @alberto-jj!
Could you provide synthetic data that enables a minimal reproducible example? (or maybe the Adult dataset could be used here?)
Currently AutoGluon does not run its global preprocessing stages on a per-fold basis; they are applied once across all of the training data, which would indeed cause the issue in your scenario. The way to do it without refactoring the AutoGluon code too much is to inject your transform into the `AbstractModel._preprocess` method (example ref), which operates on a per-fold basis as you desire.
This is a bit hacky; a proper solution is something I would need to think about, and it would be informed by being able to run a minimal reproducible example of your script above.
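To illustrate the per-fold pattern this refers to, here is a minimal, self-contained sketch (plain numpy, not AutoGluon itself; `PerFoldScalingModel` is a hypothetical stand-in, and the `is_train` flag mimics the convention used in AutoGluon's custom-model examples): fit the transform's statistics inside `_preprocess` on the training fold only, then reuse them for validation/test data.

```python
import numpy as np

class PerFoldScalingModel:
    """Hypothetical stand-in for a model whose _preprocess fits scaling
    statistics on the training fold only, then reuses them for other data."""

    def __init__(self):
        self._mean = None
        self._std = None

    def _preprocess(self, X, is_train=False, **kwargs):
        X = np.asarray(X, dtype=float)
        if is_train:
            # Fit scaling statistics on this fold's training data only.
            self._mean = X.mean(axis=0)
            self._std = X.std(axis=0) + 1e-12
        # Validation/test rows reuse the training-fold statistics,
        # so no information leaks across the fold boundary.
        return (X - self._mean) / self._std

train_fold = np.array([[1.0, 10.0], [3.0, 30.0]])
val_fold = np.array([[2.0, 20.0]])

model = PerFoldScalingModel()
Xt = model._preprocess(train_fold, is_train=True)
Xv = model._preprocess(val_fold)  # transformed with train-fold mean/std
```

A leakage-aware transform like reComBat would follow the same shape: `fit` only when `is_train=True`, `transform` always.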
Thanks @alberto-jj for the code and data! I will put it on my todo list and get back to you once I've given it a try.
Thanks a lot @Innixma for such a quick response!
Please find attached the code for my current flow (with fit_transform re-scaling for the train and test splits).
You can also find an example dataset here: https://drive.google.com/file/d/1IF7jQrPzid4oSLast7apa7sAc7zHgs2k/view?usp=sharing
```python
# pip install --no-deps recombat
# --no-deps is needed because reComBat depends on the old "sklearn" PyPI
# package rather than "scikit-learn"
from sklearn.model_selection import train_test_split
from reComBat import reComBat
from autogluon.tabular import TabularPredictor

df = wide_short  # the example dataset linked above

# Split train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, :], df["group"], test_size=0.3, random_state=1997, stratify=df["group"]
)

# Define the re-scaling model (reComBat) to be applied before TabularPredictor.fit()
model = reComBat(
    parametric=True,
    model='elastic_net',
    config={'alpha': 1e-5},
    n_jobs=7,
    verbose=True,
)

# Define adjustment covariates from X_train for the desired re-scaling
covars = X_train.iloc[:, 1:5].copy()
covars.rename(columns={'age': 'age_numerical'}, inplace=True)
covars.center = covars.center.astype(str)
covars.group = covars.group.astype(str)
covars.gender = covars.gender.astype(str)

# Fit the re-scaling (reComBat) model on the train data
model.fit(data=X_train.iloc[:, 5:], batches=covars.center, X=covars.drop(['center'], axis=1))

# Re-scale the train data using the fitted model
transformed_train_data = model.transform(data=X_train.iloc[:, 5:], batches=covars.center, X=covars.drop(['center'], axis=1))

# Define adjustment covariates from X_test for the desired re-scaling
covars = X_test.iloc[:, 1:5].copy()
covars.rename(columns={'age': 'age_numerical'}, inplace=True)
covars.center = covars.center.astype(str)
covars.group = covars.group.astype(str)
covars.gender = covars.gender.astype(str)

# Re-scale the test data using the model fitted on the train data
transformed_test_data = model.transform(data=X_test.iloc[:, 5:], batches=covars.center, X=covars.drop(['center'], axis=1))

X_train_harmonized = X_train.iloc[:, :5].join(transformed_train_data)
X_test_harmonized = X_test.iloc[:, :5].join(transformed_test_data)

label = 'group'
# Columns to drop before fitting the AutoML predictor
columns_to_drop = ['participant_id', 'age', 'gender', 'center']
train = X_train_harmonized.drop(columns=columns_to_drop)

time_limit = 3600 * 4
save_path = 'automl/results/harmonized_featurest'

predictor = TabularPredictor(
    label=label, path=save_path, eval_metric='balanced_accuracy', sample_weight='auto_weight'
).fit(train, presets='best_quality', auto_stack=True, time_limit=time_limit)

# Evaluate the best model on the re-scaled hold-out test set
test_data = X_test_harmonized.drop(columns=columns_to_drop)
y_test = test_data[label]
y_pred = predictor.predict(test_data.drop(columns=[label]))
predictor.evaluate(test_data)
```
It would indeed be very desirable for me to apply this type of feature preprocessing (re-scaling) within the training/validation folds, and it seems that not handling those effects on a per-fold basis could lead to data leakage (at least as shown for multi-site study designs with neuroimaging features: https://www.nature.com/articles/s41597-023-02421-7).
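As a toy illustration of that leakage concern (generic numpy mean-centering, not reComBat): the held-out rows come out different depending on whether the centering statistics were fit on all rows or on the training rows only, which is exactly the information that leaks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
split = 70  # first 70 rows are "train", the rest are "test"

# Leaky: statistics computed on ALL rows, including the held-out split.
mean_all = X.mean(axis=0)
X_test_leaky = X[split:] - mean_all

# Leak-free: statistics computed on the training rows only.
mean_train = X[:split].mean(axis=0)
X_test_clean = X[split:] - mean_train

# The two versions differ, because the leaky mean has seen the test rows.
leak_detected = not np.allclose(X_test_leaky, X_test_clean)
```

The same argument applies to any fitted transform (scaling, ComBat-style batch correction, imputation): fitting it across fold boundaries lets test-fold information influence the training-fold features.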
It would also be nice to keep AutoGluon's default preprocessing/scaling after my re-scaling step, since I guess your preprocessing helps avoid issues in some models caused by differences in scale across features.
Once again, thanks a lot for the help!
edit: added the pip install trick to get reComBat working without running into the conflict between the old `sklearn` PyPI package and the new `scikit-learn`.