mlfoundations / tableshift
A benchmark for distribution shift in tabular data
Home Page: http://tableshift.org
License: MIT License
[Dataset]
Food Stamps dataset
[Question/Issue]
When using get_dataset("acsfoodstamps"), I get a KeyError (see screenshot).
[Steps to reproduce]
Just try to get Food Stamps dataset with get_dataset("acsfoodstamps").
[Dataset]
Public Health Insurance dataset
[Question/Issue]
When using get_dataset("acspubcov"), I get a Pandas ParserError (see screenshot).
[Steps to reproduce]
Just try to get Public Health Insurance dataset with get_dataset("acspubcov").
Describe the bug
There is some confusion about domain_label_colname in tableshift/core/features.py. What is its purpose, and how is it different from domain_split_varname?
Is there a reason it is not added in the self.get_passthrough_columns call? In get_passthrough_columns it seems this is an optional attribute, but it is only called from one place.
In any case, without it being added, the columns in the datasets are transformed (one-hot encoded or binned), and the column names are adjusted accordingly. At the point this code is being run, domain_label_colname == domain_label_varname.
If domain_label_colname is a categorical attribute (as is the case for the anes dataset), then the transformation mangles its column name, so by the time this code is called straight after:
if domain_label_colname:
    # Case: fit the domain label transformer and apply it.
    transformed.loc[:, domain_label_colname] = \
        self.fit_transform_domain_labels(
            transformed.loc[:, domain_label_colname])
we get an exception, as the column name no longer exists (four new columns with extended versions of that name are present instead). In the diabetes readmission dataset, the domain label column is an int, so it retains its column name when this code is called, and no exception is thrown.
# Fit the feature transformer and apply it.
self.fit_feature_transformer(data, train_idxs, passthrough_columns)
transformed = self.transform_features(data)
transformed = self._post_transform(
transformed, cast_dtypes=post_transform_cast_dtypes)
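To make the failure mode described above concrete, here is a minimal pandas sketch. It does not use TableShift's own transformer; pd.get_dummies stands in for the one-hot step, and the toy columns ("age", "party", "domain") and the passthrough_columns list are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; "domain" plays the role of a categorical
# domain label column, as in the anes case described above.
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "party": ["a", "b", "a", "c"],
    "domain": ["north", "south", "east", "west"],
})
domain_label_colname = "domain"
categorical_columns = ["party", "domain"]

# One-hot encoding every categorical column replaces "domain" with
# "domain_east", "domain_north", ... so the original name disappears.
encoded = pd.get_dummies(df, columns=categorical_columns)

try:
    encoded.loc[:, domain_label_colname]
    lookup_failed = False
except KeyError:
    # This mirrors the exception reported in the issue.
    lookup_failed = True

# Treating the domain label as a passthrough column: encode only
# the remaining categorical columns and leave "domain" untouched.
passthrough_columns = [domain_label_colname]
to_encode = [c for c in categorical_columns if c not in passthrough_columns]
kept = pd.get_dummies(df, columns=to_encode)
```

In this sketch, kept still contains a "domain" column, so a later lookup by domain_label_colname would succeed. Whether adding the column to the passthrough list is the right fix inside TableShift itself is a question for the maintainers.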
To Reproduce
Change the dataset to 'anes' in run_expt.py and run it
As I mentioned in the title, I found that I can achieve 91% OOD performance with XGBoost by preprocessing 'skill_id' with label encoding after changing the type of 'skill_id' from float to cat_type.
Why does changing the preprocessing method make such a huge difference in OOD performance (58 -> 91)?
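For readers unfamiliar with the preprocessing being described, the following is a rough pandas sketch of label-encoding a float identifier column. It is an assumption about what the poster did, not their actual pipeline; the sample values and the skill_id_encoded column name are made up for illustration, and it does not by itself explain the performance gap.

```python
import pandas as pd

# Hypothetical sample: skill_id stored as float, with one missing value.
df = pd.DataFrame({"skill_id": [1.0, 277.0, 1.0, 58.0, None]})

# Treat skill_id as a categorical identifier rather than a numeric quantity.
as_category = df["skill_id"].astype("category")

# Label encoding: each distinct id maps to an integer code assigned in
# sorted order of the ids; missing values are encoded as -1.
df["skill_id_encoded"] = as_category.cat.codes

print(df)
```

The encoded column holds small integer codes (here 0, 2, 0, 1, -1) instead of the raw float ids.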
[Dataset]
Hypertension dataset
[Question/Issue]
When using get_dataset("brfss_blood_pressure"), I get a BadZipFile error (see screenshot).
[Steps to reproduce]
Just try to get Hypertension dataset with get_dataset("brfss_blood_pressure").
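A BadZipFile error usually means the file that was downloaded is not a valid zip archive (for example, an HTML error page saved under a .zip name, or a truncated download). One way to check a cached file before retrying is sketched below using only the Python standard library. The is_valid_zip helper is a hypothetical name, not part of TableShift, and the location of the cached file depends on your own cache directory.

```python
import zipfile
from pathlib import Path


def is_valid_zip(path):
    """Return True if path exists and is a readable, uncorrupted zip archive."""
    path = Path(path)
    # is_zipfile checks for the zip end-of-central-directory signature.
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member,
        # or None if every member passes its CRC check.
        return archive.testzip() is None
```

If the check returns False for a cached download, deleting that file and downloading it again (or inspecting its contents in a text editor) may help narrow down whether the source URL has changed.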
[Dataset]
ASSISTments dataset
[Question/Issue]
When using get_dataset("assistments"), I get an error (see screenshot). A dataset is downloaded, but I presume it is not the correct one: its name is skillbuilder-data-2009-2010.zip. I replicated the issue on two computers.
[Steps to reproduce]
Just try to get ASSISTments dataset with get_dataset("assistments").
[Dataset]
MIMIC-IV
[Question/Issue]
I am wondering whether you include the temporal feature for the MIMIC-IV datasets. If so, would it be possible to access the pre-processed tabular data with this feature included?
I would like to examine the temporal shift patterns in this data, but I am having a hard time processing it due to my lack of domain knowledge.
Thank you
[Dataset]
Unemployment dataset
[Question/Issue]
When using get_dataset("acsunemployment"), I get a KeyError (see screenshot).
[Steps to reproduce]
Just try to get Unemployment dataset with get_dataset("acsunemployment").