mlfoundations / tableshift

A benchmark for distribution shift in tabular data

Home Page: http://tableshift.org

License: MIT License

Python 99.98% Dockerfile 0.02%

tableshift's Issues

Problem to access the Food Stamps Dataset

[Dataset]
Food Stamps dataset

[Question/Issue]
When using get_dataset("acsfoodstamps"), I get a KeyError (see screenshot).

[Screenshots]

[Steps to reproduce]
Just try to get Food Stamps dataset with get_dataset("acsfoodstamps").

Problem to access the Public Health Insurance Dataset

[Dataset]
Public Health Insurance dataset

[Question/Issue]
When using get_dataset("acspubcov"), I get a Pandas ParserError (see screenshot).

[Screenshots]

[Steps to reproduce]
Just try to get Public Health Insurance dataset with get_dataset("acspubcov").

Confusion about domain_label_colname in tableshift/core/features.py

Describe the bug
There is some confusion about domain_label_colname in tableshift/core/features.py. What is its purpose, and how is it different from domain_split_varname?

Is there a reason it is not passed in the self.get_passthrough_columns call? In get_passthrough_columns it appears to be an optional argument, but that function is only called from one place.

In any case, because it is not added, the columns in the datasets are transformed (one-hot encoded or binned), and the column names are adjusted accordingly. At the point this code is run, domain_label_colname == domain_label_varname.

If domain_label_colname is a categorical attribute (as is the case for the anes dataset), then the transformation changes its column name, so by the time this code is called immediately afterwards:

if domain_label_colname:
    # Case: fit the domain label transformer and apply it.
    transformed.loc[:, domain_label_colname] = \
        self.fit_transform_domain_labels(
            transformed.loc[:, domain_label_colname])

we get an exception, because the column name no longer exists (four new columns with extended versions of that name are present instead). In the diabetes readmission dataset, the domain label column is an int, so it retains its column name when this code is called, and no exception is thrown.

# Fit the feature transformer and apply it.
self.fit_feature_transformer(data, train_idxs, passthrough_columns)
transformed = self.transform_features(data)

transformed = self._post_transform(
    transformed, cast_dtypes=post_transform_cast_dtypes)

To Reproduce
Change the dataset to 'anes' in run_expt.py and run it
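
The renaming described in this issue can be illustrated outside tableshift with plain pandas. This is only a minimal sketch under stated assumptions: the toy frame, the column name "domain", and the use of pd.get_dummies are stand-ins chosen for illustration, not the library's actual transformer.

```python
import pandas as pd

# Toy frame; "domain" stands in for a categorical domain label column
# (hypothetical name, not taken from tableshift).
df = pd.DataFrame({
    "age": [30, 41, 25],
    "domain": ["a", "b", "c"],
})

# One-hot encoding replaces "domain" with one column per category.
encoded = pd.get_dummies(df, columns=["domain"])
print(list(encoded.columns))

# The original name is gone, so a label-based lookup now fails.
try:
    encoded.loc[:, "domain"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# An integer column is not one-hot encoded by default and keeps its name.
df_int = pd.DataFrame({"age": [30, 41], "domain": [0, 1]})
passthrough = pd.get_dummies(df_int)
```

In this sketch the categorical column disappears under its original name while the integer column survives, which matches the difference reported between the anes and diabetes readmission datasets.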

OOD Performance of Assistments is 91.xx% with xgboost when label encoding 'skill_id'

As mentioned in the title, I found that I can achieve 91% OOD performance with xgboost by label encoding 'skill_id' after changing its type from float to cat_type.

Why does changing the preprocessing method make such a large difference in OOD performance (58 -> 91)?
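
To make the preprocessing change concrete, here is a minimal pandas sketch of label encoding a float identifier column. The values are made up for illustration, and this is not tableshift's preprocessing code; it only shows the difference between keeping 'skill_id' as a float and converting it to integer category codes, without attempting to explain the reported gap.

```python
import pandas as pd

# Hypothetical values; they only mimic a float-typed identifier column.
df = pd.DataFrame({"skill_id": [277.0, 10.0, 277.0, 54.0]})

# Kept as a float, the column is seen as an ordered numeric quantity.
as_float = df["skill_id"]

# Label encoding: cast to a categorical dtype, then take integer codes.
# Categories are the sorted unique values, so each code is the rank of the id.
as_category = df["skill_id"].astype("category")
codes = as_category.cat.codes

print(list(as_category.cat.categories))
print(codes.tolist())
```

Note that the codes depend on which identifiers are present when the encoding is fitted, so the same 'skill_id' can map to different codes in different splits unless the categories are fixed in advance.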

Problem to access the Hypertension dataset

[Dataset]
Hypertension dataset

[Question/Issue]
When using get_dataset("brfss_blood_pressure"), I get a BadZipFile error (see screenshot).

[Screenshots]

[Steps to reproduce]
Just try to get Hypertension dataset with get_dataset("brfss_blood_pressure").
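
One way to narrow down a BadZipFile error is to check whether the downloaded file is a zip archive at all. The sketch below uses only the Python standard library. It assumes, without confirmation from this issue, that the download may have produced something other than an archive (for example, an HTML error page); the helper name looks_like_zip and the temporary files are hypothetical.

```python
import os
import tempfile
import zipfile


def looks_like_zip(path: str) -> bool:
    """Return True if the file at `path` is recognized as a zip archive."""
    return zipfile.is_zipfile(path)


# Demonstration with two temporary files: a real archive and an HTML page
# saved under a .zip name.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.zip")
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")

    bad = os.path.join(tmp, "bad.zip")
    with open(bad, "w") as f:
        f.write("<html>Not Found</html>")

    good_ok = looks_like_zip(good)
    bad_ok = looks_like_zip(bad)

    # Opening the non-archive reproduces the exception type from the report.
    try:
        zipfile.ZipFile(bad)
        raised_bad_zip = False
    except zipfile.BadZipFile:
        raised_bad_zip = True
```

Running looks_like_zip on the file in the local cache directory would show whether the problem lies in the download itself or in a later step.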

Problem to access ASSISTments dataset

[Dataset]
ASSISTments dataset

[Question/Issue]
When using get_dataset("assistments"), I get an error (see screenshot). One dataset is downloaded, but not the proper one, I presume. Its name is skillbuilder-data-2009-2010.zip. I replicated the issue on two computers.

[Screenshots]

[Steps to reproduce]
Just try to get ASSISTments dataset with get_dataset("assistments").

Temporal features for MIMIC-IV datasets

[Dataset]
MIMIC-IV

[Question/Issue]
I am wondering whether you include the temporal feature for the MIMIC-IV datasets. If so, would it be possible to access the pre-processed tabular data with this feature included?

I would like to examine the temporal shift patterns in this data, and I am having a hard time processing it due to my lack of domain knowledge.

Thank you

Problem to access the Unemployment Dataset

[Dataset]
Unemployment dataset

[Question/Issue]
When using get_dataset("acsunemployment"), I get a KeyError (see screenshot).

[Screenshots]

[Steps to reproduce]
Just try to get Unemployment dataset with get_dataset("acsunemployment").
