mlfoundations / tableshift

A benchmark for distribution shift in tabular data

Home Page: http://tableshift.org

License: MIT License

Python 99.98% Dockerfile 0.02%

tableshift's Issues

Problem accessing the Food Stamps dataset

[Dataset]
Food Stamps dataset

[Question/Issue]
When using get_dataset("acsfoodstamps"), I get a KeyError (see screenshot).

[Screenshots]
Screenshot 2024-01-23 at 10 45 52

[Steps to reproduce]
Try to load the Food Stamps dataset with get_dataset("acsfoodstamps").

Problem accessing the Public Health Insurance dataset

[Dataset]
Public Health Insurance dataset

[Question/Issue]
When using get_dataset("acspubcov"), I get a Pandas ParserError (see screenshot).

[Screenshots]
Screenshot 2024-01-23 at 10 19 19

[Steps to reproduce]
Try to load the Public Health Insurance dataset with get_dataset("acspubcov").

Confusion about domain_label_colname in tableshift/core/features.py

Describe the bug
There is some confusion about domain_label_colname in tableshift/core/features.py. What is its purpose, and how is it different from domain_split_varname?

Is there a reason it is not added in the self.get_passthrough_columns call? In get_passthrough_columns it seems this is an optional attribute, but it is only being called from one place.

In any case, because it is not added, the columns in the dataset are transformed (one-hot encoded or binned) and the column names are adjusted accordingly. At the point this code is run, domain_label_colname == domain_label_varname.

If domain_label_colname is a categorical attribute (as is the case for the anes dataset), the transformation changes its column name, so by the time this code is called straight after:

    if domain_label_colname:
        # Case: fit the domain label transformer and apply it.
        transformed.loc[:, domain_label_colname] = \
            self.fit_transform_domain_labels(
                transformed.loc[:, domain_label_colname])

we get an exception, because the column name no longer exists (four new columns with an extended version of that name are present instead). In the diabetes readmission dataset, the domain label column is an int, so it retains its column name when this code is called, and no exception is thrown.

    # Fit the feature transformer and apply it.
    self.fit_feature_transformer(data, train_idxs, passthrough_columns)
    transformed = self.transform_features(data)

    transformed = self._post_transform(
        transformed, cast_dtypes=post_transform_cast_dtypes)
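
The renaming described above can be illustrated outside of tableshift with a minimal pandas sketch. Everything here is an assumption made for illustration: the toy columns (party, region, site_id) are hypothetical, and pd.get_dummies only stands in for the project's feature transformer. It shows that a one-hot encoded categorical column can no longer be selected by its original name, while an integer column can, and that leaving the domain label out of the encoding keeps its name intact.

```python
import pandas as pd

# Hypothetical toy data; these column names are not taken from tableshift.
data = pd.DataFrame({
    "party": ["a", "b", "a"],               # ordinary categorical feature
    "region": ["north", "south", "west"],   # categorical domain label
    "site_id": [1, 2, 3],                   # integer domain label
})

# Stand-in for the feature transformer: one-hot encode both categorical columns.
transformed = pd.get_dummies(data, columns=["party", "region"])
print(list(transformed.columns))


def lookup_fails(frame: pd.DataFrame, colname: str) -> bool:
    """Return True if selecting `colname` with .loc raises KeyError."""
    try:
        frame.loc[:, colname]
    except KeyError:
        return True
    return False


# "region" was replaced by prefixed columns; "site_id" kept its name.
print(lookup_fails(transformed, "region"))
print(lookup_fails(transformed, "site_id"))

# Possible workaround (an assumption, not the project's fix):
# treat the domain label as a passthrough column and encode only the features.
passthrough = pd.get_dummies(data, columns=["party"])
print(lookup_fails(passthrough, "region"))
```

If this matches what happens in the library, adding the domain label column to the passthrough columns would be one way to avoid the exception, though the maintainers may prefer a different fix.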

To Reproduce
Change the dataset to 'anes' in run_expt.py and run it

Problem accessing the Hypertension dataset

[Dataset]
Hypertension dataset

[Question/Issue]
When using get_dataset("brfss_blood_pressure"), I get a BadZipFile error (see screenshot).

[Screenshots]
Screenshot 2024-01-23 at 10 10 56

[Steps to reproduce]
Try to load the Hypertension dataset with get_dataset("brfss_blood_pressure").
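
One common cause of a BadZipFile error is a download that is not actually a ZIP archive (for example, an HTML error page saved under a .zip name). Whether that is what happens here is an assumption, since the screenshot is not reproduced. The sketch below uses only the Python standard library and in-memory bytes; the helper name looks_like_zip is hypothetical.

```python
import io
import zipfile


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the bytes look like a ZIP archive."""
    return zipfile.is_zipfile(io.BytesIO(payload))


# Build a genuine in-memory archive to compare against.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "a,b\n1,2\n")
good_download = buffer.getvalue()

# A stand-in for a failed download: HTML bytes instead of an archive.
bad_download = b"<html><body>404 Not Found</body></html>"

print(looks_like_zip(good_download))
print(looks_like_zip(bad_download))

# Opening the non-archive bytes raises the same exception class as in the report.
try:
    zipfile.ZipFile(io.BytesIO(bad_download))
    raised = False
except zipfile.BadZipFile:
    raised = True
print(raised)
```

Running the same check on the file that get_dataset downloaded (with zipfile.is_zipfile(path)) would show whether the file itself is the problem.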

Problem accessing the ASSISTments dataset

[Dataset]
ASSISTments dataset

[Question/Issue]
When using get_dataset("assistments"), I get an error (see screenshot). One dataset is downloaded, but not the proper one, I presume. Its name is skillbuilder-data-2009-2010.zip. I replicated the issue on two computers.

[Screenshots]
Screenshot from 2023-12-20 22-31-28

[Steps to reproduce]
Try to load the ASSISTments dataset with get_dataset("assistments").

Temporal features for MIMIC-IV datasets

[Dataset]
MIMIC-IV

[Question/Issue]
Do you include the temporal feature for the MIMIC-IV datasets? If so, would it be possible to access the pre-processed tabular data with this feature included?

I would like to examine temporal shift patterns in this data, and I am having a hard time processing the data because I lack domain knowledge.

Thank you

Problem to access the Unemployment Dataset

[Dataset]
Unemployment dataset

[Question/Issue]
When using get_dataset("acsunemployment"), I get a KeyError (see screenshot).

[Screenshots]
Screenshot 2024-01-23 at 10 29 16

[Steps to reproduce]
Try to load the Unemployment dataset with get_dataset("acsunemployment").
