
Missingness handling (synthcity issue, 5 comments, CLOSED)

dionman commented on May 28, 2024
Missingness handling

from synthcity.

Comments (5)

robsdavis commented on May 28, 2024

Hi @dionman. Currently synthcity assumes that there is no missing data, so fitting a tabular model on a dataset with missing values will throw an error, something like "ValueError: Input X contains NaN". The library thus assumes that you have dealt with the missingness before fitting a model with the dataloader, using third-party methods (such as HyperImpute, as you suggest).
I agree a tutorial on combining with HyperImpute would be very useful.
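
A minimal sketch of that pre-fit step, assuming pandas and NumPy are available. It counts NaN values and fills them with a simple median/mode rule. That rule is only a stand-in for a dedicated imputer such as HyperImpute, and `simple_impute` is a hypothetical helper, not part of synthcity.

```python
import numpy as np
import pandas as pd


def simple_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs: median for numeric columns, most frequent value otherwise.

    Hypothetical stand-in for a dedicated imputer; not a synthcity API.
    Assumes every column has at least one non-missing value.
    """
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy frame with one missing entry in a numeric and in a categorical column.
X = pd.DataFrame(
    {
        "age": [41, 23, 46, 70],
        "TSH": [1.3, 4.1, np.nan, 0.16],
        "sex": ["F", "F", "M", None],
    }
)

print(X.isna().sum())  # missing values per column before imputation
X_imputed = simple_impute(X)
print(X_imputed.isna().sum())  # all zeros after imputation
```

Only once the frame contains no NaN values would it be passed to a plugin's fit method.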


dionman commented on May 28, 2024

I'm pretty sure the below won't throw an error. Is the column type casting due to the missingness symbol undesirable?

input_data.csv

age,sex,on thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4U,FTI measured,FTI,TBG measured,referral source,binaryClass
41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125,t,1.14,t,109,f,SVHC,P
23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2,t,102,f,?,f,?,f,other,P
46,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,?,t,109,t,0.91,t,120,f,other,P
70,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175,f,?,f,?,f,other,P
70,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61,t,0.87,t,70,f,SVI,P
18,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,?,t,183,t,1.3,t,141,f,other,P
59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,t,72,t,0.92,t,78,f,other,P
80,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80,t,0.7,t,115,f,SVI,P
66,F,f,f,f,f,f,f,f,f,f,f,f,t,f,f,t,0.6,t,2.2,t,123,t,0.93,t,132,f,SVI,P
68,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83,t,0.89,t,93,f,SVI,P
84,F,f,f,f,f,f,f,f,f,f,f,f,t,f,f,t,1.1,t,2.2,t,115,t,0.95,t,121,f,SVI,P
67,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,?,t,152,t,0.99,t,153,f,other,P
71,F,f,f,f,t,f,f,f,f,t,f,f,f,f,f,t,0.03,t,3.8,t,171,t,1.13,t,151,f,other,P
59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.8,t,1.7,t,97,t,0.91,t,107,f,SVI,P
28,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,3.3,t,1.8,t,109,t,0.91,t,119,f,SVHC,P
65,F,f,f,f,f,f,f,f,t,f,f,f,f,f,f,t,12,f,?,t,99,t,1.14,t,87,f,other,N
42,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.2,t,1.8,t,70,t,0.86,t,81,f,other,P
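63,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.5,t,1.2,t,117,t,0.96,t,121,f,SVI,P

import pandas as pd
from synthcity.plugins import Plugins

# "?" marks missing values in input_data.csv; it is read as an ordinary string
X = pd.read_csv("input_data.csv")

# Fit the CTGAN plugin directly on the frame, without any imputation
syn_model = Plugins().get("ctgan")
syn_model.fit(X)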


robsdavis commented on May 28, 2024

Sorry, in my reply above I was assuming that missing data was represented by NaN values (which is required for the error above to be thrown). You are correct. If you use a missingness label such as "?", then numerical columns are type cast to "object". This means that when they are encoded in any of the models' fit methods (e.g. for the tabular GAN), columns with missing values are treated as categorical, with "?" as one of the categories. Treating the missing values as just another category in a column is not desirable. So it should be made more obvious somewhere that datasets with missing values are not inherently supported by the generative methods and that you need to impute them separately first. Your suggested tutorial would help, but I may also add a comment somewhere that is harder for users to miss.
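
The type cast can be reproduced with pandas alone. The sketch below uses three rows and four columns excerpted from the CSV earlier in the thread. Passing `na_values="?"` to `pd.read_csv` is one way (an illustration, not something synthcity does for you) to turn the label into NaN, so that numeric columns keep a numeric dtype and the missingness becomes visible for imputation.

```python
import io

import pandas as pd

# Excerpt of input_data.csv (rows with age 41, 46, 59), with "?" as the missingness label.
csv_text = """age,sex,TSH,T3
41,F,1.3,2.5
46,M,0.98,?
59,F,?,?
"""

# Read as-is: "?" is an ordinary string, so TSH and T3 are not parsed as numbers
# and no NaN appears anywhere in the frame.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)
print(raw.isna().sum().sum())

# Declare "?" as a missing-value marker: TSH and T3 become float columns with NaN.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(clean.dtypes)
print(clean.isna().sum())
```

With the second form, fitting directly would hit the NaN check instead of silently learning "?" as a category, which makes the need for imputation explicit.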


dionman commented on May 28, 2024

Thanks for clarifying! Which numeric types are currently supported? Is there support for int, or should all numerical columns be cast to float?


robsdavis commented on May 28, 2024

No problem. And nope, there is no need to cast to float. Synthcity is designed to handle int, uint, float, and datetime numerical data types.
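
To see which columns of a frame fall outside those numeric dtypes before fitting, one option is to inspect each column's dtype kind. This is a sketch under assumptions: `non_numeric_columns` is a hypothetical helper rather than a synthcity function, and the `n_visits` and `visit` columns are made up for illustration.

```python
import pandas as pd

# NumPy dtype kinds: i = signed int, u = unsigned int, f = float, M = datetime64.
NUMERIC_KINDS = set("iufM")


def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the columns whose dtype is not int, uint, float, or datetime."""
    return [col for col in df.columns if df[col].dtype.kind not in NUMERIC_KINDS]


X = pd.DataFrame(
    {
        "age": pd.Series([41, 23], dtype="int64"),
        "n_visits": pd.Series([1, 2], dtype="uint8"),  # hypothetical column
        "TSH": [1.3, 4.1],
        "visit": pd.to_datetime(["2024-01-05", "2024-02-10"]),  # hypothetical column
        "T4U": ["1.14", "?"],  # the "?" label forces a string column
    }
)

print(non_numeric_columns(X))  # ['T4U']
```

A column that should be numeric but shows up in this list (here `T4U`) is a hint that a missingness label is still present in the data.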

