Comments (5)
Hi @dionman, synthcity currently assumes that the data has no missing values, so fitting a tabular model on a dataset that contains them will throw an error, something like ValueError: Input X contains NaN.
The library therefore expects you to handle missingness with third-party methods (such as HyperImpute, as you suggest) before fitting a model on the dataloader.
I agree a tutorial on combining with HyperImpute would be very useful.
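As a rough illustration of "deal with the missingness first", here is a minimal pandas-only sketch. It is a deliberately simple stand-in (median for numeric columns, most frequent value otherwise), not HyperImpute and not anything synthcity does internally. The helper name `simple_impute` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def simple_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs: median for numeric columns, most frequent value otherwise.

    Hypothetical helper for illustration only; a dedicated imputer is
    likely a better choice for real data.
    """
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores NaN; assumes at least one observed value
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy data with missing values encoded as NaN / None
X = pd.DataFrame(
    {
        "age": [41, 23, 46, 70],
        "TSH": [1.3, 4.1, np.nan, 0.16],
        "sex": ["F", "F", "M", None],
    }
)
X_imputed = simple_impute(X)
print(X_imputed)
```

The imputed frame no longer contains NaN, so it should be safe to hand to a plugin's fit method in place of the raw data.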
from synthcity.
I'm pretty sure the below won't throw an error. Is the column type casting due to the missingness symbol undesirable?
input_data.csv
age,sex,on thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4U,FTI measured,FTI,TBG measured,referral source,binaryClass
41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125,t,1.14,t,109,f,SVHC,P
23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2,t,102,f,?,f,?,f,other,P
46,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,?,t,109,t,0.91,t,120,f,other,P
70,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175,f,?,f,?,f,other,P
70,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61,t,0.87,t,70,f,SVI,P
18,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,?,t,183,t,1.3,t,141,f,other,P
59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,t,72,t,0.92,t,78,f,other,P
80,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80,t,0.7,t,115,f,SVI,P
66,F,f,f,f,f,f,f,f,f,f,f,f,t,f,f,t,0.6,t,2.2,t,123,t,0.93,t,132,f,SVI,P
68,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83,t,0.89,t,93,f,SVI,P
84,F,f,f,f,f,f,f,f,f,f,f,f,t,f,f,t,1.1,t,2.2,t,115,t,0.95,t,121,f,SVI,P
67,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,?,t,152,t,0.99,t,153,f,other,P
71,F,f,f,f,t,f,f,f,f,t,f,f,f,f,f,t,0.03,t,3.8,t,171,t,1.13,t,151,f,other,P
59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.8,t,1.7,t,97,t,0.91,t,107,f,SVI,P
28,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,3.3,t,1.8,t,109,t,0.91,t,119,f,SVHC,P
65,F,f,f,f,f,f,f,f,t,f,f,f,f,f,f,t,12,f,?,t,99,t,1.14,t,87,f,other,N
42,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.2,t,1.8,t,70,t,0.86,t,81,f,other,P
63,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.5,t,1.2,t,117,t,0.96,t,121,f,SVI,P
from synthcity.plugins import Plugins
import pandas as pd
syn_model = Plugins().get("ctgan")
X = pd.read_csv("input_data.csv")
syn_model.fit(X)
Sorry, my reply above assumed that missing data was represented by NaN values (which is what triggers the error I quoted). You are correct: if you use a missingness label such as "?", numerical columns are cast to "object" dtype. When the data is then encoded in any of the models' fit methods (e.g. here for tabular gan), columns with missing values are treated as categorical, with "?" as one of the categories. Treating missing values as just another category in a column is not desirable. So it should be made more obvious somewhere that the generative methods do not inherently support datasets with missing values and that you need to impute them separately first. Your suggested tutorial would help, but I may also add a note somewhere that is harder for users to miss.
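The type-casting effect can be seen with pandas alone. The sketch below reads a few columns in the style of input_data.csv from an in-memory string (the four rows are a made-up excerpt), once as-is and once with `na_values="?"`, so that the "?" label becomes NaN and the numeric columns stay numeric.

```python
import io

import pandas as pd

# Made-up excerpt in the style of input_data.csv, with "?" as the missing label
csv_text = """age,sex,TSH,T3
41,F,1.3,2.5
46,M,0.98,?
59,F,?,?
42,?,1.2,1.8
"""

# Default parsing: "?" is kept as text, so TSH and T3 become non-numeric columns
raw = pd.read_csv(io.StringIO(csv_text))

# Declaring "?" as a missing-value marker keeps TSH and T3 as floats with NaN
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw.dtypes)
print(parsed.dtypes)
print(parsed.isna().sum())
```

With the second form the missing entries are explicit NaNs, which can then be imputed before the data is passed to a model.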
Thanks for clarifying! Which numeric types are currently supported? Is there support for int, or should all numerical columns be cast to float?
No problem. And nope, there is no need to cast to float. Synthcity is designed to handle int, uint, float, and datetime numerical data types.
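Since numeric dtypes are handled but a stray missingness label silently turns a numeric column into text, a quick pre-fit check can help. The sketch below is a hypothetical helper (not part of synthcity) that flags non-numeric columns whose values mostly parse as numbers. The 0.5 threshold is an arbitrary assumption.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def mostly_numeric_text_columns(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """Return non-numeric columns whose values mostly parse as numbers.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    flagged = []
    for col in df.columns:
        if is_numeric_dtype(df[col]) or is_datetime64_any_dtype(df[col]):
            continue  # already a numeric or datetime column
        # Values that cannot be parsed as numbers become NaN
        parsed = pd.to_numeric(df[col], errors="coerce")
        if parsed.notna().mean() > threshold:
            flagged.append(col)
    return flagged


# Toy frame: TSH was read as text because of a "?" placeholder
X = pd.DataFrame(
    {
        "age": [41, 23, 46, 70],
        "TSH": ["1.3", "4.1", "?", "0.16"],
        "sex": ["F", "F", "M", "F"],
    }
)
flagged = mostly_numeric_text_columns(X)
print(flagged)
```

Genuinely categorical columns such as sex are left alone, while a column like TSH is reported so it can be re-read with the placeholder mapped to NaN and imputed.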