vanderschaarlab / synthcity

A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.

Home Page: https://www.vanderschaar-lab.com/

License: Apache License 2.0

Python 81.25% Jupyter Notebook 18.75%
pytorch tabular-data privacy machine-learning generative-model data-augmentation fairness-ml synthetic-data

synthcity's People

Contributors

2045ga, bcebere, bvanbreugel, dependabot[bot], drshushen, eltociear, gsel9, hlasse, pravsels, robsdavis, seedatnabeel, vholstein, zhaozhiqian


synthcity's Issues

Import rdt fail

Running the first commands in the README:

from synthcity.plugins import Plugins
Plugins(categories=["generic"]).list()

results in the error:

cannot import name 'ClusterBasedNormalizer' from 'rdt.transformers'

Bayesian network

I checked all the mainstream Bayesian network libraries in Python but none of them supports continuous or mixed data types.

Hence, I propose to do the following:

  1. Discretize the continuous variables, e.g. using sklearn's KBinsDiscretizer.
  2. Fit the BN on the discretized data.
  3. During sampling, first generate the discrete bin id using the BN, then randomly sample a continuous value within the bin's range (see the sketch below).
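
A minimal sketch of steps 1 and 3, assuming sklearn's KBinsDiscretizer (not synthcity code; the BN fit in step 2 is omitted):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))  # toy continuous column

# Step 1: discretize into ordinal bin ids.
disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")
bins = disc.fit_transform(X).astype(int)

# Step 2 (omitted): fit the Bayesian network on `bins`.

# Step 3: given a bin id sampled from the BN, draw a continuous value
# uniformly within that bin's edges.
edges = disc.bin_edges_[0]
sampled_bin = 3  # pretend this came from the BN
value = rng.uniform(edges[sampled_bin], edges[sampled_bin + 1])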

[Bug] IntegerDistribution returns float

Sampling from an IntegerDistribution returns a float type. This causes batch_size to be a float, which triggers an exception when running tvae.

from synthcity.plugins import Plugins

syn_model = Plugins().get("tvae")
params = syn_model.hyperparameter_space()
param_val = [x.sample()[0] for x in params]
param_name = [x.name for x in params]

param_dict = dict(zip(param_name, param_val))
isinstance(param_dict['batch_size'], int)

returns False.

Passing the float batch size triggers the following exception when running tvae:

[2022-06-08T19:50:33.558025+0000][297][CRITICAL] [tvae][param 19][take 0] failed: batch_size should be a positive integer value, but got batch_size=150

[Install] pytorch_wavelets dependency

This is a low priority issue. Fix it only when you have time.

Synthcity now depends on the library pytorch_wavelets. This library cannot be installed automatically via pip; instead, one has to download it from GitHub and install it from the source directory. This might make installation difficult for new users.

Is there a possible workaround? If not, we need to update the installation guide.
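
One possible workaround, assuming the upstream repository is fbcotter/pytorch_wavelets with a standard setup.py, is to point pip at GitHub directly:

pip install git+https://github.com/fbcotter/pytorch_wavelets.git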

Also, please add PyWavelets to the dependencies.

Input format for time series data

Question

Which input format is required for time series data?

Further Information

Dear SynthCity developers, I really like your work and wanted to test the package on my own time series dataset. I have a phone dataset consisting of passively sensed data, sampled daily, with some days missing for some individuals. The number of days of collected data varies between individuals. To familiarize myself with the required input format, I went through the PBC dataset.

loader = TimeSeriesDataLoader(
    temporal_data=temporal,
    observation_times=temporal_horizons,
    outcome=outcome_surv,
    static_data=static_surv,
)

As far as I understand, temporal_data is a list of variable-length dataframes containing the variables of interest, with time as the index column. observation_times is a list of lists with the timestamps for each observation. outcome is a tuple of two series of outcomes, and static_data is just a dataframe.

If I understand correctly, I'd have to split the temporal features into multiple dataframes, make the timestamps the index, and put these in a list. Then I'd generate lists of the timestamps for each dataframe, add them to the list of observation times, and select the outcome and static features with the same ordering as the two lists (sketched below). Before I mess up the analysis, is there anything I'm missing here?
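
For concreteness, a minimal sketch of that preparation (not an official example; the column names "id" and "time" and the toy data are made up):

import pandas as pd

# Toy flat dataframe; "id" and "time" are hypothetical column names.
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "time": [0, 1, 3, 0, 2],
    "hr":   [70, 72, 75, 80, 78],
})

temporal_data, observation_times = [], []
for _, group in df.groupby("id"):
    frame = group.set_index("time").drop(columns=["id"])
    temporal_data.append(frame)                   # one dataframe per individual
    observation_times.append(list(frame.index))   # matching timestamps
# outcome and static_data must then be ordered identically to temporal_data.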

If this works out I'd be willing to write a short tutorial on this - could help other labs import their own data.

Integrate jaxtyping for advanced parameter validation

Description

Right now, synthcity uses pydantic to validate the parameters of various functions.

An improvement on top of that would be to integrate jaxtyping, which allows validating tensor shapes as well. jaxtyping supports PyTorch tensors and numpy arrays.

Example

from jaxtyping import Array, Float, PyTree

# Accepts floating-point 2D arrays with matching dimensions
def matrix_multiply(x: Float[Array, "dim1 dim2"],
                    y: Float[Array, "dim2 dim3"]
                  ) -> Float[Array, "dim1 dim3"]:
    ...

def accepts_pytree_of_ints(x: PyTree[int]):
    ...

def accepts_pytree_of_arrays(x: PyTree[Float[Array, "batch c1 c2"]]):
    ...

https://github.com/google/jaxtyping

Dataloader train_size argument not passed

Custom dataloaders (e.g. GenericDataLoader) do not pass train_size to the DataLoader initialisation (e.g. "train_size=train_size," is missing in line 258, etc.), so dataloaders always use the default train_size=0.8.

[Notebook] Benchmark argument change

Low priority issue related to notebooks.

In commit #42 the Benchmarks.evaluate takes

tests: List[Tuple[str, str, dict]], # test name, plugin name, plugin args

But in the notebooks, it takes

plugins: List,

The notebooks need to be updated to reflect this change (an illustrative call is sketched below).
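
For illustration, a hedged sketch of a call under the new signature; the import path, the loader argument, and the plugin arguments are assumptions, not confirmed API:

from synthcity.benchmark import Benchmarks

score = Benchmarks.evaluate(
    [
        # (test name, plugin name, plugin args)
        ("ctgan_default", "ctgan", {}),
        ("tvae_small", "tvae", {"n_iter": 100}),  # "n_iter" is an assumed plugin arg
    ],
    loader,  # a DataLoader holding the reference data (assumed positional argument)
)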

Can't suppress warnings when evaluating xgb performance

When evaluating the xgb performance metric for the dpgan and pategan synthetic models, the console is spammed with warnings from xgbse. warnings.filterwarnings("ignore") does not suppress them.

Here's the code I'm running.

from synthcity.metrics import Metrics
from synthcity.utils import serialization

syn_model = serialization.load_from_file("some_saved_dpgan_model.bkp")
selected_metrics = {
    'performance': ['xgb'],
}
my_metrics = Metrics()
selected_metrics_in_my_metrics = {k: my_metrics.list()[k] for k in my_metrics.list().keys() & selected_metrics.keys()}
X_syn = syn_model.generate(count=6882)
evaluation = my_metrics.evaluate(
    loader,  # the DataLoader holding the real data
    X_syn,
    task_type="survival_analysis",
    metrics=selected_metrics_in_my_metrics,
    workspace="workspace",
)

Early stopping

A question that we are almost certain to get is how to set the number of training iterations.

I propose to implement an early stopping mechanism that the user can choose to enable. The user supplies a dictionary of {metric: weight}; we calculate the weighted sum of several metrics (e.g. 0.8 * MMD + 0.2 * performance) and do early stopping on that (the user also specifies a patience parameter and so on).
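
A minimal sketch of such a mechanism (the class name and the lower-is-better convention are assumptions, not synthcity API):

class WeightedEarlyStopping:
    def __init__(self, weights: dict, patience: int = 10):
        self.weights = weights            # e.g. {"mmd": 0.8, "performance": 0.2}
        self.patience = patience
        self.best = float("inf")
        self.bad_rounds = 0

    def step(self, metrics: dict) -> bool:
        """Return True when training should stop (assumes lower scores are better)."""
        score = sum(w * metrics[name] for name, w in self.weights.items())
        if score < self.best:
            self.best, self.bad_rounds = score, 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience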

[Plugin] Saving generative models

Hi Bogdan, what's the best way to save a trained generator?

I tried pickle on CT-GAN but it raises an error:

_pickle.PicklingError: Can't pickle <class 'plugin_ctgan.py.CTGANPlugin'>: import of module 'plugin_ctgan.py' failed

Do you think we can add save (and load) methods to the plugin class?
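
As a possible interim workaround (a sketch; whether it covers every plugin is unclear, see the PicklingError issue further down), cloudpickle tends to handle dynamically loaded classes better than the stdlib pickle:

import cloudpickle
from synthcity.plugins import Plugins

syn_model = Plugins().get("ctgan")
# syn_model.fit(loader)  # fit on your data loader first

with open("ctgan_model.bkp", "wb") as f:
    cloudpickle.dump(syn_model, f)

with open("ctgan_model.bkp", "rb") as f:
    restored = cloudpickle.load(f)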

Progress bar and logging during training

The training procedure can take a long time. We should add a progress bar or some logging messages during training to keep the user informed. The user can control the verbosity of the messages by changing the logging level.
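
A minimal sketch of the idea (not synthcity code), combining a tqdm bar with log messages gated by the logging level:

import logging
from tqdm import tqdm

logger = logging.getLogger("synthcity")   # hypothetical logger name
logging.basicConfig(level=logging.INFO)   # the user raises/lowers verbosity here

for epoch in tqdm(range(100), desc="training"):
    loss = 1.0 / (epoch + 1)              # placeholder for the real training step
    logger.debug("epoch %d: loss=%.4f", epoch, loss)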

Checking directory exists before saving to file

Description

The save_to_file function (utils/serialization.py) does not check whether the file's directory exists. When it does not, the call raises a FileNotFoundError. The improvement is to add that check and create the directory if it does not exist before writing to the file.

Are you interested in working on this improvement yourself?

  • Yes, I am.

Additional Context

Note the directory 'saved_models/' does not exist.

 19 def save_to_file(path: Union[str, Path], model: Any) -> Any:
---> 20     with open(path, "wb") as f:
     21         return cloudpickle.dump(model, f)

FileNotFoundError: [Errno 2] No such file or directory: 'saved_models/XXX.bkp'
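
For illustration, a sketch of the proposed check on top of the cloudpickle-based implementation shown above (not the actual patch):

from pathlib import Path
from typing import Any, Union

import cloudpickle

def save_to_file(path: Union[str, Path], model: Any) -> Any:
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing directories
    with open(path, "wb") as f:
        return cloudpickle.dump(model, f)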

Error in fitting privbayes on categorical data

I'm hitting an error when fitting privbayes on a dataset containing both numerical fields and categorical text fields. I do not seem to hit the same error for a dataset solely comprised of numerical data.

The code:

import pandas as pd

from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import GenericDataLoader

X = pd.read_csv("...")  # Read a csv file containing both numerical fields and categorical text fields.
loader = GenericDataLoader(X, target_column="some_column", sensitive_features=["some_sensitive_columns"])
syn_model = Plugins().get("privbayes")
syn_model.fit(loader)

Here's the traceback:
"""
Traceback (most recent call last):
File "tutorials/privbayes_error.py", line 29, in
syn_model.fit(loader)
File "pydantic/decorator.py", line 40, in pydantic.decorator.validate_arguments.validate.wrapper_function
from contextlib import _GeneratorContextManager
File "pydantic/decorator.py", line 134, in pydantic.decorator.ValidatedFunction.call

File "pydantic/decorator.py", line 206, in pydantic.decorator.ValidatedFunction.execute

File "/home/rob/miniconda3/envs/synthcity/lib/python3.8/site-packages/synthcity/plugins/core/plugin.py", line 183, in fit
return self._fit(X, *args, **kwargs)
File "/home/rob/miniconda3/envs/synthcity/lib/python3.8/site-packages/synthcity/plugins/generic/plugin_privbayes.py", line 576, in _fit
self.model.fit(X.dataframe())
File "pydantic/decorator.py", line 40, in pydantic.decorator.validate_arguments.validate.wrapper_function
from contextlib import _GeneratorContextManager
File "pydantic/decorator.py", line 134, in pydantic.decorator.ValidatedFunction.call

File "pydantic/decorator.py", line 206, in pydantic.decorator.ValidatedFunction.execute

File "/home/rob/miniconda3/envs/synthcity/lib/python3.8/site-packages/synthcity/plugins/generic/plugin_privbayes.py", line 109, in fit
self.dag = self._greedy_bayes(data)
File "pydantic/decorator.py", line 40, in pydantic.decorator.validate_arguments.validate.wrapper_function
from contextlib import _GeneratorContextManager
File "pydantic/decorator.py", line 134, in pydantic.decorator.ValidatedFunction.call

File "pydantic/decorator.py", line 206, in pydantic.decorator.ValidatedFunction.execute

File "/home/rob/miniconda3/envs/synthcity/lib/python3.8/site-packages/synthcity/plugins/generic/plugin_privbayes.py", line 212, in _greedy_bayes
) = self._evaluate_parent_mutual_information(
File "pydantic/decorator.py", line 40, in pydantic.decorator.validate_arguments.validate.wrapper_function
from contextlib import _GeneratorContextManager
File "pydantic/decorator.py", line 134, in pydantic.decorator.ValidatedFunction.call

File "pydantic/decorator.py", line 206, in pydantic.decorator.ValidatedFunction.execute

File "/home/rob/miniconda3/envs/synthcity/lib/python3.8/site-packages/synthcity/plugins/generic/plugin_privbayes.py", line 430, in _evaluate_parent_mutual_information
score = self.mutual_info_score(data, parents, candidate)
File "pydantic/decorator.py", line 40, in pydantic.decorator.validate_arguments.validate.wrapper_function
from contextlib import _GeneratorContextManager
File "pydantic/decorator.py", line 134, in pydantic.decorator.ValidatedFunction.call

File "pydantic/decorator.py", line 206, in pydantic.decorator.ValidatedFunction.execute

File "/home/rob/miniconda3/envs/synthcity/lib/python3.8/site-packages/synthcity/plugins/generic/plugin_privbayes.py", line 451, in mutual_info_score
target_bins, _ = pd.cut(target, bins=self.n_bins, retbins=True)
File "/home/rob/miniconda3/envs/synthcity/lib/python3.8/site-packages/pandas/core/reshape/tile.py", line 259, in cut
mn, mx = (mi + 0.0 for mi in rng)
File "/home/rob/miniconda3/envs/synthcity/lib/python3.8/site-packages/pandas/core/reshape/tile.py", line 259, in
mn, mx = (mi + 0.0 for mi in rng)
TypeError: can only concatenate str (not "float") to str
"""

[Bug] PicklingError for several plugins

The save/load utility does not work for nflow, adsgan, privbayes, pategan, and rtvae plugins.

from synthcity.plugins import Plugins
from synthcity.utils.serialization import save_to_file

syn_model = Plugins().get("rtvae")
save_to_file('temp.pkl', syn_model)

raises an exception:

PicklingError: Can't pickle <cyfunction RTVAEPlugin.__init__ at 0x7ff2d1faf6c0>: import of module 'plugin_rtvae.py' failed

[Install] PyWavelets

Please add PyWavelets (pywt) to the dependencies:

pip install PyWavelets

Note that this is different from the pytorch_wavelets library that is already included in the dependencies. Thanks.

Add fairness to the metrics collection

I suggest adding various metrics for evaluating potential bias in the synthetic data with respect to groups of entities from a protected category (e.g. gender, age, race, location).
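
A minimal sketch of one such measure (not an existing synthcity metric): the demographic parity gap of predictions across a protected group column:

import numpy as np

def demographic_parity_gap(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Absolute difference in positive-prediction rates across groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return float(max(rates) - min(rates))

y_pred = np.array([1, 0, 1, 1, 0, 1])
group = np.array(["a", "a", "a", "b", "b", "b"])
print(demographic_parity_gap(y_pred, group))  # 0.0 would indicate parity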

[Metrics, Bug?] detection.detection_xgb

The metric detection.detection_xgb is always > 90% for all datasets and all methods in the Jupyter notebook (except for bayesian_network). The value is very high compared to detection.detection_mlp and detection.detection_gmm.

This is quite odd. Could you please take a look at it? What happens if you pass a subset of the real data as synthetic? In principle, this should give us around 50%. Thanks.
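
A hedged sketch of that check, reusing the evaluation call pattern from the xgb-warnings issue above; the metrics-dict format and argument order are assumptions:

import numpy as np
import pandas as pd
from synthcity.metrics import Metrics
from synthcity.plugins.core.dataloader import GenericDataLoader

X = pd.DataFrame(np.random.normal(size=(1000, 4)), columns=list("abcd"))
real = GenericDataLoader(X.iloc[:500])
fake_syn = GenericDataLoader(X.iloc[500:])   # "synthetic" data that is actually real
evaluation = Metrics().evaluate(
    real,
    fake_syn,
    metrics={"detection": ["detection_xgb"]},  # assumed metrics-dict format
)
# A detection score near 0.5 would mean the metric cannot tell the two apart.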

[Model] Bayesian Networks

Please take a look at these two Python libraries for Bayesian Networks. Let's discuss in more detail at next week's catch-up.

bnlearn
pgmpy

bnlearn is built on top of pgmpy. Both have many stars on GitHub and are actively maintained.

We need to have some Bayesian Network models in the library.
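
For reference, a minimal pgmpy sketch of fitting and sampling a discrete BN (recent pgmpy versions; not synthcity code):

import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.sampling import BayesianModelSampling

data = pd.DataFrame({"a": [0, 1, 0, 1, 1], "b": [1, 1, 0, 0, 1]})
model = BayesianNetwork([("a", "b")])        # structure: a -> b
model.fit(data, estimator=MaximumLikelihoodEstimator)
samples = BayesianModelSampling(model).forward_sample(size=5)
print(samples)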

[Metrics] Review inlier/outlier metrics

Calls: evaluate_inlier_probability, evaluate_outlier_probability

The current implementation might be confusing.

The reference is "Generating high-fidelity synthetic patient data for assessing machine learning healthcare software", section "Detecting re-identification risks using outlier analysis with distance metrics".
