
ydataai / ydata-synthetic


Synthetic data generators for tabular and time-series data

Home Page: https://docs.synthetic.ydata.ai

License: MIT License

Python 1.30% Makefile 0.01% Shell 0.01% Jupyter Notebook 98.68%
gan-architectures gan deep-learning synthetic-data tensorflow2 machine-learning training-data python3 datagenerator datageneration

ydata-synthetic's Introduction


Join us on Discord

YData Synthetic

A package to generate synthetic tabular and time-series data leveraging state-of-the-art generative models.

🎊 The exciting features:

These are must-try features for synthetic data generation:

  • A new Streamlit app that delivers the synthetic data generation experience through a UI, as a low-code way to quickly generate synthetic data
  • A new, fast synthetic data generation model based on Gaussian Mixture models, so you can get started with synthetic data generation without needing a GPU
  • A conditional architecture for tabular data, CTGAN, which makes synthetic data generation easier and of higher quality!

Synthetic data

What is synthetic data?

Synthetic data is artificially generated data that is not collected from real-world events. It replicates the statistical properties of real data without containing any identifiable information, ensuring individuals' privacy.

Why Synthetic Data?

Synthetic data can be used for many applications:

  • Privacy compliance for data-sharing and Machine Learning development
  • Remove bias
  • Balance datasets
  • Augment datasets

Looking for an end-to-end solution to Synthetic Data Generation?
YData Fabric enables the generation of high-quality datasets within a full UI experience, from data preparation to synthetic data generation and evaluation.
Check out the Community Version.

ydata-synthetic

This repository contains material related to architectures and models for synthetic data, from Generative Adversarial Networks (GANs) to Gaussian Mixtures. The repo includes a full ecosystem for synthetic data generation, with different models for generating synthetic structured (tabular) data and time series. All the deep learning models are implemented with TensorFlow 2. Several example Jupyter notebooks and Python scripts are included to show how to use the different architectures.

Are you ready to learn more about synthetic data and the best practices for synthetic data generation?

Quickstart

The source code is currently hosted on GitHub at: https://github.com/ydataai/ydata-synthetic

Binary installers for the latest released version are available at the Python Package Index (PyPI).

pip install ydata-synthetic
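
Once installed, the typical flow is to train a synthesizer on a pandas DataFrame and then sample new records from it. Below is a minimal sketch of that flow; the class and argument names (RegularSynthesizer, ModelParameters, TrainParameters, the column lists) follow the current documentation but may differ between versions, so treat it as illustrative rather than exact:

import pandas as pd
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# Load any tabular dataset as a pandas DataFrame (path and column names are hypothetical)
data = pd.read_csv("adult.csv")
num_cols = ["age", "hours-per-week"]      # numerical columns (illustrative)
cat_cols = ["workclass", "education"]     # categorical columns (illustrative)

# Configure and train a CTGAN-based synthesizer
synth = RegularSynthesizer(modelname="ctgan", model_parameters=ModelParameters(batch_size=500))
synth.fit(data=data, train_arguments=TrainParameters(epochs=300), num_cols=num_cols, cat_cols=cat_cols)

# Sample new synthetic records
synthetic_df = synth.sample(1000)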

The UI guide for synthetic data generation

YData Synthetic now has a UI to guide you through the steps and inputs needed to generate structured tabular data. The Streamlit app is available from v1.0.0 onwards and supports the following flows:

  • Train a synthesizer model
  • Generate & profile synthetic data samples

Installation

pip install ydata-synthetic[streamlit]

Quickstart

Use the code snippet below in a Python file (Jupyter notebooks are not supported):

from ydata_synthetic import streamlit_app

streamlit_app.run()

Or use the file streamlit_app.py that can be found in the examples folder.

python -m streamlit_app

The following models are supported:

  • CGAN
  • WGAN
  • WGANGP
  • DRAGAN
  • CRAMER
  • CTGAN

Watch the video

Examples

Here you can find usage examples of the package and its models to synthesize tabular and time-series data.

  • Fast tabular data synthesis on adult census income dataset Open in Colab
  • Tabular synthetic data generation with CTGAN on adult census income dataset Open in Colab
  • Time Series synthetic data generation with TimeGAN on stock dataset Open in Colab
  • Time Series synthetic data generation with DoppelGANger on FCC MBA dataset Open in Colab
  • More examples are continuously added and can be found in /examples directory.

Datasets for you to experiment

Here are some example datasets for you to try with the synthesizers:

Tabular datasets

Sequential datasets

Project Resources

In this repository you can find the several GAN architectures used to create the synthesizers:

Tabular data

Sequential data

Contributing

We are open to collaboration! If you want to start contributing, you only need to:

  1. Search for an issue you would like to work on. Issues for newcomers are labeled with good first issue.
  2. Create a PR solving the issue.
  3. We will review every PR and either accept it or ask for revisions.

Support

For support in using this library, please join our Discord server. Our Discord community is very friendly and great about quickly answering questions about the use and development of the library. Click here to join our Discord community!

FAQs

Have a question? Check out the Frequently Asked Questions about ydata-synthetic. If you feel something is missing, feel free to book a beary informal chat with us.

License

MIT License

ydata-synthetic's People

Contributors

aquemy, archity, arunnthevapalan, ceshine, cptanalatriste, crownpku, dependabot[bot], fabclmnt, fanconic, gmartinsribeiro, jfsantos-ds, ljmatkins, mglcampos, miriamspsantos, portellaa, rajeshai, renovate[bot], ricardodcpereira, strickvl, ubabe53, vascoalramos


ydata-synthetic's Issues

[BUG] Examples imports of YData lib are not up to date.

Describe the bug
Import errors due to outdated imports in the ydata examples.

To Reproduce
Run any file from the examples folder; the imports will fail.

Expected behavior

from ydata_synthetic.synthesizers.regular import *
from ydata_synthetic.synthesizers.timeseries import *

Screenshots
(screenshot omitted)

[BUG] Warning in TimeGAN training when seq_len does not match hidden_dim

Hello,

I tried running the TimeGAN example, but setting the seq_len to be different from the hidden_dim generates a warning during training. This seems like a bug. Please see the attached snapshots.
TimeGANParameters
TimeGANTraining

I am still learning about TimeGAN, but reviewing define_gan() in model.py, I noticed that the embedding network has input shape (hidden_dim x n_seq) while the discriminator has input shape (seq_len x n_seq). Shouldn't these be swapped?
TimeGANModelCode

Appreciate your help.

Thanks & Best Regards,
Ajay.

[Question] TimeGAN postprocessing of generated data

A question about TimeGAN postprocessing.
I'm using TimeGAN to experiment with generation of stock returns of a handful of stocks simultaneously, and I'm using a sequence length of 24 following the seminal paper by Yoon 2019.

After successfully running TimeGAN on the input data I end up with a multi-dimensional array of shape (5000, 24, 10) where 5000 is the generated length, 24 is the sequence length and 10 is the number of stocks.

Now I want to take the generated sequences and produce a matrix of shape (x, 10), where x is the resulting matrix length, so that I can use it in my subsequent experiments. How do I convert (5000, 24, 10) to (x, 10)? Do I just reshape the array, or is there a better way?
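
For reference, a plain numpy reshape collapses the batch and sequence axes into one (a sketch, assuming the generated array is called synth_data); note that each window is generated independently, so the reshaped rows do not form one continuous time series:

import numpy as np

synth_data = np.random.rand(5000, 24, 10)              # stand-in for the TimeGAN output
flat = synth_data.reshape(-1, synth_data.shape[-1])    # shape (5000 * 24, 10)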

[FEAT] auxiliary processing functions

Support synthesizers with a DataProcessor class for synth training and inverse transforms for the synthetic samples.
The synthesizers' sample method should apply the DataProcessor's inverse_transform to the synthetic data so that it is returned in the original data format.
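
A rough sketch of the intended sample flow (method and attribute names are illustrative, not a final API):

def sample(self, n_samples):
    # Generate records in the processed (scaled / one-hot encoded) space
    raw = self._sample_from_generator(n_samples)   # hypothetical internal helper
    # Map them back to the original column types and ranges
    return self.processor.inverse_transform(raw)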

[BUG] CGAN model passes class column as data

The CGAN model expects the class to be defined in the data array. The class column is not filtered out before being supplied to the CGAN.

Example

Consider a dataset with 10 parameters and 1 class (0 or 1). The data within synthesizer.train(data=train_data, train_arguments=train_args) is an array of n_samples x 11. The entire dataset, including the class column, is provided to the GAN as input data, even though the class is handled separately as the label.

            batch_x = self.get_data_batch(data, self.batch_size)
            label = batch_x[:, train_arguments.label_dim]
            d_loss_real = self.discriminator.train_on_batch([batch_x, label], valid) 

In the lines above, batch_x already contains the label column; it should be removed before being passed as feature data.
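
A possible fix, sketched against the snippet above (assumes numpy is imported as np and that train_arguments.label_dim is the index of the class column, as in the existing code):

batch_x = self.get_data_batch(data, self.batch_size)
label = batch_x[:, train_arguments.label_dim]
# Drop the class column before feeding the batch to the discriminator as data
features = np.delete(batch_x, train_arguments.label_dim, axis=1)
d_loss_real = self.discriminator.train_on_batch([features, label], valid)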

[FEAT] Create an interface layer connecting categorical distributions to specific Gumbel-Softmax layers

Interface layer implementation

To process the logits outputs of a Generator correctly, information about the type of features being synthesized is necessary.
The idea of this layer is to leverage the information available in the synthesizer's DataProcessor to create Gumbel-Softmax activations for each of the categorical distributions, TanH/ReLU activations for the other features, and to concatenate all outputs in the original order.
In the end this layer should be initialized leveraging just attributes of the processor and work as a model component like any other TF/Keras layer.

Requirements:

  • TF implementation
  • Gumbel-Softmax layer implementation (see below)
  • Tests

Gumbel-Softmax layer implementation

A Gumbel-Softmax layer works on logits or probability outputs of a model over the distribution of a categorical feature.
The required computations of this layer's forward method are the following:

  • Soft sample: Softmax of a tensor consisting of a Gumbel sample added to logits. The gradients of this computation are stored.
  • Hard sample: A one-hot encoded categorical variable with a stochastically sampled class activated. The gradients of this computation are not stored, but used only for obtaining real categorical samples.

Note: Generally only the hard sample should be returned in the forward call; alternatively, both can be returned as a (hard, soft) tuple.

Requirements:

  • TF implementation
  • A standalone custom layer implemented in utils that produces the requested output assuming as input logits from a single categorical distribution
  • Tests
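
A minimal TensorFlow sketch of such a layer, following the description above (the temperature value and class/argument names are illustrative, not the package's implementation):

import tensorflow as tf

class GumbelSoftmax(tf.keras.layers.Layer):
    # Sketch only: Gumbel-Softmax activation for the logits of one categorical feature.
    def __init__(self, tau=0.2, **kwargs):
        super().__init__(**kwargs)
        self.tau = tau  # softmax temperature (illustrative default)

    def call(self, logits):
        # Soft sample: softmax of logits plus Gumbel noise; gradients flow through this path
        uniform = tf.random.uniform(tf.shape(logits), minval=1e-20, maxval=1.0)
        gumbel = -tf.math.log(-tf.math.log(uniform))
        soft = tf.nn.softmax((logits + gumbel) / self.tau)
        # Hard sample: one-hot of the stochastically sampled class; gradients are not kept
        hard = tf.stop_gradient(tf.one_hot(tf.argmax(soft, axis=-1), tf.shape(logits)[-1]))
        return hard, soft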

Create subdomain slack.ydata.ai

Create subdomain slack.ydata.ai pointing to our community Slack invitation.

Create it on Route 53, which currently manages our domain.

Create a monthly recurring task to update the Slack invitation URL - it expires every 30 days for security reasons.

[BUG] Cannot load saved model when using GumbelSoftmaxActivation SD-73

Describe the bug
Hi all, I encountered this error when trying to load a trained model using WGAN_GP.

ValueError: Unknown layer: Synthetic Data>GumbelSoftmaxActivation. Please ensure this object is passed to the custom_objects argument. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details.

To Reproduce
Steps to reproduce the behavior:

  1. Run the example adult_wgangp.py

Expected behavior
It should load the saved model successfully.

Desktop (please complete the following information):

  • OS: Ubuntu Desktop 20.04
  • Cuda version: 11.3
  • tensorflow version: 2.7.0
  • keras version: 2.7.0

[BUG] How to inverse_transform (One-Hot) float number. + ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Hi, I am trying to run your code "adult_wgangp.py".

(1) Problem 1 : Bug
I have an error here "processed_data = pd.DataFrame.sparse.from_spmatrix(preprocessor.fit_transform(processed_data))" ,
and the error is "ValueError: Specifying the columns using strings is only supported for pandas DataFrames".

  • Solution
    I separated the two transforms and obtained the transformed DataFrame (one-hot + StandardScaler).

(2) Problem 2 : How to inverse_transform (One-Hot) float value.

I want to combine the synthetic data with the original minority-class data, and then feed the combined DataFrame (balanced data = synthetic data + original minority data) to a model.
But the code here notes that the returned data is not inverse-processed.
After generation, data_sample holds the generated data, as below.

#Sampling the data
#Note that the data returned it is not inverse processed.
data_sample = synth.sample(100000)

After the transform step, discrete data is converted to a one-hot vector such as [0, 0, 1]; the GAN then generates float values such as [-0.009592, -0.128386, 1.009473].
Because I want to combine the synthetic data with the original minority data, I have to inverse_transform the synthetic data.
But for inverse_transform, the one-hot values must be 0 or 1, not floats like (-0.009592, -0.128386, 1.009473).
I don't know how to turn the float values into 0 or 1 in the discrete columns.
I have searched for many methods, but I still can't solve it.
I am wondering if you could write a notebook to show how to solve this?

Thank you,
Lily
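
One common workaround is to snap each generated one-hot block to a hard 0/1 vector via argmax before calling inverse_transform. A hedged sketch (you need to know which column positions belong to each categorical feature; the indices below are illustrative, not part of the package):

import numpy as np

def harden_one_hot(block):
    # block: generated float values for one one-hot encoded feature, shape (n_samples, n_categories)
    hard = np.zeros_like(block)
    hard[np.arange(block.shape[0]), block.argmax(axis=1)] = 1.0
    return hard

# e.g. if columns 3:6 hold the one-hot block of a single categorical feature:
# synth_array[:, 3:6] = harden_one_hot(synth_array[:, 3:6])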

[BUG] IndexError on loading a saved model

Describe the bug
If I'm running the code on a host without a GPU (e.g., in Google Colab with the runtime set to CPU), I can't load a saved model.
The error is raised in this line, at the load function:
tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)

I don't know if this should be the expected behavior. If so, sorry to bother you.
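
For what it's worth, guarding that call avoids the IndexError on CPU-only hosts (a sketch of the general fix, not necessarily how the package resolves it):

import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:  # only set memory growth when a GPU is actually present
    tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)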

[BUG] Kernel crash when importing TimeGAN in Jupyter

Hi,

I would like to experiment with the time-series GAN in a Jupyter notebook.
I pip installed the package, but when I try to import it, the kernel crashes
(from ydata_synthetic.synthesizers.timeseries import TimeGAN).

I am on an M1 Mac, which is supposed to be pretty powerful. Do you have any idea what the issue is here?

Thanks for the good work !

[BUG] Save/Load won't work due to lambda functions

Describe the bug
While saving the TimeGAN model an error is returned due to the use of lambda functions for the optimizers.

To Reproduce
Steps to reproduce the behavior:

  1. Train the StockGAN example.
  2. Save the trained model

Expected behavior
To be able to save and load the TimeGAN model with no errors.

Screenshots
(screenshot omitted)

[FEAT] How to feed TimeGAN with different input/output array sizes

I have many groups of sequence data, such as stock data. I want to train GOOGL, AAPL, MSFT, ... in one model, but GOOGL/AAPL/MSFT have different sequence lengths. I would like the output to be similarly variable-length sequences.
I want to know how to use this package to solve this type of problem.

[BUG]

First of all, thank you for the work.

Describe the bug
I am trying to train a TimeGAN on my own dataset.
I used the preprocessing function, i.e., the real_data_loading() function from preprocessing.utils, to prepare the data.
All my parameter settings are the same as what you explained in this Jupyter notebook, except for the number of sequences, which I set equal to the number of columns of my dataset.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'https://github.com/Sorooshi/SuperOX'
  2. Click on 'u06-TGAN.ipynb'
  3. Scroll down to 'cell [19]'
  4. See error

Expected behavior
The expected dimension should be (69, 1, 10) where 69 is batch_size.

Desktop (please complete the following information):

  • OS: Ubuntu 18.


[FEAT] Add new TimeSeries DataProcessor

Is your feature request related to a problem? Please describe.
To remove the friction associated with using new datasets with our current synthesizer models, there's a need to support a new object that is responsible for the data processing.

Describe the solution you'd like

  • Include an Abstract class called BasedProcessor
  • Include a TimeSeries DataProcessor (the current processor does not support time series)
  • Include a DataProcessor for regular tabular data

[FEAT] new synth class

Write the define_gan method with all required components (generator model, teacher models, student model)
Write the train method with all required components (loss functions, batch creation, etc.).
Define a sample method that can be used to create any number of synthetic records from the trained synthesizer.

[BUG] Incorrect computation of the supervised loss in train_supervisor of TimeGAN.


I believe there is a mistake in the way the supervised loss g_loss_s is computed inside train_supervisor. You compute:
g_loss_s = self._mse(h[:, 1:, :], h_hat_supervised[:, 1:, :]) .
This aligns with earlier commits of the original paper code. However, this step would merely learn an identity mapping.

Based on the paper and the new commits of the authors, I believe the loss should be computed as follows:
g_loss_s = self._mse(h[:, 1:, :], h_hat_supervised[:, :-1, :]) . This should help learning the temporal dependencies between time steps and help guiding the generator.

Finally, I have a question regarding the gradients of the generator for the supervised training. In the second step of training TGAN, the supervisor and generator are optimised on g_loss_s. However, the generator is not contributing to the computations and thus has no gradients. Do you have an explanation for this?

[BUG]

This error occurs when I try to install the ydata-synthetic lib. Any idea if I'm doing something wrong?
(screenshot omitted)

[FEAT] Integrate the DataProcessor object into current regular synthesizers process.

Description

To remove the friction of using new datasets with the current architectures, we are offering a new object DataProcessor that will perform basic preprocessing on the provided data. This preprocessing will run at the training step.

Tasks:

  • Validate the feasibility of integrating the DataProcessor into the BaseSynthesizer
  • Integrate DataProcessor in the current training flow.
  • Set an input parameter for the train method with the datatypes.
  • Set an input parameter for the train method - processing - if set to True, the data will be processed; otherwise it is assumed that the user has already processed the data.

Add sample method to GAN class

Add sample method to GAN class. Re-use the code from the examples in the Google Colab to generate new synthetic data.

e.g.:

def sample(self, n_samples):
    # Define z: random noise input for the generator (noise_dim attribute name illustrative)
    z = tf.random.normal((n_samples, self.noise_dim))
    # Predict using the trained generator
    records = self.generator(z, training=False)
    # Return the samples as a pandas DataFrame
    return pd.DataFrame(records.numpy())

[FEAT] how to deal with categorical features in TimeGAN? does TimeGAN converge?

Hi team,

This project is excellent, it makes much better sense to me as a tf2 person. Appreciate your work and sharing!!

I have two questions:

  1. If my data has categorical features, may I still use the same architecture to generate synthetic time-series data? In the real_data_loading() function you used MinMaxScaler() to scale the continuous data; if I have categorical data, should I just one-hot encode it and concatenate it with the scaled numerical data?

  2. I notice the discriminator loss uses the binary cross-entropy loss function; does this mean TimeGAN does not converge like the original GAN? I noticed the generator can be saved when TimeGAN finishes training; if it does not converge, should we save every n epochs and pick the best-performing model?

Best regards,
Ling

[Question] WGAN-GP for timeseries data

Can I use WGAN-GP for time-series data such as stock data? If yes, do I need to preprocess the data into sequences of, say, 24 data points, in the same way as described for the TimeGAN approach?
Any help or direction is much appreciated. Thanks.
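
For reference, building overlapping sequences of 24 points from a (n_timesteps, n_features) array can be done with plain numpy (a sketch, not a package utility):

import numpy as np

def to_sequences(data, seq_len=24):
    # data: array of shape (n_timesteps, n_features) -> (n_windows, seq_len, n_features)
    return np.stack([data[i:i + seq_len] for i in range(len(data) - seq_len + 1)])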

[FEAT] PATEGAN class

  1. Write the PATEGAN define_gan method with all required components (generator model, teacher models, student model)
  2. Write the PATEGAN train method with all required complements (loss functions, make batch etc.).
  3. Define a sample method that can be used to create any number of synthetic records from the trained synthesizer.

[FEAT] Create a util custom layer with Gumbel-Softmax for one categorical distribution

A Gumbel-Softmax layer works on logits or probability outputs of a model over the distribution of a categorical feature.
The required computations of this layer's forward method are the following:

  • Soft sample: Softmax of a tensor consisting of a Gumbel sample added to logits. The gradients of this computation are stored.
  • Hard sample: A one-hot encoded categorical variable with a sampled class activated. The gradients of this computation are not stored, but used only for obtaining real categorical samples.

Note: Generally only the hard sample should be returned in the forward call, alternatively both can be returned in an (hard, soft) tuple.

Requirements:

  • TF implementation
  • A standalone custom layer implemented in utils that produces the requested output assuming as input logits from a single categorical distribution

[BUG] Save methods from WGAN and WGAN-GP not able to save .pkl file

Describe the bug
When using the save method with the WGAN and WGAN-GP architectures, saving fails and the following error is returned.

{PicklingError}Can't pickle <function make_gradient_clipnorm_fn.<locals>.<lambda> at 0x7fee10eb40e0>: it's not found as tensorflow.python.keras.optimizer_v2.utils.make_gradient_clipnorm_fn.<locals>.<lambda>

Expected behavior
All the models/architectures should be picklable and saved using the .save() method of the class.
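
The underlying limitation is that Python's pickle cannot serialize lambdas. The usual workaround is to replace the lambda with a module-level named function that has the same behavior (a generic sketch, not the project's actual fix):

# pickle cannot serialize a lambda such as:
#   clip_fn = lambda grads: grads * 0.5
# but a module-level function with the same behavior is picklable:
def clip_fn(grads):
    return grads * 0.5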

[FEAT] PATEGAN example

Create a PATEGAN example file that showcases the full functionality:

  1. Use and transform an example dataset (e.g., the adult dataset).
  2. Train synthesizer.
  3. Save and load synthesizer
  4. Sample from synthesizer
  5. Inverse transformation from the synthesized samples

[BUG] Sample data after loading saved TimeGAN synth

After loading the synthesizer from a previously trained and saved model, an error is returned while sampling.

(screenshot omitted)

To Reproduce
Steps to reproduce the behavior:

  1. Train a TimeGAN synth
  2. Save the trained model
  3. Load the model
  4. Sample data

Expected behavior
New synthetic data is generated.

Desktop (please complete the following information):

  • RTX 2070
  • Cuda version: 10.1
  • Tensorflow version: 2.3.*

evaluate result

Thanks for your wonderful work!

I have a question: how can I evaluate my results with this repo?

Thanks again!

[FEAT] Return scaler object for stock data, or save scaler in .pkl file

Is your feature request related to a problem? Please describe.
Nope, just an idea. Using the stock timeseries example, it would be good if the scaler object could be saved into a pickle file, or returned with the data it scales, so that we could inverse transform the stock data after synthesis.

Additional context
Something like:

from pickle import dump, load
# save the scaler
dump(scaler, open('scaler.pkl', 'wb'))

# load the scaler later
scaler = load(open('scaler.pkl', 'rb'))

Could plug in the first part in here:

(screenshot omitted)

Thanks!

[BUG] Error with tensorflow using WGANGP

Hi!

I am trying to use WGANGP from your GitHub repository and I have the following error: module 'tensorflow._api.v1.config' has no attribute 'list_physical_devices'.

In the days before, I was able to use this model without any problem, but today it gives me this error, and I don't know how to fix it.

Thank you,
Mikel

Where is the function processed_stock?

While running the code below in the data section of the TimeGAN - Synthetic stock data.ipynb file, I am getting an error:
stock_data = processed_stock(seq_len=seq_len)
print(len(stock_data),stock_data[0].shape)

Error:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/dist-packages/ydata_synthetic/preprocessing/data/stock.csv'

[FEAT] Training visualisations to compare real vs generated data

It would be great if you could include some visualisation plots to show how the loss (cost) functions behave during training. One such visualisation could be a graph showing the generator and discriminator loss, with epochs/iterations on the x-axis.

These graphs would help evaluate the behaviour of the generator/discriminator models, such as checking for mode collapse and comparing real data distributions with generated ones.
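
A minimal sketch of such a plot, assuming the per-epoch generator and discriminator losses are collected into lists during training (the values below are dummies):

import matplotlib.pyplot as plt

g_losses = [1.2, 1.0, 0.9, 0.85]   # generator loss per epoch (dummy values)
d_losses = [0.6, 0.65, 0.7, 0.68]  # discriminator loss per epoch (dummy values)

plt.plot(g_losses, label="generator loss")
plt.plot(d_losses, label="discriminator loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()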
