
ydataai / ydata-synthetic


Synthetic data generators for tabular and time-series data

Home Page: https://docs.synthetic.ydata.ai

License: MIT License

Python 1.30% Makefile 0.01% Shell 0.01% Jupyter Notebook 98.68%
gan-architectures gan deep-learning synthetic-data tensorflow2 machine-learning training-data python3 datagenerator datageneration

ydata-synthetic's Introduction


Join us on Discord

YData Synthetic

A package to generate synthetic tabular and time-series data leveraging state-of-the-art generative models.

🎊 The exciting features:

These are must-try features for synthetic data generation:

  • A new Streamlit app that delivers the synthetic data generation experience through a UI, as a low-code way to quickly generate synthetic data
  • A new, fast synthetic data generation model based on Gaussian Mixture models, so you can get started with synthetic data generation without needing a GPU
  • A conditional architecture for tabular data, CTGAN, which makes synthetic data generation easier and of higher quality!

Synthetic data

What is synthetic data?

Synthetic data is artificially generated data that is not collected from real-world events. It replicates the statistical properties of real data without containing any identifiable information, ensuring individuals' privacy.

Why Synthetic Data?

Synthetic data can be used for many applications:

  • Privacy compliance for data-sharing and Machine Learning development
  • Remove bias
  • Balance datasets
  • Augment datasets

Looking for an end-to-end solution to Synthetic Data Generation?
YData Fabric enables the generation of high-quality datasets within a full UI experience, from data preparation to synthetic data generation and evaluation.
Check out the Community Version.

ydata-synthetic

This repository contains material related to architectures and models for synthetic data, from Generative Adversarial Networks (GANs) to Gaussian Mixtures. The repo includes a full ecosystem for synthetic data generation, with different models for generating synthetic structured (tabular) data and time series. All the deep learning models are implemented with TensorFlow 2. Several example Jupyter notebooks and Python scripts are included to show how to use the different architectures.

Are you ready to learn more about synthetic data and the best practices for synthetic data generation?

Quickstart

The source code is currently hosted on GitHub at: https://github.com/ydataai/ydata-synthetic

Binary installers for the latest released version are available at the Python Package Index (PyPI).

pip install ydata-synthetic
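
Once installed, the typical flow is to train a synthesizer on a pandas DataFrame and then sample new records from it. Below is a minimal sketch of that flow; the class and argument names (RegularSynthesizer, ModelParameters, TrainParameters, the column lists) follow the current documentation but may differ between versions, so treat it as illustrative rather than exact:

import pandas as pd
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# Load any tabular dataset as a pandas DataFrame (path and column names are hypothetical)
data = pd.read_csv("adult.csv")
num_cols = ["age", "hours-per-week"]      # numerical columns (illustrative)
cat_cols = ["workclass", "education"]     # categorical columns (illustrative)

# Configure and train a CTGAN-based synthesizer
synth = RegularSynthesizer(modelname="ctgan", model_parameters=ModelParameters(batch_size=500))
synth.fit(data=data, train_arguments=TrainParameters(epochs=300), num_cols=num_cols, cat_cols=cat_cols)

# Sample new synthetic records
synthetic_df = synth.sample(1000)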

The UI guide for synthetic data generation

YData Synthetic now has a UI to guide you through the steps and inputs needed to generate structured tabular data. The Streamlit app is available from v1.0.0 onwards and supports the following flows:

  • Train a synthesizer model
  • Generate & profile synthetic data samples

Installation

pip install ydata-synthetic[streamlit]

Quickstart

Use the code snippet below in a Python file (Jupyter notebooks are not supported):

from ydata_synthetic import streamlit_app

streamlit_app.run()

Or use the file streamlit_app.py that can be found in the examples folder.

python -m streamlit_app

The following models are supported:

  • CGAN
  • WGAN
  • WGANGP
  • DRAGAN
  • CRAMER
  • CTGAN

Watch the video

Examples

Here you can find usage examples of the package and its models to synthesize tabular and time-series data.

  • Fast tabular data synthesis on adult census income dataset Open in Colab
  • Tabular synthetic data generation with CTGAN on adult census income dataset Open in Colab
  • Time Series synthetic data generation with TimeGAN on stock dataset Open in Colab
  • Time Series synthetic data generation with DoppelGANger on FCC MBA dataset Open in Colab
  • More examples are continuously added and can be found in /examples directory.

Datasets for you to experiment

Here are some example datasets for you to try with the synthesizers:

Tabular datasets

Sequential datasets

Project Resources

In this repository you can find the several GAN architectures used to create the synthesizers:

Tabular data

Sequential data

Contributing

We are open to collaboration! If you want to start contributing, you only need to:

  1. Search for an issue you would like to work on. Issues for newcomers are labeled with good first issue.
  2. Create a PR solving the issue.
  3. We will review every PR and either accept it or ask for revisions.

Support

For support in using this library, please join our Discord server. Our Discord community is very friendly and great about quickly answering questions about the use and development of the library. Click here to join our Discord community!

FAQs

Have a question? Check out the Frequently Asked Questions about ydata-synthetic. If you feel something is missing, feel free to book a beary informal chat with us.

License

MIT License

ydata-synthetic's People

Contributors

aquemy, archity, arunnthevapalan, ceshine, cptanalatriste, crownpku, dependabot[bot], fabclmnt, fanconic, gmartinsribeiro, jfsantos-ds, ljmatkins, mglcampos, miriamspsantos, portellaa, rajeshai, renovate[bot], ricardodcpereira, strickvl, ubabe53, vascoalramos


ydata-synthetic's Issues

[BUG] Examples imports of YData lib are not up to date.

Describe the bug
Import errors due to outdated imports in the ydata examples.

To Reproduce
Run any file from the examples folder; the imports will fail.

Expected behavior

from ydata_synthetic.synthesizers.regular import *
from ydata_synthetic.synthesizers.timeseries import *

Screenshots
(screenshot omitted)

[BUG] Warning in TimeGAN training when seq_len does not match hidden_dim

Hello,

I tried running the TimeGAN example, but setting the seq_len to be different from the hidden_dim generates a warning during training. This seems like a bug. Please see the attached snapshots.
TimeGANParameters
TimeGANTraining

I am still learning about TimeGAN, but reviewing define_gan() in model.py, I noticed that the embedding network has input shape (hidden_dim x n_seq) while the discriminator has input shape (seq_len x n_seq). Shouldn't these be swapped?
TimeGANModelCode

Appreciate your help.

Thanks & Best Regards,
Ajay.

[Question] TimeGAN postprocessing of generated data

A question about TimeGAN postprocessing.
I'm using TimeGAN to experiment with generation of stock returns of a handful of stocks simultaneously, and I'm using a sequence length of 24 following the seminal paper by Yoon 2019.

After successfully running TimeGAN on the input data I end up with a multi-dimensional array of shape (5000, 24, 10) where 5000 is the generated length, 24 is the sequence length and 10 is the number of stocks.

Now I want to take the generated sequences and produce a matrix of shape (x, 10), where x is the resulting matrix length, so that I can use it in my subsequent experiments. How do I convert (5000, 24, 10) to (x, 10)? Do I just reshape the array, or is there a better way?
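
For reference, a plain numpy reshape collapses the batch and sequence axes into one (a sketch, assuming the generated array is called synth_data); note that each window is generated independently, so the reshaped rows do not form one continuous time series:

import numpy as np

synth_data = np.random.rand(5000, 24, 10)              # stand-in for the TimeGAN output
flat = synth_data.reshape(-1, synth_data.shape[-1])    # shape (5000 * 24, 10)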

[FEAT] auxiliary processing functions

Support synthesizers with a DataProcessor class for synth training and inverse transforms for the synthetic samples.
The synthesizers' sample method should apply the DataProcessor's inverse_transform to the synthetic data so that it is returned in the original data format.
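
A rough sketch of the intended sample flow (method and attribute names are illustrative, not a final API):

def sample(self, n_samples):
    # Generate records in the processed (scaled / one-hot encoded) space
    raw = self._sample_from_generator(n_samples)   # hypothetical internal helper
    # Map them back to the original column types and ranges
    return self.processor.inverse_transform(raw)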

[BUG] CGAN model passes class column as data

The CGAN model expects the class to be defined in the data array. The class column is not filtered out before being supplied to the CGAN.

Example

Consider a dataset with 10 parameters and 1 class (0 or 1). The data within synthesizer.train(data=train_data, train_arguments=train_args) is an array of n_samples x 11. The entire dataset, including the class column, is provided to the GAN as input data, even though the class is handled separately as the label.

            batch_x = self.get_data_batch(data, self.batch_size)
            label = batch_x[:, train_arguments.label_dim]
            d_loss_real = self.discriminator.train_on_batch([batch_x, label], valid) 

In the lines above, batch_x already contains the label column; it should be removed before being passed as feature data.
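
A possible fix, sketched against the snippet above (assumes numpy is imported as np and that train_arguments.label_dim is the index of the class column, as in the existing code):

batch_x = self.get_data_batch(data, self.batch_size)
label = batch_x[:, train_arguments.label_dim]
# Drop the class column before feeding the batch to the discriminator as data
features = np.delete(batch_x, train_arguments.label_dim, axis=1)
d_loss_real = self.discriminator.train_on_batch([features, label], valid)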

[FEAT] Create an interface layer connecting categorical distributions to specific Gumbel-Softmax layers

Interface layer implementation

To process the logits outputs of a Generator correctly, information about the type of features being synthesized is necessary.
The idea of this layer is to leverage the information available in the synthesizer's DataProcessor to create Gumbel-Softmax activations for each of the categorical distributions, TanH/ReLU activations for the other features, and to concatenate all outputs in the original order.
In the end this layer should be initialized leveraging just attributes of the processor and work as a model component like any other TF/Keras layer.

Requirements:

  • TF implementation
  • Gumbel-Softmax layer implementation (see below)
  • Tests

Gumbel-Softmax layer implementation

A Gumbel-Softmax layer works on logits or probability outputs of a model over the distribution of a categorical feature.
The required computations of this layer's forward method are the following:

  • Soft sample: Softmax of a tensor consisting of a Gumbel sample added to logits. The gradients of this computation are stored.
  • Hard sample: A one-hot encoded categorical variable with a stochastically sampled class activated. The gradients of this computation are not stored, but used only for obtaining real categorical samples.

Note: Generally only the hard sample should be returned in the forward call; alternatively, both can be returned as a (hard, soft) tuple.

Requirements:

  • TF implementation
  • A standalone custom layer implemented in utils that produces the requested output assuming as input logits from a single categorical distribution
  • Tests
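
A minimal TensorFlow sketch of such a layer, following the description above (the temperature value and class/argument names are illustrative, not the package's implementation):

import tensorflow as tf

class GumbelSoftmax(tf.keras.layers.Layer):
    # Sketch only: Gumbel-Softmax activation for the logits of one categorical feature.
    def __init__(self, tau=0.2, **kwargs):
        super().__init__(**kwargs)
        self.tau = tau  # softmax temperature (illustrative default)

    def call(self, logits):
        # Soft sample: softmax of logits plus Gumbel noise; gradients flow through this path
        uniform = tf.random.uniform(tf.shape(logits), minval=1e-20, maxval=1.0)
        gumbel = -tf.math.log(-tf.math.log(uniform))
        soft = tf.nn.softmax((logits + gumbel) / self.tau)
        # Hard sample: one-hot of the stochastically sampled class; gradients are not kept
        hard = tf.stop_gradient(tf.one_hot(tf.argmax(soft, axis=-1), tf.shape(logits)[-1]))
        return hard, soft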

Create subdomain slack.ydata.ai

Create subdomain slack.ydata.ai pointing to our community Slack invitation.

Create it on Route 53, which currently manages our domain.

Create a monthly recurring task to update the Slack invitation URL - it expires every 30 days for security reasons.

[BUG] Cannot load saved model when using GumbelSoftmaxActivation SD-73

Describe the bug
Hi all, I encountered this error when trying to load a trained model using WGAN_GP.

ValueError: Unknown layer: Synthetic Data>GumbelSoftmaxActivation. Please ensure this object is passed to the custom_objects argument. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details.

To Reproduce
Steps to reproduce the behavior:

  1. Run the example adult_wgangp.py

Expected behavior
It should load the saved model successfully.

Desktop (please complete the following information):

  • OS: Ubuntu Desktop 20.04
  • Cuda version: 11.3
  • tensorflow version: 2.7.0
  • keras version: 2.7.0

[BUG] How to inverse_transform (One-Hot) float number. + ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Hi, I am trying to run your code "adult_wgangp.py".

(1) Problem 1 : Bug
I have an error here "processed_data = pd.DataFrame.sparse.from_spmatrix(preprocessor.fit_transform(processed_data))" ,
and the error is "ValueError: Specifying the columns using strings is only supported for pandas DataFrames".

  • Solution
    I separated the two transforms and obtained the transformed DataFrame (one-hot + StandardScaler).

(2) Problem 2 : How to inverse_transform (One-Hot) float value.

I want to combine the synthetic data with the original minority-class data, and then feed the combined DataFrame (balanced data = synthetic data + original minority data) to a model.
But the code here notes that the returned data is not inverse-processed.
After generation, data_sample holds the generated data, as below.

#Sampling the data
#Note that the data returned it is not inverse processed.
data_sample = synth.sample(100000)

After the transform step, discrete data is converted to a one-hot vector such as [0, 0, 1]; the GAN then generates float values such as [-0.009592, -0.128386, 1.009473].
Because I want to combine the synthetic data with the original minority data, I have to inverse_transform the synthetic data.
But for inverse_transform, the one-hot values must be 0 or 1, not floats like (-0.009592, -0.128386, 1.009473).
I don't know how to turn the float values into 0 or 1 in the discrete columns.
I have searched for many methods, but I still can't solve it.
I am wondering if you could write a notebook to show how to solve this?

Thank you,
Lily
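
One common workaround is to snap each generated one-hot block to a hard 0/1 vector via argmax before calling inverse_transform. A hedged sketch (you need to know which column positions belong to each categorical feature; the indices below are illustrative, not part of the package):

import numpy as np

def harden_one_hot(block):
    # block: generated float values for one one-hot encoded feature, shape (n_samples, n_categories)
    hard = np.zeros_like(block)
    hard[np.arange(block.shape[0]), block.argmax(axis=1)] = 1.0
    return hard

# e.g. if columns 3:6 hold the one-hot block of a single categorical feature:
# synth_array[:, 3:6] = harden_one_hot(synth_array[:, 3:6])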

[BUG] IndexError on loading a saved model

Describe the bug
If I'm running the code on a host without a GPU (e.g., in Google Colab with the runtime set to CPU), I can't load a saved model.
The error is raised in this line, at the load function:
tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)

I don't know if this should be the expected behavior. If so, sorry to bother you.
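
For what it's worth, guarding that call avoids the IndexError on CPU-only hosts (a sketch of the general fix, not necessarily how the package resolves it):

import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:  # only set memory growth when a GPU is actually present
    tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)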

[BUG] Kernel crash when importing TimeGAN in Jupyter

Hi,

I would like to experiment with the time-series GAN in a Jupyter notebook.
I pip installed the package, but when I try to import it, the kernel crashes
(from ydata_synthetic.synthesizers.timeseries import TimeGAN).

I am on an M1 Mac, which is supposed to be pretty powerful. Do you have any idea what the issue is here?

Thanks for the good work !

[BUG] Save/Load won't work due to lambda functions

Describe the bug
While saving the TimeGAN model an error is returned due to the use of lambda functions for the optimizers.

To Reproduce
Steps to reproduce the behavior:

  1. Train the StockGAN example.
  2. Save the trained model

Expected behavior
To be able to save and load the TimeGAN model with no errors.

Screenshots
(screenshot omitted)

[FEAT] How to feed TimeGAN with different input/output array sizes

I have many groups of sequence data, such as stock data. I want to train GOOGL, AAPL, MSFT, ... in one model, but GOOGL/AAPL/MSFT have different sequence lengths. I would like the output to be similarly variable-length sequences.
I want to know how to use this package to solve this type of problem.

[BUG]

First of all, thank you for the work.

Describe the bug
I am trying to train a TimeGAN on my own dataset.
I used the preprocessing function, i.e., the real_data_loading() function from preprocessing.utils, to prepare the data.
All my parameter settings are the same as what you explained in this Jupyter notebook, except for the number of sequences, which I set equal to the number of columns of my dataset.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'https://github.com/Sorooshi/SuperOX'
  2. Click on 'u06-TGAN.ipynb'
  3. Scroll down to 'cell [19]'
  4. See error

Expected behavior
The expected dimension should be (69, 1, 10) where 69 is batch_size.

Desktop (please complete the following information):

  • OS: Ubuntu 18.


[FEAT] Add new TimeSeries DataProcessor

Is your feature request related to a problem? Please describe.
To remove the friction associated with using new datasets with our current synthesizer models, there's a need to support a new object that is responsible for the data processing.

Describe the solution you'd like

  • Include an Abstract class called BasedProcessor
  • Include a TimeSeries DataProcessor (the current processor does not support time series)
  • Include a DataProcessor for regular tabular data

[FEAT] new synth class

Write the define_gan method with all required components (generator model, teacher models, student model)
Write the train method with all required components (loss functions, batch creation, etc.).
Define a sample method that can be used to create any number of synthetic records from the trained synthesizer.

[BUG] Incorrect computation of the supervised loss in train_supervisor of TimeGAN.


I believe there is a mistake in the way the supervised loss g_loss_s is computed inside train_supervisor. You compute:
g_loss_s = self._mse(h[:, 1:, :], h_hat_supervised[:, 1:, :]) .
This aligns with earlier commits of the original paper code. However, this step would merely learn an identity mapping.

Based on the paper and the new commits of the authors, I believe the loss should be computed as follows:
g_loss_s = self._mse(h[:, 1:, :], h_hat_supervised[:, :-1, :]) . This should help learning the temporal dependencies between time steps and help guiding the generator.

Finally, I have a question regarding the gradients of the generator for the supervised training. In the second step of training TGAN, the supervisor and generator are optimised on g_loss_s. However, the generator is not contributing to the computations and thus has no gradients. Do you have an explanation for this?

[BUG]

This error occurs when I try to install the ydata-synthetic lib. Any idea if I'm doing something wrong?
(screenshot omitted)

[FEAT] Integrate the DataProcessor object into current regular synthesizers process.

Description

To remove the friction of using new datasets with the current architectures, we are offering a new object DataProcessor that will perform basic preprocessing on the provided data. This preprocessing will run at the training step.

Tasks:

  • Validate the feasibility of integrating the DataProcessor into the BaseSynthesizer
  • Integrate DataProcessor in the current training flow.
  • Set an input parameter for the train method with the datatypes.
  • Set an input parameter for the train method - processing - if set to True, the data will be processed; otherwise it is assumed that the user has already processed the data.

Add sample method to GAN class

Add sample method to GAN class. Re-use the code from the examples in the Google Colab to generate new synthetic data.

e.g.:

def sample(self, n_samples):
    # Define z: random noise input for the generator (noise_dim attribute name illustrative)
    z = tf.random.normal((n_samples, self.noise_dim))
    # Predict using the trained generator
    records = self.generator(z, training=False)
    # Return the samples as a pandas DataFrame
    return pd.DataFrame(records.numpy())

[FEAT] how to deal with categorical features in TimeGAN? does TimeGAN converge?

Hi team,

This project is excellent, it makes much better sense to me as a tf2 person. Appreciate your work and sharing!!

I have two questions:

  1. If my data has categorical features, may I still use the same architecture to generate synthetic time-series data? In the real_data_loading() function you used MinMaxScaler() to scale the continuous data; if I have categorical data, should I just one-hot encode it and concatenate it with the scaled numerical data?

  2. I notice the discriminator loss uses the binary cross-entropy loss function; does this mean TimeGAN does not converge like the original GAN? I noticed the generator can be saved when TimeGAN finishes training; if it does not converge, should we save every n epochs and pick the best-performing model?

Best regards,
Ling

[Question] WGAN-GP for timeseries data

Can I use WGAN-GP for time-series data such as stock data? If yes, do I need to preprocess the data into sequences of, say, 24 data points, in the same way as described for the TimeGAN approach?
Any help or direction is much appreciated. Thanks.
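
For reference, building overlapping sequences of 24 points from a (n_timesteps, n_features) array can be done with plain numpy (a sketch, not a package utility):

import numpy as np

def to_sequences(data, seq_len=24):
    # data: array of shape (n_timesteps, n_features) -> (n_windows, seq_len, n_features)
    return np.stack([data[i:i + seq_len] for i in range(len(data) - seq_len + 1)])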

[FEAT] PATEGAN class

  1. Write the PATEGAN define_gan method with all required components (generator model, teacher models, student model)
  2. Write the PATEGAN train method with all required complements (loss functions, make batch etc.).
  3. Define a sample method that can be used to create any number of synthetic records from the trained synthesizer.

[FEAT] Create a util custom layer with Gumbel-Softmax for one categorical distribution

A Gumbel-Softmax layer works on logits or probability outputs of a model over the distribution of a categorical feature.
The required computations of this layer's forward method are the following:

  • Soft sample: Softmax of a tensor consisting of a Gumbel sample added to logits. The gradients of this computation are stored.
  • Hard sample: A one-hot encoded categorical variable with a sampled class activated. The gradients of this computation are not stored, but used only for obtaining real categorical samples.

Note: Generally only the hard sample should be returned in the forward call, alternatively both can be returned in an (hard, soft) tuple.

Requirements:

  • TF implementation
  • A standalone custom layer implemented in utils that produces the requested output assuming as input logits from a single categorical distribution

[BUG] Save methods from WGAN and WGAN-GP not able to save .pkl file

Describe the bug
When using the save method with the WGAN and WGAN-GP architectures, saving fails and the following error is returned.

{PicklingError}Can't pickle <function make_gradient_clipnorm_fn.<locals>.<lambda> at 0x7fee10eb40e0>: it's not found as tensorflow.python.keras.optimizer_v2.utils.make_gradient_clipnorm_fn.<locals>.<lambda>

Expected behavior
All the models/architectures should be picklable and saved using the .save() method of the class.
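
The underlying limitation is that Python's pickle cannot serialize lambdas. The usual workaround is to replace the lambda with a module-level named function that has the same behavior (a generic sketch, not the project's actual fix):

# pickle cannot serialize a lambda such as:
#   clip_fn = lambda grads: grads * 0.5
# but a module-level function with the same behavior is picklable:
def clip_fn(grads):
    return grads * 0.5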

[FEAT] PATEGAN example

Create a PATEGAN example file that showcases the full functionality:

  1. Use and transform an example dataset (e.g., the adult dataset).
  2. Train synthesizer.
  3. Save and load synthesizer
  4. Sample from synthesizer
  5. Inverse transformation from the synthesized samples

[BUG] Sample data after loading saved TimeGAN synth

After loading the synthesizer from a previously trained and saved model, an error is returned while sampling.

(screenshot omitted)

To Reproduce
Steps to reproduce the behavior:

  1. Train a TimeGAN synth
  2. Save the trained model
  3. Load the model
  4. Sample data

Expected behavior
New synthetic data is generated.

Desktop (please complete the following information):

  • RTX 2070
  • Cuda version: 10.1
  • Tensorflow version: 2.3.*

evaluate result

Thanks for your wonderful work!

I have a question: how can I evaluate my results with this repo?

Thanks again!

[FEAT] Return scaler object for stock data, or save scaler in .pkl file

Is your feature request related to a problem? Please describe.
Nope, just an idea. Using the stock timeseries example, it would be good if the scaler object could be saved into a pickle file, or returned with the data it scales, so that we could inverse transform the stock data after synthesis.

Additional context
Something like:

from pickle import dump, load
# save the scaler
dump(scaler, open('scaler.pkl', 'wb'))

# load the scaler later
scaler = load(open('scaler.pkl', 'rb'))

Could plug in the first part in here:

(screenshot omitted)

Thanks!

[BUG] Error with tensorflow using WGANGP

Hi!

I am trying to use WGANGP from your GitHub repository and I have the following error: module 'tensorflow._api.v1.config' has no attribute 'list_physical_devices'.

In the days before, I was able to use this model without any problem, but today it gives me this error, and I don't know how to fix it.

Thank you,
Mikel

Where is the function processed_stock?

While running the code below in the data section of the TimeGAN - Synthetic stock data.ipynb file, I am getting an error:
stock_data = processed_stock(seq_len=seq_len)
print(len(stock_data),stock_data[0].shape)

Error:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/dist-packages/ydata_synthetic/preprocessing/data/stock.csv'

[FEAT] Training visualisations to compare real vs generated data

It would be great if you could include some visualisation plots to show how the loss (cost) functions behave during training. One such visualisation could be a graph showing the generator and discriminator loss, with epochs/iterations on the x-axis.

These graphs would help evaluate the behaviour of the generator/discriminator models, such as checking for mode collapse and comparing real data distributions with generated ones.
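
A minimal sketch of such a plot, assuming the per-epoch generator and discriminator losses are collected into lists during training (the values below are dummies):

import matplotlib.pyplot as plt

g_losses = [1.2, 1.0, 0.9, 0.85]   # generator loss per epoch (dummy values)
d_losses = [0.6, 0.65, 0.7, 0.68]  # discriminator loss per epoch (dummy values)

plt.plot(g_losses, label="generator loss")
plt.plot(d_losses, label="discriminator loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()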
