gretelai / gretel-synthetics

Synthetic data generators for structured and unstructured text, featuring differentially private learning.

Home Page: https://gretel.ai/platform/synthetics

License: Other

Python 99.73% Shell 0.27%
differential-privacy artificial-intelligence tensorflow privacy synthetic-data

gretel-synthetics's Introduction

Gretel Synthetics

Gobs the Gretel.ai cat
A permissive synthetic data library from Gretel.ai


Documentation

Try it out now!

If you want to quickly discover gretel-synthetics, simply click the button below and follow the tutorials!

Open in Colab

Check out additional examples here.

Getting Started

This section will guide you through installation of gretel-synthetics and dependencies that are not directly installed by the Python package manager.

Dependency Requirements

By default, we do not install certain core requirements. The following dependencies should be installed separately from gretel-synthetics, depending on which model(s) you plan to use.

  • TensorFlow: used by the LSTM model; we recommend version 2.11.x
  • Torch: used by Timeseries DGAN and ACTGAN (for ACTGAN, Torch is installed by SDV); we recommend version 2.0
  • SDV (Synthetic Data Vault): used by ACTGAN; we recommend version 0.17.x

These dependencies can be installed by doing the following:

pip install tensorflow==2.11 # for LSTM
pip install "sdv<0.18" # for ACTGAN (quotes keep the shell from treating < as a redirection)
pip install torch==2.0 # for Timeseries DGAN

To install the actual gretel-synthetics package, first clone the repo and then...

pip install -U .

or

pip install gretel-synthetics

then...

$ pip install jupyter
$ jupyter notebook

When the UI launches in your browser, navigate to examples/synthetic_records.ipynb and get generating!

If you want to install gretel-synthetics locally and use a GPU (recommended):

  1. Create a virtual environment (e.g. using conda)
$ conda create --name tf python=3.9
  2. Activate the virtual environment
$ conda activate tf
  3. Run the setup script ./setup-utils/setup-gretel-synthetics-tensorflow24-with-gpu.sh

The last step installs all the necessary software packages for GPU usage, tensorflow==2.8, and gretel-synthetics. Note that this script works only for Ubuntu 18.04; you might need to modify it for other OS versions.

Timeseries DGAN Overview

The timeseries DGAN module contains a PyTorch implementation of a DoppelGANger model that is optimized for timeseries data. Similar to TensorFlow, you will need to manually install PyTorch:

pip install torch==1.13.1

This notebook shows basic usage on a small data set of smart home sensor readings.
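For orientation, the following is a minimal sketch of training a DGAN model on a DataFrame, modeled on the usage shown in the examples and issues elsewhere on this page; the parameter values are illustrative only and the exact API may differ between versions.

import numpy as np
import pandas as pd

from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig

# Toy "wide" training data: 1000 examples, 30 time steps, plus one attribute column
df = pd.DataFrame(np.random.random(size=(1000, 30)))
df.columns = pd.date_range("2022-01-01", periods=30)
df["attribute"] = np.random.randint(0, 3, size=1000)

model = DGAN(DGANConfig(
    max_sequence_len=30,   # length of each training example
    sample_len=3,          # time points produced per RNN step
    batch_size=1000,
    epochs=10,             # for real data sets, 100-1000 epochs is typical
))

model.train_dataframe(
    df,
    attribute_columns=["attribute"],
    discrete_columns=["attribute"],
)

synthetic_df = model.generate_dataframe(100)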

ACTGAN Overview

ACTGAN (Anyway CTGAN) is an extension of the popular CTGAN implementation that provides some additional functionality to improve memory usage, autodetection and transformation of columns, and more.

To use this model, you will need to manually install SDV:

pip install "sdv<0.18"

Keep in mind that this will also install several dependencies like PyTorch that SDV relies on, which may conflict with PyTorch versions installed for use with other models like Timeseries DGAN.

The ACTGAN interface is a superset of the CTGAN interface. To see the additional features, please take a look at the ACTGAN demo notebook in the examples directory of this repo.
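As a rough orientation, the snippet below sketches typical ACTGAN usage, assuming the gretel_synthetics.actgan import path and the CTGAN-style fit/sample interface; the DataFrame and parameter values are made up, and the ACTGAN demo notebook remains the authoritative reference.

import pandas as pd

from gretel_synthetics.actgan import ACTGAN  # import path assumed; see the demo notebook

# Illustrative tabular training data
df = pd.DataFrame({
    "age": [25, 37, 52, 41],
    "city": ["Austin", "Boston", "Chicago", "Denver"],
})

# Illustrative parameters only; ACTGAN accepts the usual CTGAN arguments
# plus the extensions described above.
model = ACTGAN(epochs=100, verbose=True)
model.fit(df)

synthetic_df = model.sample(1000)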

LSTM Overview

This package allows developers to quickly get started with synthetic data generation through the use of neural networks. The more complex pieces of working with libraries like TensorFlow and differential privacy are bundled into friendly Python classes and functions. There are two high-level modes that can be used.

Simple Mode

The simple mode will train line-per-line on an input file of text. When generating data, the generator will yield a custom object that can be used in a variety of ways depending on your use case. This notebook demonstrates this mode.
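To make the flow concrete, here is a minimal sketch of simple mode using the configuration and function names that appear elsewhere in this README (LocalConfig, train_rnn, generate_text); the file path and parameter values are placeholders.

from pathlib import Path

from gretel_synthetics.config import LocalConfig
from gretel_synthetics.train import train_rnn
from gretel_synthetics.generate import generate_text

# Point the config at a line-delimited text file and a checkpoint directory
config = LocalConfig(
    epochs=15,
    gen_lines=100,
    checkpoint_dir=(Path.cwd() / "checkpoints").as_posix(),
    input_data_path="training_data.txt",  # hypothetical input file
)

train_rnn(config)

# Each yielded object carries the generated text plus validation metadata
for record in generate_text(config):
    print(record.text)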

DataFrame Mode

This library supports CSV / DataFrames natively using the DataFrame "batch" mode. This module provides a wrapper around the simple mode that is geared toward working with tabular data. Additionally, it is capable of handling a high number of columns by breaking the input DataFrame up into "batches" of columns and training a model on each batch. This notebook shows an overview of using this library with DataFrames natively.
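A minimal sketch of the batch workflow follows, assuming the DataFrameBatch helper and the create_training_data / train_all_batches / generate_all_batch_lines / batches_to_df methods from the batch module; the linked notebook is the definitive walk-through, and the input file here is hypothetical.

import pandas as pd

from gretel_synthetics.batch import DataFrameBatch

source_df = pd.read_csv("my_training_data.csv")  # hypothetical input file

# Per-batch training settings (epochs, checkpoint_dir, etc.)
config_template = {
    "epochs": 15,
    "checkpoint_dir": "checkpoints",
}

batcher = DataFrameBatch(df=source_df, config=config_template, batch_size=15)
batcher.create_training_data()          # split the columns into batches and write training files
batcher.train_all_batches()             # train one model per column batch
batcher.generate_all_batch_lines()      # generate synthetic lines for each batch
synthetic_df = batcher.batches_to_df()  # reassemble the batches into a single DataFrame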

Components

There are four primary components to be aware of when using this library.

  1. Configurations. Configurations are classes that are specific to an underlying ML engine used to train and generate data. An example would be using TensorFlowConfig to create all the necessary parameters to train a model based on TF. LocalConfig is aliased to TensorFlowConfig for backwards compatibility with older versions of the library. A model is saved to a designated directory, which can optionally be archived and utilized later.

  2. Tokenizers. Tokenizers convert input text into integer-based IDs that are used by the underlying ML engine. These tokenizers can be created and sent to the training input. This is optional, and if no specific tokenizer is specified then a default one will be used. You can find an example here that uses a simple char-by-char tokenizer to build a model from an input CSV. When training in a non-differentially private mode, we suggest using the default SentencePiece tokenizer, an unsupervised tokenizer that learns subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and unigram language model [Kudo]) for faster training and increased accuracy of the synthetic model.

  3. Training. Training a model combines the configuration and tokenizer and builds a model, which is stored in the designated directory, that can be used to generate new records.

  4. Generation. Once a model is trained, any number of new lines or records can be generated. Optionally, a record validator can be provided to ensure that the generated data meets any constraints that are necessary. See our notebooks for examples on validators.
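As a quick illustration of the generation step with a validator (a sketch only; the six-column record format and the pre-built config are hypothetical), the validator is simply a callable that raises an exception for any line that should be rejected:

from gretel_synthetics.generate import generate_text

def validate_record(line: str) -> None:
    # Reject generated lines that do not have exactly 6 comma-separated fields
    parts = line.split(",")
    if len(parts) != 6:
        raise Exception("record not 6 parts")

# `config` is a previously created configuration (see Configurations above)
for record in generate_text(config, line_validator=validate_record):
    if record.valid:
        print(record.text)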

Utilities

In addition to the four primary components, the gretel-synthetics package also ships with a set of utilities that are helpful for training advanced synthetics models and evaluating synthetic datasets.

Some of this functionality carries large dependencies, so it is shipped as an extra called utils. To install these dependencies, you may run

pip install gretel-synthetics[utils]

For additional details, please refer to the Utility module API docs.

Differential Privacy

Differential privacy support for our TensorFlow mode is built on the great work being done by the Google TF team and their TensorFlow Privacy library.

When utilizing DP, we currently recommend using the character tokenizer, as it will only create a vocabulary of single tokens, which removes the risk of sensitive data being memorized as actual tokens that can be replayed during generation.

There are also a few notable configuration options, such as:

  • predict_batch_size should be set to 1
  • dp should be enabled
  • learning_rate, dp_noise_multiplier, dp_l2_norm_clip, and dp_microbatches can be adjusted to achieve various epsilon values.
  • reset_states should be disabled

Please see our example Notebook for training a DP model based on the Netflix Prize dataset.
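Putting the recommendations above together, a differentially private configuration might look like the sketch below; the values are illustrative and drawn from the example configurations elsewhere on this page, so tune them for your own data.

from pathlib import Path

from gretel_synthetics.config import LocalConfig

config = LocalConfig(
    dp=True,                     # enable differentially private training
    predict_batch_size=1,        # recommended when DP is enabled
    reset_states=False,          # recommended when DP is enabled
    learning_rate=0.015,         # tune together with the dp_* params to reach a target epsilon
    dp_noise_multiplier=1.1,     # how much noise is added to gradients
    dp_l2_norm_clip=1.0,         # bound the optimizer's sensitivity to individual training points
    dp_microbatches=256,         # split batches into microbatches
    checkpoint_dir=(Path.cwd() / "checkpoints").as_posix(),
    input_data_path="training_data.csv",  # hypothetical input file
)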

gretel-synthetics's People

Contributors

anastasia-nesterenko, andrewnc, anthager, arronhunt, csbailey5t, dependabot[bot], dni138, drew, hgascon, jeesh96, johntmyers, kboyd, lememta, lipikaramaswamy, mahmadza, marjan-emd, matthewgrossman, mckornfield, mikeknep, misberner, mvansegbroeck, pimlock, santhosh97, theonlyrob, tylersbray, zredlined


gretel-synthetics's Issues

Performance issue in /src/gretel_synthetics/tensorflow (by P3)

Hello! I've found a performance issue in /train.py: .batch(store.batch_size, drop_remainder=True) (here) should be called before .map(_split_input_target) (here), which could make your program more efficient.

Here is the tensorflow document to support it.

Besides, you need to check whether the function _split_input_target called in .map(_split_input_target) is affected, to make the changed code work properly. For example, if _split_input_target needs data with shape (x, y, z) as its input before the fix, it would require data with shape (batch_size, x, y, z) after the fix.
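For illustration, here is a hedged sketch of the proposed reordering; the toy dataset stands in for the real sequences, and the actual pipeline in train.py may differ.

import tensorflow as tf

# After batching, each element has shape (batch_size, seq_len + 1),
# so the split must slice along axis 1 instead of axis 0.
def _split_input_target(batch):
    return batch[:, :-1], batch[:, 1:]

batch_size = 4
sequences = tf.data.Dataset.from_tensor_slices(tf.reshape(tf.range(80), (8, 10)))

dataset = (
    sequences
    .batch(batch_size, drop_remainder=True)  # batch first (the proposed change)
    .map(_split_input_target)                # then split each batched chunk
)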

Looking forward to your reply. Btw, I am very glad to create a PR to fix it if you are too busy.

SentencePiece data tokenization to improve model accuracy

Add SentencePiece tokenizer to extend synthetics model from character-based encoding to support flexible token-based encoding, ideally training a tokenizer on raw input data. Versus a character-based encoding, this should be more efficient (grouping common characters into a single token), and result in better synthetic data creation (common spacing and character patterns learned by tokenizer).

See: subword units such as byte-pair encoding (BPE) [Sennrich et al.] and unigram language model [Kudo]

Goals:

  • Specify fixed vocabulary size for training (e.g. 100, 500, 1000)
  • Specify minimum character coverage (0.0-1.0)
  • Detokenize in a generalized, language and character set independent manner

ValueError: multiprocessing_context option should specify a valid start method in ['spawn'], but got multiprocessing_context='fork'[FR / BUG]

Hi there,

When running the following sample code it throws a multiprocessing error:

ValueError: multiprocessing_context option should specify a valid start method in ['spawn'], but got multiprocessing_context='fork'


# Create some random training data data
df = pd.DataFrame(np.random.random(size=(1000,30)))
df.columns = pd.date_range("2022-01-01", periods=30)
# Include an attribute column
df["attribute"] = np.random.randint(0, 3, size=1000)

# Train the model
model = DGAN(DGANConfig(
    max_sequence_len=30,
    sample_len=3,
    batch_size=1000,
    epochs=10,  # For real data sets, 100-1000 epochs is typical
))

model.train_dataframe(
    df,
    attribute_columns=["attribute"],
    discrete_columns=["attribute"],
)

# Generate synthetic data
synthetic_df = model.generate_dataframe(100)

synthetic_df

I am using a windows machine and jupyter notebook but have tested in terminal and same error.

Found this stackoverflow which suggests it is a windows issue?
https://stackoverflow.com/questions/76076183/how-do-i-set-multiprocessing-context-to-spawn-in-my-code

Is there any possibility the multiprocessing_context in the dgan dataloader could be modifiable?

Thanks for future help with this.

Jordan

TooManyInvalidError: Maximum number of invalid lines reached!

Hello,

I am trying to generate new lines for my dataframe, which contains survey responses (so categorical variables that are not yet encoded). I ran the "synthetic_records" notebook hoping that it would create new responses that are similar but not identical to the existing ones. When training, the RNN reached 91% accuracy, but when running the line-generation code the output kept giving me a list of lines in this format:

GenText(valid=False, text ="value1,value2,value3,etc...", explain='record not 6 parts', delimiter=',')

Here's the error message I got:

(screenshot attached)

Marketoptiontend-analysis

Are you reporting a bug or FR?

  • Bug
  • Feature Request

What version of synthetics are you using?

What would you like to see / What problem are you having?

Configuration Params

<What config params (dictionary / kwargs) are you using for model training>

Are you using GPU or a CPU?

<GPU/CPU>

What environment are you working in?

<Jupyter / Google Colab / etc>

What version of python are you using?

<3.6/3.7>

Describe the shape / types of the data you are training on

<DataFrame dimensions, what type of data in the columns, etc>

Please provide any tracebacks or error messages you are receiving

Incorrect batch sizes created in certain cases

When a batch size is not evenly divisible into the number of columns, the floor of the divisor is being used to calculate the number of batches, resulting in fewer batches, with larger batch sizes than expected.

For example, a dataset with 59 columns and a batch_size selected of 15 columns will result in 3 batches of [20, 20, 19] columns, vs. [15, 15, 15, 14] which is expected.

(screenshot attached)
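For reference, a small sketch in plain Python (independent of the library) shows the expected chunking:

# Ceiling-based chunking: every batch has at most batch_size columns,
# so 59 columns with batch_size=15 yields [15, 15, 15, 14].
columns = [f"col_{i}" for i in range(59)]
batch_size = 15
batches = [columns[i:i + batch_size] for i in range(0, len(columns), batch_size)]
print([len(b) for b in batches])  # [15, 15, 15, 14]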

TypeError: __init__() got an unexpected keyword argument 'prefetch_factor'

Hi gretelai developers, thanks for implementing this amazing work in Pytorch. I am currently researching on this subject and was trying out the code in the notebook example.

When running the example notebook timeseries_dgan.ipynb, I encountered the following error when running model.train_numpy: TypeError: __init__() got an unexpected keyword argument 'prefetch_factor'. How do I fix this error?

Size Limits

Are there limits to the number of columns in the dataframe that can be passed to train_rnn?

If I run the code with just 5 columns it works, but fails if I pass all 1500 columns. Wondering where the limit is exactly and why it's there.

TypeError: __init__() got an unexpected keyword argument 'field_delimiter'

TypeError: __init__() got an unexpected keyword argument 'field_delimiter'

config = LocalConfig(
    max_lines=0, # read all lines (zero)
    epochs=15, # 15-30 epochs for production
    vocab_size=200, # tokenizer model vocabulary size
    character_coverage=1.0, # tokenizer model character coverage percent
    gen_chars=0, # the maximum number of characters possible per-generated line of text
    gen_lines=10000, # the number of generated text lines
    rnn_units=256, # dimensionality of LSTM output space
    batch_size=64, # batch size
    buffer_size=1000, # buffer size to shuffle the dataset
    dropout_rate=0.2, # fraction of the inputs to drop
    dp=True, # let's use differential privacy
    dp_learning_rate=0.015, # learning rate
    dp_noise_multiplier=1.1, # control how much noise is added to gradients
    dp_l2_norm_clip=1.0, # bound optimizer's sensitivity to individual training points
    dp_microbatches=256, # split batches into minibatches for parallelism
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    save_all_checkpoints=False,
    field_delimiter= ",",
    input_data_path=annotated_file # filepath or S3
)

[BUG] : Loading a trained model and generating synthetic data throws an error

Are you reporting a bug or FR?

  • Bug
  • Feature Request

What version of synthetics are you using?
0.20.0

What problem are you having?
I am training a Timeseries DGAN model using the timeseries code provided by the gretel-synthetics library.
I have trained the model and saved it using model.save(path) provided in the dgan.py file. I am trying to load the file using the load function in the same dgan.py. After loading, trying to generate synthetic data with generate_numpy() fails with AttributeError: 'DGAN' object has no attribute 'attribute_noise_func'.

Configuration Params

max_sequence_len=2000,
sample_len=1,
batch_size=128,
epochs=1000,  # For real data sets, 100-1000 epochs is typical
feature_num_layers=5,
attribute_discriminator_learning_rate=0.001,
feature_num_units=100,
generator_learning_rate=0.001,
discriminator_learning_rate=0.001

Are you using GPU or a CPU?
GPU

What environment are you working in?

Tried in both Jupyter & Google Colab, got the same error

What version of python are you using?

Tried in both 3.8 (Colab) and 3.9 (Jupyter)

Describe the shape / types of the data you are training on
(42000,31)
Please provide any tracebacks or error messages you are receiving

  471                 internal_data_list.append(
    472                     self._generate(self.attribute_noise_func(self.config.batch_size),
--> 473                                    self.feature_noise_func(self.config.batch_size),))
    474 
    475             # Convert from list of tuples to tuple of lists with zip(*) and

AttributeError: 'DGAN' object has no attribute 'attribute_noise_func'

[BUG] `gast` minimum version incompatible with TensorFlow 2.4.x default

Are you reporting a bug or FR?

  • [ X ] Bug
  • Feature Request

What version of synthetics are you using?
0.15.0

What would you like to see / What problem are you having?
Gretel-synthetics defines a gast version of gast==0.4.0, which is ahead of gast==0.3.3 that is currently pinned by the TensorFlow 2.4.x release. This results in the following warnings while running training in differentially private mode.

Epoch 1/100

WARNING:tensorflow:Converting IndexedSlices(indices=Tensor("gradient_tape/sequential/embedding/embedding_lookup/Reshape_1:0", shape=(1600,), dtype=int32), values=Tensor("gradient_tape/sequential/embedding/embedding_lookup/Reshape:0", shape=(1600, 256), dtype=float32), dense_shape=Tensor("gradient_tape/sequential/embedding/embedding_lookup/VariableShape:0", shape=(2,), dtype=int32)) to a dense representation may make it slow. Alternatively, output the indices and values of the IndexedSlices separately, and handle the vectorized outputs directly.
2020-11-18 22:10:05,297 : MainThread : WARNING : Converting IndexedSlices(indices=Tensor("gradient_tape/sequential/embedding/embedding_lookup/Reshape_1:0", shape=(1600,), dtype=int32), values=Tensor("gradient_tape/sequential/embedding/embedding_lookup/Reshape:0", shape=(1600, 256), dtype=float32), dense_shape=Tensor("gradient_tape/sequential/embedding/embedding_lookup/VariableShape:0", shape=(2,), dtype=int32)) to a dense representation may make it slow. Alternatively, output the indices and values of the IndexedSlices separately, and handle the vectorized outputs directly.
WARNING:tensorflow:AutoGraph could not transform <function WhileV2.__call__.<locals>.while_fn at 0x7fe4a013a6a8> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'

Gast is not called explicitly by gretel-synthetics, and removing the dependency fixes the warning.

Configuration Params

Running the default diff_privacy.ipynb example notebook

Are you using GPU or a CPU?

GPU

What environment are you working in?

Colab

What version of python are you using?

3.7.5

Describe the shape / types of the data you are training on
Provided sample data (netflix)

Please provide any tracebacks or error messages you are receiving
Above

[BUG] example notebook error

What version of synthetics are you using?

0.21.0

What would you like to see / What problem are you having?

I created a new project, fresh conda environment with python==3.9.16

Then I copied this notebook:
https://github.com/gretelai/gretel-synthetics/blob/master/examples/synthetic_records.ipynb

And as per the notebook instructions, installed only:

  • gretel-synthetics[tf]
  • jupyter

It fails on the "train a model" (i.e. second code cell) with:

ImportError: cannot import name 'range_op' from 'tensorflow.python.data.ops'

Perhaps the instructions in the notebook are at odds with the instructions in the script, which recommends tensorflow==2.8.

I tried installing tensorflow==2.8 in addition to the aforementioned gretel-synthetics[tf] and jupyter, but this time running the first cell fails with:

ImportError: This version of TensorFlow Probability requires TensorFlow version >= 2.11; Detected an installation of version 2.8.0. Please upgrade TensorFlow to proceed.

Structure batch module for separate test / generate

Make some interface changes so the batch module can be used as separate train / test

  • Make input dataframe optional. If an input DF is not set, any training methods will throw a RuntimeError

  • When generating for a single batch, catch the RuntimeError that will be thrown by generate_lines() and simply always return a True/False from generate_batch_lines. This way, if there's a bad batch, you can introspect which batches failed, then re-train/generate for those batches only

Batch Size Question

Sorry for spamming the issues page. If there is a better way to communicate (slack, etc) let me know.

As I've mentioned, my dataset is about 46K rows X 1500 columns. I have the batch size set at 64. Doesn't this mean the number of batches is 46k/64 = 718? How do I end up with the 17K batches? Screenshot below.

(screenshot attached)

I attached the code below but it's the same as previous threads.

config = LocalConfig(
    max_lines=0, # read all lines (zero)
    epochs=15, # 15-30 epochs for production
    vocab_size=2000, # tokenizer model vocabulary size
    character_coverage=1.0, # tokenizer model character coverage percent
    gen_chars=0, # the maximum number of characters possible per-generated line of text
    max_line_len=55000,  # the max line length SentencePiece will consider
    gen_lines=10000, # the number of generated text lines
    rnn_units=256, # dimensionality of LSTM output space
    batch_size=64, # batch size
    buffer_size=1000, # buffer size to shuffle the dataset
    dropout_rate=0.2, # fraction of the inputs to drop
    dp=True, # let's use differential privacy
    dp_learning_rate=0.015, # learning rate
    dp_noise_multiplier=1.1, # control how much noise is added to gradients
    dp_l2_norm_clip=1.0, # bound optimizer's sensitivity to individual training points
    dp_microbatches=256, # split batches into minibatches for parallelism
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    save_all_checkpoints=False,
    field_delimiter=",",
    input_data_path=annotated_file # filepath or S3
)

train_rnn(config)

Simplify / minimize default parameters in Gretel Synthetics notebook

Simplify default configuration parameters by removing less commonly used configuration parameters from the example notebook.

Current configuration:

config = LocalConfig(
    max_lines=0, # use max_lines of training data. Set to 0 (zero) to train on all lines in the dataset
    epochs=15, # 15-50 epochs with GPU for best performance
    vocab_size=15000, # tokenizer model vocabulary size
    max_line_len=2048,  # the max line length SentencePiece will consider
    character_coverage=1.0, # tokenizer model character coverage percent
    gen_chars=0, # the maximum number of characters possible per-generated line of text
    gen_lines=100, # the number of generated text lines
    rnn_units=256, # dimensionality of LSTM output space
    batch_size=64, # batch size
    buffer_size=1000, # buffer size to shuffle the dataset
    dropout_rate=0.2, # fraction of the inputs to drop
    dp=True, # let's use differential privacy
    dp_learning_rate=0.015, # learning rate
    dp_noise_multiplier=1.1, # control how much noise is added to gradients
    dp_l2_norm_clip=1.0, # bound optimizer's sensitivity to individual training points
    dp_microbatches=256, # split batches into minibatches for parallelism
    checkpoint_dir=(Path.cwd() / 'checkpoints').as_posix(),
    field_delimiter=",",  # if the training text is structured
    # overwrite=True,  # enable this if you want to keep training models to the same checkpoint location
    input_data_path="https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/uber_scooter_rides_1day.csv" # filepath or S3
)

[BUG] train_numpy() got multiple values for argument 'feature_types' - dgan

I was trying to run the example code of dgan. But I keep getting the same errors and can't really understand where it's going wrong...

The code comes from: https://github.com/gretelai/public_research/blob/main/oss_doppelganger/sample_usage.ipynb

import numpy as np
import pandas as pd

from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig, OutputType, Normalization

attributes = np.random.randint(0, 3, size=(1000,3))
features = np.random.random(size=(1000,20,2))

model = DGAN(DGANConfig(
    max_sequence_len=20,
    sample_len=4,
    batch_size=1000,
    epochs=10,  # For real data sets, 100-1000 epochs is typical
))

model.train_numpy(
    attributes, features,
    attribute_types = [OutputType.DISCRETE] * 3,
    feature_types = [OutputType.CONTINUOUS] * 2
)

synthetic_attributes, synthetic_features = model.generate_numpy(1000)

This code produces the following error:

~\AppData\Local\Temp\ipykernel_19824\3379362806.py in <module>
     19     attributes, features,
     20     attribute_types = [OutputType.DISCRETE] * 3,
---> 21     feature_types = [OutputType.CONTINUOUS] * 2
     22 )
     23 

TypeError: train_numpy() got multiple values for argument 'feature_types'
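Based on the train_numpy signature shown in a traceback further down this page (features first, then feature_types, attributes, attribute_types), a likely fix is to pass both arrays by keyword; this is a hedged sketch, not a confirmed resolution for 0.19.0.

model.train_numpy(
    features=features,
    attributes=attributes,
    attribute_types=[OutputType.DISCRETE] * 3,
    feature_types=[OutputType.CONTINUOUS] * 2,
)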

Miniconda3 with the following libs installed with pip (used with a Jupyter notebook within VSCode)

  • python: 3.7.15
  • gretel-synthetics: 0.19.0
  • numpy: 1.21.6

[FR / BUG]

Are you reporting a bug or FR?

  • [*] Feature Request

What version of synthetics are you using?

0.19.0

What would you like to see / What problem are you having?

An evaluation module is highly required

It is both hard and confusing to compare the correlation of the real data against the synthetic data.
This is a very straightforward and easy-to-use library, which makes it possible for everyone to use it whether or not they have a solid statistical and ML background. So an evaluation module is highly desirable, because most people can't trust the synthetic data they create using this library if they lack the technical ability to correctly evaluate it.

Configuration Params


Are you using GPU or a CPU?

GPU

What environment are you working in?

Google Colab

What version of python are you using?

3.8

Describe the shape / types of the data you are training on

TimeSeries

Please provide any tracebacks or error messages you are receiving


Speed up synthetic data generation

Currently, text generation only utilizes about 30% of the GPU capacity on a Nvidia Tesla P4/T4. Look into ways to optimize any bottlenecks and speed up text generation.

Note: Text generation per line may not be very parallelizable (each next step requires the output of the previous step), but it may be possible to generate multiple outputs at the same time.

Results about DGAN

Hello, I got good inspiration from Gretel. I encountered several problems in my work.
As we all know, the DGAN model takes multiple time series as input, and its output is also multiple time series. (At first I thought DGAN could generate one long time series.)
My downstream task is to use the generated data to augment the original data. I computed statistics on the multiple time series generated by DGAN and found that their distribution is very different from that of the original data, which is not what I expected. (I hope that after merging the generated time series, they still conform to the distribution of the original data.)
I saw in Gretel's blog that the distributions of the raw data and the generated data match, and the results in the blog are also the results I want to achieve.
https://gretel.ai/blog/creating-synthetic-time-series-data
https://github.com/gretelai/gretel-synthetics/blob/master/examples/timeseries_dgan.ipynb
I look forward to your guidance and reply.

Config param docstrings

Create detailed docstrings for every possible param to the configs, right now we don't have any guidance on how they impact training / generation

Performance issues in src/gretel_synthetics/tensorflow/train.py(P2)

Hello, I found a performance issue in the definition of _create_dataset in
src/gretel_synthetics/tensorflow/train.py:
full_dataset = sequences.map was called without num_parallel_calls.
I think it will increase the efficiency of your program if you add this.

The same issue also exists in .map(recover).

Here is the TensorFlow documentation to support this.
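A hedged sketch of the suggested change, with toy stand-ins for the real dataset and mapping function, is shown below; the actual code in train.py may differ.

import tensorflow as tf

# Toy stand-ins for the tokenized sequences and the mapped function
sequences = tf.data.Dataset.from_tensor_slices(tf.reshape(tf.range(40), (4, 10)))
split = lambda chunk: (chunk[:-1], chunk[1:])

# The suggested change: pass num_parallel_calls so map() runs elements in parallel
full_dataset = sequences.map(split, num_parallel_calls=tf.data.AUTOTUNE)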

Looking forward to your reply. Btw, I am very glad to create a PR to fix it if you are too busy.

Elaborate on the inner workings of the modelling

Hi, Nice work! Really nice blogpost and repo. Since this is an area of interest of mine, I was wondering if you could elaborate a bit on some aspects.

As far as I see, as the encoding for the data you are using some form of SentencePiece, which I interpret to mean that you encode each table cell as a token for the encoder. Then, in the model, this token is mapped to a single embedding layer for all information. Correct?

How is training done? Does it just try to predict the next token at each timestep (also for each row)?

Now, I'm not extremely familiar with differential privacy, but I think it's very nice that you have it built in. My question would be: since continuous values are also encoded as tokens, is there not a high chance of getting duplicate rows from the train set? Specifically with smaller datasets like the Heart Disease dataset you use in the blog post. And with these small datasets, do you see a large difference in results between training with and without DP?

Lastly, how do you evaluate the similarity between your real and fake data? Did you do in depth analysis of why the results of some of these classifiers were higher?

gretel not finding GPU

Hi. I've been trying to run the examples but I keep getting a warning that the GPU is not found. !nvidia-smi does show the GPU however. This is all a fresh install within a conda environment.

conda create --name gretel python=3.8
conda activate gretel
pip install gretel-synthetics[tf]

(screenshots attached)

Thanks for the help,

[BUG] Incompatibility with package dependency

Are you reporting a bug or FR?

  • [ x] Bug

What version of synthetics are you using?

v0.17.0

What would you like to see / What problem are you having?

(screenshot attached)

I am wondering if you have any idea of where this error is coming from and if this versioning warning poses any issue for the functionality of the code.

Configuration Params

Unsure

Are you using GPU or a CPU?

CPU

What environment are you working in?

Google Colab on the drive (not locally)

What version of python are you using?

3.7

Describe the shape / types of the data you are training on

Netflix Prize Data

<DataFrame dimensions, what type of data in the columns, etc>

24053764 x 4, without headings.

Please provide any tracebacks or error messages you are receiving
Please see above

Question on continuous values

Hi. I'm wondering how Gretel deals with continuous values in the input data. I can see how categorical values can be similar to words in a text but what about continuous ones?

Thanks

[FR] Generation based on given attributes

Hi,

I wondered if it is already possible to generate synthetic data based on given values from attributes? I remember that the original paper allows this?

Otherwise I might further look into this to implement it myself

Kind regards!

Poor training results

First of all, thank you for your research results! This is very helpful for my current research. I ran the code in timeseries_dgan.ipynb, and the results are what I need for my current research. Therefore, I tried to perform the same processing on my dataset, but the distribution of the generated data and the original data is quite different. I would like to ask what caused this and in what direction I should improve.
As shown in the figures (screenshots attached), my dataset has a total of 37 attributes, all of which are discrete data. After visualization, I found that only 1-3 attributes met my expectations.

DGAN for ECG dataset

Hello,

I have been trying to apply the DoppelGANger (DGAN) model to my 1-lead ECG dataset to generate synthetic data, but after some tries and tuning some basic hyperparameters, the model does not learn the pattern of the ECG.
So, I just wanted to make sure if DGAN is even applicable for a Biosignal or ECG data generation.
Any suggestions are welcome!

Thank you in advance.

NameError: name 'line' is not defined

This piece of code errors. I assume line should be replaced with record; is that correct?

for idx, record in enumerate(generate_text(config, line_validator=validate_record)):
    status = record.valid
    
    # ensure all generated records are unique
    synth_df = synth_df.drop_duplicates()
    synth_cnt = len(synth_df.index)
    if synth_cnt > records_to_generate:
        break 

    # if generated record passes validation, save it
    if status:
        print(f"({synth_cnt}/{records_to_generate} : {status})")        
        print(f"{line.text}")
        data = line.values_as_list()
        synth_df = synth_df.append({k:v for k,v in zip(df.columns, data)}, ignore_index=True)
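The loop variable is bound as record, so the references to line in the body are undefined; the sketch below is the same loop with only that rename applied.

for idx, record in enumerate(generate_text(config, line_validator=validate_record)):
    status = record.valid

    # ensure all generated records are unique
    synth_df = synth_df.drop_duplicates()
    synth_cnt = len(synth_df.index)
    if synth_cnt > records_to_generate:
        break

    # if the generated record passes validation, save it
    if status:
        print(f"({synth_cnt}/{records_to_generate} : {status})")
        print(f"{record.text}")
        data = record.values_as_list()
        synth_df = synth_df.append({k: v for k, v in zip(df.columns, data)}, ignore_index=True)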
            

[BUG] ModuleNotFoundError: No module named 'gretel_helpers'

Are you reporting a bug or FR?

  • Bug

What version of synthetics are you using?

0.20.0

What would you like to see / What problem are you having?

I'm unable to import gretel in my code.
from gretel_helpers.synthetics import create_df, SyntheticDataBundle gives me import error.

Are you using GPU or a CPU?

CPU

What environment are you working in?

VSCode

What version of python are you using?

3.10.14

Please provide any tracebacks or error messages you are receiving

 from gretel_helpers.synthetics import create_df, SyntheticDataBundle
ModuleNotFoundError: No module named 'gretel_helpers'

About DoppelGANger training results

Hello, I am very lucky to get inspiration from Gretel. I have encountered some questions. I hope you can help me answer them.
As we all know, the DoppelGANger model in Gretel is used to generate time series models. Let me introduce my question by analogy.
I have a time series data set. The "Date" column ranges from January 1, 2000 to December 31, 2000. The data is recorded daily. According to the input format required by DoppelGANger, 3D data (*, , man_sequence_len) needs to be segmented according to "max_sequence_len"
After model training, the generated data is also three-dimensional (
, *, man_sequence_len), such as max_ sequence_ Len=5, the "Date" column of all the data I generated will be repeated between January 1, 2000 and January 5, 2005, and have an "example_id" column.
My question is as follows:

  1. Does the data in the same "example_id" column represent the trend of the entire original data? If not, what does he represent?
  2. Can I restore the "Date" column of the generated data to the "Date" column of the original data (a complete time series)?

Batch module

Create an experimental module that does batch training / generation based on feature counts. This is separate from the RNN batch size, but instead we will break the training data down by clusters of columns and train independently on those.

Data Storage

Hi -

Can you confirm no data passed to gretel-synthetics is stored anywhere?

Thanks,
Zak

Update gen_lines behavior

  • Allow the gen_lines param in the config to specify how many valid lines to generate

Proposed behavior:

  • if no validator is used, generate exactly gen_lines

  • if a validator is used, generate gen_lines number of valid lines, with a configurable max number of invalid lines to tolerate. when the number of invalid lines is reached, stop generating

timeseries_dgan.ipynb example - error from train_numpy

Hello,

Trying to run the example in the notebook, this part results in an error:

model.train_numpy(
    features,
    feature_types=[OutputType.CONTINUOUS] * features.shape[2],
)

File "C:\Users\Ran\anaconda3\envs\eeg_gan_v6\lib\site-packages\gretel_synthetics\timeseries_dgan\dgan.py", line 186, in train_numpy if attributes.shape[0] != features.shape[0]:
AttributeError: 'NoneType' object has no attribute 'shape'

I managed to make it run by simply declaring and passing attributes:

attributes = np.zeros((features.shape[0], 1))
model.train_numpy(
    features,
    feature_types=[OutputType.CONTINUOUS] * features.shape[2],
    attributes=attributes
)

So:

  1. Either this argument isn't optional or I have made a mistake :)
  2. Can you please elaborate more on what exactly attributes are? Are they metadata/labels? I'm not entirely sure from the documentation.

Thank you

Sample_len Value

Are you reporting a bug or FR?
No.

What version of synthetics are you using?

What is the variable "Sample_len" and what does its value stand for? How is this value chosen?

<What config params (dictionary / kwargs) are you using for model training>

Am using GPU

What environment are you working in?

What version of python are you using?

<3.9>

Describe the shape / types of the data you are training on

<15,884 X 11, floating numbers> daily data and each entry represents a day.

Also, is there any published paper describing this work/approach? If yes, please provide a link to the paper.

I will be very grateful for your prompt response

Duplicates present in Unique columns

Hi, I have used the template for generating synthetic data from your own CSV. I gave 1000 records as input and requested 1000 records as output.
I attached the input and output. In the output, the column ref_id is a primary key, but some duplicates are being generated. Is there any configuration to be set so that it will generate unique values in primary-key columns?
file.zip

[BUG]: Outdated category_encoders

Are you reporting a bug or FR?

  • Bug
    Solved by upgrading category_encoders:
pip uninstall category_encoders
pip install category_encoders

What version of synthetics are you using?
0.20.0

What would you like to see / What problem are you having?
Problem with category_encoders version. Current version in gretel-synthetics is 2.2.2. Latest version is 2.6.0 (https://github.com/scikit-learn-contrib/category_encoders).

Configuration Params


from gretel_synthetics.timeseries_dgan.config import DGANConfig

config = DGANConfig(
    max_sequence_len=max_days,
    sample_len=1,
    generator_learning_rate=1e-4,
    discriminator_learning_rate=1e-4,
    epochs=epochs
)

model = DGAN(config)

model.train_dataframe(
    df = real_df_truc,
    example_id_column = id_col,
    feature_columns = feature_cols,
    attribute_columns = attribute_cols,
    time_column = time_col,
    df_style = 'long',
)

Are you using GPU or a CPU?
GPU

What environment are you working in?
Jupyter

What version of python are you using?
3.8.8

Describe the shape / types of the data you are training on
1 example ID column
3 feature columns, all numerical
10 attribute columns, mix of categorical and numerical
1 time column, in YYYY-MM-DD

Please provide any tracebacks or error messages you are receiving

2023-04-13 23:50:09,577 : MainThread : INFO : Marking column XXX as discrete because its type is string/object.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<timed eval> in <module>

~/.local/lib/python3.8/site-packages/gretel_synthetics/timeseries_dgan/dgan.py in train_dataframe(self, df, attribute_columns, feature_columns, example_id_column, time_column, discrete_columns, df_style, progress_callback)
    393         attributes, features = self.data_frame_converter.convert(df)
    394 
--> 395         self.train_numpy(
    396             attributes=attributes,
    397             features=features,

~/.local/lib/python3.8/site-packages/gretel_synthetics/timeseries_dgan/dgan.py in train_numpy(self, features, feature_types, attributes, attribute_types, progress_callback)
    238 
    239         if not self.is_built:
--> 240             attribute_outputs, feature_outputs = create_outputs_from_data(
    241                 attributes,
    242                 features,

~/.local/lib/python3.8/site-packages/gretel_synthetics/timeseries_dgan/transformations.py in create_outputs_from_data(attributes, features, attribute_types, feature_types, normalization, apply_feature_scaling, apply_example_scaling, binary_encoder_cutoff)
    399             )
    400         attribute_types = cast(List[OutputType], attribute_types)
--> 401         attribute_outputs = [
    402             create_output(
    403                 index,

~/.local/lib/python3.8/site-packages/gretel_synthetics/timeseries_dgan/transformations.py in <listcomp>(.0)
    400         attribute_types = cast(List[OutputType], attribute_types)
    401         attribute_outputs = [
--> 402             create_output(
    403                 index,
    404                 t,

~/.local/lib/python3.8/site-packages/gretel_synthetics/timeseries_dgan/transformations.py in create_output(index, t, data, normalization, apply_feature_scaling, apply_example_scaling, binary_encoder_cutoff)
    486         raise RuntimeError(f"Unknown output type={t}")
    487 
--> 488     output.fit(data.flatten())
    489 
    490     return output

~/.local/lib/python3.8/site-packages/gretel_synthetics/timeseries_dgan/transformations.py in fit(self, column)
     41             raise ValueError("Expected 1-d numpy array for fit()")
     42 
---> 43         self._fit(column)
     44         self.is_fit = True
     45 

~/.local/lib/python3.8/site-packages/gretel_synthetics/timeseries_dgan/transformations.py in _fit(self, column)
    123         self._encoder = OneHotEncoder(cols=0, return_df=False)
    124 
--> 125         self._encoder.fit(column)
    126 
    127     def _transform(self, column: np.ndarray) -> np.ndarray:

~/.local/lib/python3.8/site-packages/category_encoders/one_hot.py in fit(self, X, y, **kwargs)
    149             handle_missing='value'
    150         )
--> 151         self.ordinal_encoder = self.ordinal_encoder.fit(X)
    152         self.mapping = self.generate_mapping()
    153 

~/.local/lib/python3.8/site-packages/category_encoders/ordinal.py in fit(self, X, y, **kwargs)
    131             self.cols = util.get_obj_cols(X)
    132         else:
--> 133             self.cols = util.convert_cols_to_list(self.cols)
    134 
    135         if self.handle_missing == 'error':

~/.local/lib/python3.8/site-packages/category_encoders/utils.py in convert_cols_to_list(cols)
     19     elif isinstance(cols, tuple):
     20         return list(cols)
---> 21     elif pd.api.types.is_categorical(cols):
     22         return cols.astype(object).tolist()
     23 

AttributeError: module 'pandas.api.types' has no attribute 'is_categorical'

Bug

Are you reporting a bug or FR?

  • Bug

I'm trying to import ACTGAN but I'm facing this error!

"No module named 'rdt'"

How come I'm unable to load the library?

(screenshot attached)

List index out of range

Hi, I am getting this error while training on a financial dataset that I have. I was trying to play with the hyperparameters to generate the data, where I am not necessarily getting good results. But when I increased the generator rounds and discriminator rounds to 2, after a few epochs of training (22-24), this error pops up.
It seems the length of inputs becomes 0 instead of 3 for that particular epoch. How do I fix this issue? Also, what parameters do you recommend for training on bank transactional data and generating it (for many attributes and features)?

File ~/anaconda3/envs/dgtpy39env/lib/python3.9/site-packages/gretel_synthetics/timeseries_dgan/dgan.py:851, in DGAN._discriminate(self, batch)
849 # Flatten the features
850 print("Hi ",len(inputs))
851 inputs[-1] = torch.reshape(inputs[-1], (inputs[-1].shape[0], -1))
853 input = torch.cat(inputs, dim=1)
855 output = self.feature_discriminator(input)

IndexError: list index out of range

Somehow the data is getting changed or something? Please help.

I am able to run upto 1200 epochs easily without playing with the generator and discriminator rounds parameter.

UnicodeDecodeError when training synthetics on Kaggle dataset

Running on a downloaded Kaggle dataset: /home/amy/sdata/US_household_income_NH.csv (my cloud server), I get the error:

~/env/ml/lib/python3.7/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 4016: invalid continuation byte

Running iconv -c -f utf-8 -t ascii on the file resolves the error
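An equivalent workaround in Python (a sketch only; the output filename is made up) is to re-encode the file and drop the undecodable bytes before training:

# Read the raw bytes, drop anything that is not valid UTF-8, and write a clean copy
src = "/home/amy/sdata/US_household_income_NH.csv"
dst = "/home/amy/sdata/US_household_income_NH.clean.csv"

with open(src, "rb") as f:
    text = f.read().decode("utf-8", errors="ignore")

with open(dst, "w", encoding="utf-8") as f:
    f.write(text)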
