sdv-dev / sdgym

Benchmarking synthetic data generation methods.

License: Other

Python 95.62% Makefile 4.03% Dockerfile 0.34%
synthetic-data benchmark tabular-data synthetic-data-vault sdgym-synthesizers generative-adversarial-network deep-learning generative-ai generative-models

sdgym's Introduction


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.


Overview

The Synthetic Data Gym (SDGym) is a benchmarking framework for modeling and generating synthetic data. Measure performance and memory usage across different synthetic data modeling techniques -- classical statistics, deep learning and more!

The SDGym library integrates with the Synthetic Data Vault ecosystem. You can use any of its synthesizers, datasets or metrics for benchmarking. You can also customize the process to include your own work.

  • Datasets: Select any of the publicly available datasets from the SDV project, or input your own data.
  • Synthesizers: Choose from any of the SDV synthesizers and baselines. Or write your own custom machine learning model.
  • Evaluation: In addition to performance and memory usage, you can also measure synthetic data quality and privacy through a variety of metrics.

Install

Install SDGym using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install sdgym
conda install -c pytorch -c conda-forge sdgym

For more information about using SDGym, visit the SDGym Documentation.

Usage

Let's benchmark synthetic data generation for single tables. First, define which modeling techniques to use: a few synthesizers from the SDV library, plus a few others to use as baselines.

# these synthesizers come from the SDV library
# each one uses different modeling techniques
sdv_synthesizers = ['GaussianCopulaSynthesizer', 'CTGANSynthesizer']

# these basic synthesizers are available in SDGym
# as baselines
baseline_synthesizers = ['UniformSynthesizer']

Now, we can benchmark the different techniques:

import sdgym

sdgym.benchmark_single_table(
    synthesizers=(sdv_synthesizers + baseline_synthesizers)
)

The result is a detailed performance, memory and quality evaluation across the synthesizers on a variety of publicly available datasets.

Supplying a custom synthesizer

Benchmark your own synthetic data generation techniques. Define your synthesizer by specifying the training logic (using machine learning) and the sampling logic.

def my_training_logic(data, metadata):
    # create an object to represent your synthesizer
    # and train it using the data
    synthesizer = ...  # fit your model here
    return synthesizer

def my_sampling_logic(trained_synthesizer, num_rows):
    # use the trained synthesizer to create
    # num_rows of synthetic data
    synthetic_data = ...  # sample from your model here
    return synthetic_data
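
For a concrete illustration, here is a minimal pair of functions that satisfies this interface. The memorizing "synthesizer" is purely illustrative (it just resamples the real data), and the create_single_table_synthesizer wrapper shown is our best understanding of the registration API; check the Custom Synthesizers Guide for the exact call.

import sdgym

def get_trained_synthesizer(data, metadata):
    # illustration only: "train" by memorizing the real data
    return {'data': data}

def sample_from_synthesizer(trained_synthesizer, num_rows):
    # sample num_rows rows with replacement from the memorized data
    return trained_synthesizer['data'].sample(num_rows, replace=True)

# hypothetical registration call; see the Custom Synthesizers Guide
my_synthesizer = sdgym.create_single_table_synthesizer(
    'MyMemorizer', get_trained_synthesizer, sample_from_synthesizer)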

Learn more in the Custom Synthesizers Guide.

Customizing your datasets

The SDGym library includes many publicly available datasets that you can include right away. List these using the get_available_datasets feature.

sdgym.get_available_datasets()
dataset_name   size_MB     num_tables
KRK_v1         0.072128    1
adult          3.907448    1
alarm          4.520128    1
asia           1.280128    1
...

You can also include any custom, private datasets that are stored on your computer or in an Amazon S3 bucket.

my_datasets_folder = 's3://my-datasets-bucket'
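
As a hedged sketch of how this folder might be plugged into the benchmark (the parameter name additional_datasets_folder is an assumption; see the Customized Datasets docs for the exact API):

sdgym.benchmark_single_table(
    synthesizers=['GaussianCopulaSynthesizer'],
    additional_datasets_folder=my_datasets_folder,  # hypothetical parameter name
)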

For more information, see the docs for Customized Datasets.

What's next?

Visit the SDGym Documentation to learn more!




The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

sdgym's People

Contributors

amontanez24, baukebrenninkmeijer, csala, elesa, fealho, jdtheripperpc, k15z, katxiao, lajohn4747, leix28, manuelalvarezc, pvk-developer, r-palazzo, sbrugman, sdv-team, tejuafonja


sdgym's Issues

Typos in the sdgym.run function docstring

  • SDGym version: 0.3.0

Description

There are some typos in the docstring of the sdgym.run function.
This is the exact URL where the typos appear -> Link

  • The argument show_progress defaults to False, but the docstring says it defaults to True.
  • The argument iterations defaults to 1, but the docstring says it defaults to 3.

Cap development versions

The docs build broke after Sphinx 3.0.0 was released.

The main dependencies are all already capped to the next major version (or minor, if 0.x) to improve robustness against API changes, but the development and test dependencies are not.

Let's cap them as well.
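
For illustration, a hypothetical excerpt of what the capped development dependencies in setup.py could look like (the package names and bounds below are examples, not the actual list):

development_requires = [
    # docs
    'Sphinx>=1.7.1,<3',
    'sphinx_rtd_theme>=0.2.4,<0.5',
    # style check
    'flake8>=3.7.7,<4',
    'isort>=4.3.4,<5',
]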

KeyError: 'structure' when using benchmark(synthesizer.fit_sample)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zhengxr/anaconda3/lib/python3.7/site-packages/sdgym/benchmark.py", line 38, in benchmark
    scores = evaluate(train, test, synthesized, meta)
  File "/home/zhengxr/anaconda3/lib/python3.7/site-packages/sdgym/evaluate.py", line 366, in evaluate
    performance = evaluator(synthesized_data, test, metadata)
  File "/home/zhengxr/anaconda3/lib/python3.7/site-packages/sdgym/evaluate.py", line 297, in _evaluate_bayesian_likelihood
    structure_json = json.dumps(metadata['structure'])
KeyError: 'structure'

Loss of generator keeps increasing

On a testing dataset of mine, loss_g keeps increasing.
Also, loss_mean and loss_std do not seem to converge, even after hundreds of epochs.

Increase code style lint

Problem Description

Currently our code is validated only by 'vanilla' flake8 and just a few plugins. We would like to increase the code style checks by adding more add-ons that follow our code style and standards.

We would also like to ensure that our docstrings are properly written and follow the rest of our format.

Additional context

We have already performed this task on RDT, more precisely in the following issue:
sdv-dev/RDT#248 (comment)

Docstring plugin

We need to add the pydocstyle plugin with the following lines in our setup.cfg file, as we follow the Google convention.

[pydocstyle]
convention = google
add-ignore = D107, D407, D417

Flake8 plugins to be added

Flake8 comes with a lot of different add-ons that we can use to adapt it to our code style and checks. Here is a list of plugins that I found interesting for us (a configuration sketch follows the list):

  • flake8-builtins - Check for Python builtins being used as variables or parameters.
  • flake8-comprehensions - Helps you write better list/set/dict comprehensions.
  • flake8-debugger - Debug statement checker.
  • flake8-variables-names - Extension that helps to make variable names more readable.
  • Dlint - Tool for encouraging best coding practices and helping ensure Python code is secure.
  • flake8-mock - Provides checking for non-existent mock methods.
  • flake8-fixme - Check for FIXME, TODO and other temporary developer notes.
  • flake8-eradicate - Plugin to find commented out or dead code.
  • flake8-mutable - Extension for mutable default arguments.
  • flake8-print - Check for print statements in Python files.
  • flake8-pytest-style - Plugin checking common style issues or inconsistencies in pytest-based tests.
  • flake8-quotes - Extension for checking quotes in Python.
  • flake8-multiline-containers - Plugin to ensure a consistent format for multiline containers.
  • pandas-vet - Plugin that provides opinionated linting for pandas code.
  • pep8-naming - Check PEP-8 naming conventions.
  • flake8-expression-complexity - Plugin to validate expression complexity.
  • flake8-sfs - String formatting checks.
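
For reference, flake8 plugins register themselves automatically once installed, so adding them to the development requirements is enough; individual codes can then be tuned in setup.cfg. A hypothetical excerpt (the codes shown are examples only):

[flake8]
max-line-length = 99
exclude = docs, .git, __pycache__
# e.g. silence a plugin family that clashes with our style
ignore = SFS3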

Reproducibility

  • SDGym version: 0.4.2
  • Python version: 3.6
  • Operating System: Ubuntu

Description

I am trying to reproduce all the results reported in the CTGAN paper.
However, I cannot fully reproduce the reported results:

  1. For CTGAN, by running the code below, I can rarely reproduce the same results.
  2. For the credit dataset, it seems that the dataset in the sdgym package is not the same as the one reported in the paper.

What I Did

To reproduce, I follow the demo:

import sdgym
from sdv.tabular import GaussianCopula, CTGAN
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, Identity, Independent,
    MedGAN)

scores = sdgym.run(synthesizers=CTGAN, datasets=['asia'])
scores = sdgym.run(synthesizers=Identity, datasets=['credit'])

Store SDGym results into an S3 bucket

  • SDGym version: 0.3.0

Description

The current SDGym implementation can produce a results table either as a DataFrame (when run from Python) or as a CSV file stored on the local disk.

It should also be possible to store the results in an S3 bucket, which would be triggered by passing an output_path that contains the S3 prefix:

Python:

sdgym.run(..., output_path='s3://my-bucket/path/to/my/results.csv')

CLI

sdgym run ... -o s3://my-bucket/path/to/my/results.csv

If the bucket is private, the AWS key and secret introduced in PR #74 should be used.
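
A minimal sketch of the prefix routing, assuming boto3 (the function name and arguments are hypothetical, not SDGym's actual implementation):

import boto3

def write_results(scores, output_path, aws_key=None, aws_secret=None):
    # route to S3 when the path carries the s3:// prefix
    if output_path.startswith('s3://'):
        bucket, key = output_path[len('s3://'):].split('/', 1)
        client = boto3.client('s3', aws_access_key_id=aws_key,
                              aws_secret_access_key=aws_secret)
        client.put_object(Bucket=bucket, Key=key,
                          Body=scores.to_csv(index=False).encode('utf-8'))
    else:
        # plain local path: write the CSV to disk
        scores.to_csv(output_path, index=False)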

Error benchmarking Tablegan on the Intrusion Dataset

An error shows up when TableGAN is benchmarked on the intrusion dataset.

This is the traceback:

2020-05-11 22:09:26,240 - INFO - base - Sampling TableganSynthesizer
2020-05-11 22:09:28,757 - ERROR - benchmark - Error computing scores for TableganSynthesizer on dataset intrusion - iteration 0
Traceback (most recent call last):
  File "/home/xals/Projects/SDGym/sdgym/benchmark.py", line 71, in compute_benchmark
    scores = compute_scores(train, test, synthesized, meta)
  File "/home/xals/Projects/SDGym/sdgym/evaluate.py", line 358, in compute_scores
    scores = evaluator(synthesized_data, test, metadata)
  File "/home/xals/Projects/SDGym/sdgym/evaluate.py", line 162, in _evaluate_multi_classification
    x_train, y_train, x_test, y_test, classifiers = _prepare_ml_problem(train, test, metadata)
  File "/home/xals/Projects/SDGym/sdgym/evaluate.py", line 143, in _prepare_ml_problem
    x_train, y_train = fm.make_features(train)
  File "/home/xals/Projects/SDGym/sdgym/evaluate.py", line 132, in make_features
    feature = encoder.fit_transform(col)
  File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 631, in fit_transform
    return self.fit(X).transform(X)
  File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 493, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 80, in _fit
    X_list, n_samples, n_features = self._check_X(X)
  File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 49, in _check_X
    X_temp = check_array(X, dtype=None)
  File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/home/xals/.virtualenvs/SDGym/lib/python3.6/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Benchmark IndependentSynthesizer raises ValueError

IndependentSynthesizer raises a ValueError with the intrusion dataset.

What I did:

In [1]: from sdgym import benchmark                                                                  

In [2]: from sdgym.synthesizers import IndependentSynthesizer, MedganSynthesizer, VEEGANSynthesizer  

In [3]: independent = IndependentSynthesizer()                                                       

In [4]: benchmark(independent.fit_sample, datasets=['intrusion'])                                    
INFO - Evaluating dataset intrusion
INFO - Fitting IndependentSynthesizer
INFO - Sampling IndependentSynthesizer
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-00ab287a3ba6> in <module>
----> 1 benchmark(independent.fit_sample, datasets=['intrusion'])

~/Projects/SDGym/sdgym/benchmark.py in benchmark(synthesizer, datasets, repeat)
     35 
     36         for iteration in range(repeat):
---> 37             synthesized = synthesizer(train, categoricals, ordinals)
     38             scores = evaluate(train, test, synthesized, meta)
     39             scores['dataset'] = name

~/Projects/SDGym/sdgym/synthesizers/base.py in fit_sample(self, data, categorical_columns, ordinal_columns)
     18 
     19         LOGGER.info("Sampling %s", self.__class__.__name__)
---> 20         return self.sample(data.shape[0])

~/Projects/SDGym/sdgym/synthesizers/independent.py in sample(self, samples)
     38                 data[:, i] = data[:, i].clip(info['min'], info['max'])
     39             else:
---> 40                 data[:, i] = np.random.choice(np.arange(info['size']), samples, p=self.models[i])
     41 
     42         return data

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: 'a' and 'p' must have same size

The problem is that the sampling uses the size stored inside info instead of the actual size of the fitted models.

Removal of RNN and attention in TGAN

Hi guys, sorry for bothering you again!

I noticed the TGAN model was changed, removing the recurrent structure as well as the attention layer. I was wondering why these changes were made, since the original performed very well. Was it due to space/time limitations in this new setup? Or did you also see improved results with these changes?

F-score is ill-defined and being set to 0.0 & won't converge

I don't understand why this is happening, since I followed the instructions in the readme file.
Any suggestions?

https://colab.research.google.com/drive/1sBFKvFy5D_ssGtIVMmm6QiKl8H7NXEBm

# evaluating performance of a built-in synthesizer
synthesizer = IndependentSynthesizer()
benchmark(synthesizer.fit_sample)

/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
(the warnings above repeat many times)

Need Help on benchmark function for real data

The benchmark function requires a my_synthesizer_function that takes real data, categorical and ordinal features as input and outputs synthesized data. However, the documentation provided is not sufficient for a novice like me, so I am facing issues implementing it. Moreover, the benchmark function takes data from predefined default datasets, each with its own metadata file stored on the server in JSON format. This does not let me benchmark on my own data, since I don't have metadata ready for my datasets; there are quite a few of them and they are large.

Any detailed documentation on how to use this benchmark function more efficiently would be helpful.
Thanks a lot for such a beautiful package.
I am new to this domain.

Failing for custom dataset

[screenshot: error traceback]
This is the error message that we get while using the evaluation function to get the scores. The command we use is: scores = evaluate(train, test, sampled, meta)
Synthesizer being used: UniformSynthesizer
[screenshot: second error traceback]

This second error is for IndependentSynthesizer.

I'm facing some issues when I use a custom dataset. Can you help me with this?

Store cache contents into an S3 bucket

  • SDGym version: 0.3.0

Description

Similarly to what is described in #80, it should be possible to store the cache contents in an S3 bucket.

The behavior would be similar to the results path: specifying the s3:// prefix in the cache_dir path triggers the S3 storage.

Python

sdgym.run(..., cache_dir='s3://my-bucket/path/to/my/cache/dir')

CLI

sdgym run ... -c s3://my-bucket/path/to/my/cache/dir

The collect command introduced in PR #78 should also be adapted to read the cache contents from S3 and store the resulting CSV file to S3.

Unable to list all the datasets available in sdgym

  • SDGym version: 0.3.0
  • Python version: 3.6 and 3.7
  • Operating System: Ubuntu 18.04.5 LTS

Description

  • Issue displaying all the datasets available in sdgym.
  • The problem was discussed in Slack with @csala. The link to the discussion is here

ConvergenceWarning: Initialization 1 did not converge

  • SDGym version: sdgym == 0.2.1
  • Python version: Python ==3.6.10
  • Operating System: Windows 10

Description

Thanks very much for open-sourcing the code.

While running the script provided in the benchmarking documentation as follows:

from sdgym import benchmark
from sdgym.synthesizers import CTGANSynthesizer
leaderboard = benchmark(synthesizers=CTGANSynthesizer)

I encountered the following ConvergenceWarning:

ConvergenceWarning: Initialization 1 did not converge. Try different init parameters, or increase max_iter, tol or check for degenerate data.% (init + 1), ConvergenceWarning)


Could you please tell me whether it is OK to ignore the ConvergenceWarning above, or give me some other suggestions?
Thank you in advance!

Losses become nan after some number of epochs

I've been using your nicely implemented TableGAN and TGAN for some training, but after 10-15 epochs my loss becomes NaN.

epoch 12 step 1200
tensor(0.0802, device='cuda:0', grad_fn=<SubBackward0>)
tensor(6.9029, device='cuda:0', grad_fn=<NegBackward>) None

epoch 12 step 1250
tensor(0.1213, device='cuda:0', grad_fn=<SubBackward0>)
tensor(5.7503, device='cuda:0', grad_fn=<NegBackward>) None

epoch 12 step 1300
tensor(0.2494, device='cuda:0', grad_fn=<SubBackward0>)
tensor(3.1711, device='cuda:0', grad_fn=<NegBackward>) None

epoch 12 step 1350
tensor(0.1131, device='cuda:0', grad_fn=<SubBackward0>)
tensor(8.1661, device='cuda:0', grad_fn=<NegBackward>) None

epoch 12 step 1400
tensor(0.1474, device='cuda:0', grad_fn=<SubBackward0>)
tensor(6.2397, device='cuda:0', grad_fn=<NegBackward>) None

epoch 12 step 1450
tensor(nan, device='cuda:0', grad_fn=<SubBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NegBackward>) None

epoch 12 step 1500
tensor(nan, device='cuda:0', grad_fn=<SubBackward0>)
tensor(nan, device='cuda:0', grad_fn=<NegBackward>) None

It seems possibly related to the data definition given as input: when some of the target domains are not covered by the data, for example when taking a subset, the effect seems to appear earlier. But at the moment I am not using that, and I'm getting the NaNs after about half an hour of training, which seems weird. It's not visible in the text, but up to the NaN point the convergence looks like a normal GAN curve, first dipping and then steadily rising.

I'm testing with your toy datasets now to see if the issue persists.

Is this something you have encountered in your testing? Any thoughts?

Bug when running VEEGAN on the credit dataset

  • SDGym version:
  • Python version: 3.6
  • Operating System: OSX

Description

Thanks a lot for sharing the code. I tried to run the benchmark for VEEGAN with the credit dataset, but I got the following error. I couldn't find where the in-place operation happens; any ideas?

What I Did

Error computing scores for VEEGANSynthesizer on dataset credit - iteration 0
Traceback (most recent call last):
  File "<stdin>", line 8, in compute_benchmark
  File "/Users/zhaozilong/Documents/SDGym/sdgym/synthesizers/base.py", line 17, in fit_sample
    self.fit(data, categorical_columns, ordinal_columns)
  File "/Users/zhaozilong/Documents/SDGym/sdgym/synthesizers/veegan.py", line 148, in fit
    loss_g.backward(retain_graph=True)
  File "/anaconda3/envs/pytorch0.3/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/anaconda3/envs/pytorch0.3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [128, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

sdgym.benchmark function bug

Hello DAI-Lab, recently I have been researching data synthesis and I found this amazing project.
I got an error when running the sdgym.benchmark() function.

[screenshot: error traceback]

After tracing the code, I think the variable synthesized should be synthetic data produced by synthesizer.sample(). So we can solve this problem by replacing line 37:

synthesized = synthesizer(train, categoricals, ordinals)

into

synthesizer.fit(train, categoricals, ordinals)
synthesized = synthesizer.sample(train.shape[0])

Any way to generate JSON structure and npy files in the required format for other datasets?

Thank you very much for your prompt response about the intrusion dataset. I was wondering if you have any script to generate the JSON structure file and npy file for a given dataset. Were the JSON files of the datasets used for benchmarking hand-annotated? It would be great if you could share the script if you used one; it would help in benchmarking various datasets. Thank you.

Benchmark scores function, adding other dataset, adding external synthesizer

  • SDGym version: latest
  • Python version: 3.7
  • Operating System: mac

I am trying to execute this code:

import sdgym
from sdgym import benchmark
from sdgym.synthesizers import CTGANSynthesizer, TVAESynthesizer

all_synthesizers = [
    CTGANSynthesizer,
    TVAESynthesizer,
]
scores = benchmark(synthesizers=all_synthesizers)

  • Is it normal that this takes a lot of time? Do you have tips to make it faster?

  • How can I add another dataset? Imagine I have data.csv in my directory: how can I add it to the benchmark function? Or how can I use only the real datasets (from your examples)?

  • How can I add another synthesizer to the benchmark function when the synthesizer is a Python package (e.g. DataSynthesizer)?

Thank you for you help !

Add parameters for PrivBNSynthesizer

The current PrivBNSynthesizer implementation has two fixed arguments: the theta value passed to the underlying privbn binary and the maximum number of samples used from the training data.

These two values should be exposed as optional parameters.
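
A minimal sketch of the proposed signature (the default values shown are placeholders, not the currently hard-coded ones):

class PrivBNSynthesizer(Baseline):

    def __init__(self, theta=20, max_samples=25000):
        # previously fixed values, now exposed as optional parameters
        self.theta = theta
        self.max_samples = max_samples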

Travis CI tests failing

The Travis CI tests are failing because of the following:

The job exceeded the maximum log length, and has been terminated.

Example: this job

Benchmarking integer data

When benchmarking real datasets, sdgym now differentiates between continuous, categorical and ordinal data. Continuous data can be real or integer valued (e.g. Age). Currently, the synthesizers are allowed to produce real-valued features for integer-valued columns, which are then used as features to train the models. It would make sense to distinguish between these types and restrict the range that synthesized features for integer columns can take. @csala Would you agree that this change is an improvement to the benchmark? (If so, I can contribute a PR.)
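
A minimal sketch of the kind of post-processing being proposed, assuming metadata that records each column's subtype and observed range (all names here are illustrative):

import numpy as np

def restrict_integer_columns(synthetic, metadata):
    # round real-valued output and clip it into the observed integer range
    for column, info in metadata.items():
        if info.get('subtype') == 'integer':
            values = np.rint(synthetic[column])
            synthetic[column] = np.clip(values, info['min'], info['max']).astype(int)
    return synthetic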

rfc: replace synthesizer function with class

Description

SDGym's synthesizers all inherit from the Baseline class (or the BaseSynthesizer class in previous versions). Users can also provide custom synthesizer functions. The convenience of inheritance is demonstrated throughout SDGym's code base and has all sorts of other benefits. My suggestion would be to make the following changes:

  • All synthesizers should inherit from a synthesizer base class (Baseline)
  • All synthesizers should implement a separate fit and sample method

These changes provide consistency between SDGym's native and user-provided synthesizers and a clear distinction between fit and sample logic, at nearly no cost:

def synthesizer_function(real_data: dict[str, pandas.DataFrame],
                         metadata: sdv.Metadata) -> dict[str, pandas.DataFrame]:
    ...
    # do all necessary steps to learn from the real data
    # and produce new synthetic data that resembles it
    ...
    return synthetic_data

will become

from sdgym.synthesizers.base import Baseline


class MySynthesizer(Baseline):

    def fit(self, real_data: dict[str, pandas.DataFrame], metadata: sdv.Metadata) -> None:
        # do all necessary steps to learn from the real data
        ...

    def sample(self, n_samples: int) -> dict[str, pandas.DataFrame]:
        # produce new synthetic data that resembles the real data
        return synthetic_data

More interestingly, this structure allows capturing valuable metrics that are currently out of reach, related to fit/sampling time and complexity (time measurements, or maybe even this package). This way, SDGym would be able to benchmark this aspect of a synthesizer as well, which can be an important decision criterion for which synthesizer is best for a given use case: if the user expects to sample large quantities of data, then a longer fitting time would be acceptable in exchange for lower sampling complexity.
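
A minimal sketch of such a measurement, enabled by the separate fit and sample methods (names are illustrative):

import time

def timed_fit_sample(synthesizer, real_data, metadata, n_samples):
    # measure fitting and sampling time separately
    start = time.perf_counter()
    synthesizer.fit(real_data, metadata)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    synthetic_data = synthesizer.sample(n_samples)
    sample_seconds = time.perf_counter() - start

    return synthetic_data, fit_seconds, sample_seconds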

The code that needs to be changed for this is minimal; however, I wanted to make sure you see value in this before drafting a PR.

Big bug in class LegacySingleTableBaseline(SingleTableBaseline)

  • SDGym version:
  • Python version:
  • Operating System:

Description

When generating data, the class LegacySingleTableBaseline transforms labels into numbers. However, after transforming, the columns are not rearranged. This makes the model apply the one-hot encoding scheme to continuous columns and Gaussian Mixture models to categorical columns during training.

What I Did

I think you could fix the bug at line 131 of synthesizers/base.py by changing model_data = ht.transform(real_data) to model_data = ht.transform(real_data)[columns].

That will solve the bug.

Thanks

Boto Client Error: The provided token has expired

  • SDGym version: 0.4.2.dev0
  • Python version: 3.8.11
  • Operating System: Windows 10

Description

Trying to just get started running benchmarks

What I Did

import sdgym 

from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, Identity, Independent,
    MedGAN, PrivBN, TableGAN,
    Uniform, VEEGAN)

all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    Identity,
    Independent,
    MedGAN,
    PrivBN,
    TableGAN,
    Uniform,
    VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)

Exception has occurred: ClientError (note: full exception trace is shown but execution is paused at: )
An error occurred (ExpiredToken) when calling the ListObjects operation: The provided token has expired.
  File "C:\Users\crm0376\Projects\SDGym\sdgym\datasets.py", line 118, in get_available_datasets
    response = s3.list_objects(Bucket=bucket or BUCKET)
  File "C:\Users\crm0376\Projects\SDGym\sdgym\datasets.py", line 161, in get_dataset_paths
    datasets = get_available_datasets()['name'].tolist()
  File "C:\Users\crm0376\Projects\SDGym\sdgym\benchmark.py", line 321, in run
    datasets = get_dataset_paths(datasets, datasets_path, bucket, aws_key, aws_secret)
  File "C:\Users\crm0376\Projects\SDGym\testing_load.py", line 21, in (Current frame)
    scores = sdgym.run(synthesizers=all_synthesizers)

Track execution times and report variation

The leaderboard could be extended to keep track of the execution times of the synthesizers and report the variation of the measurements (by default, across the three iterations).

For some users, the execution time is an informative measure for evaluating the feasibility of using a synthesizer under resource limitations (and is also interesting in general). Reporting the variation of the measurements is necessary to be able to compare synthesizers.

Upgrade dependency ranges

The latest versions of compress-pickle and humanfriendly are not supported by SDGym:

Library          Upper bound (unsupported)   Latest release
pandas           1.1.5                       1.3.1
pomegranate      0.14.2                      0.14.5
compress-pickle  2                           2.0.1
humanfriendly    9                           9.2

We should investigate why and update the code if necessary to support them.

Provide clearer error message when privBayes.bin is not found

The PrivBNSynthesizer requires an executable that needs to be compiled from C++ code.

The current code of PrivBNSynthesizer only checks for the existence of the binary file and raises an AssertionError if it is not found, which is hard for users to understand.

The code should be changed to provide a user-friendly message indicating that privbayes needs to be compiled before it can be used and pointing at the corresponding documentation.
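
A minimal sketch of the friendlier check (the helper name and message wording are assumptions):

import os

def _check_privbn_binary(binary_path):
    # replace the bare AssertionError with an actionable message
    if not os.path.exists(binary_path):
        raise RuntimeError(
            'The privBayes.bin binary was not found. PrivBN requires the '
            'privbayes C++ code to be compiled before it can be used; '
            'please see the SDGym documentation for build instructions.'
        )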

Bug when using JSON configuration for multiple multi-table evaluation

  • SDGym version: 0.4.0
  • Python version: 3.8
  • Operating System: PopOS!

Description

When passing a JSON file as configuration for a multi-table synthesizer with more than one dataset, errors end up being produced after evaluating the first dataset.

What I Did

Given a JSON configuration file named HMA1.json with the following content:

{
    "name": "HMA1('gaussian', 'categorical_fuzzy')",
    "modalities": "multi-table",
    "synthesizer": "sdv.relational.HMA1",
    "init_kwargs": {
        "model_kwargs": {
            "default_distribution": "gaussian",
            "categorical_transformer": "categorical"
        },
        "metadata": "$metadata"
    },
    "fit_kwargs": {
        "tables": "$real_data"
    }
}

I run SDGym on multiple multi-table datasets:

sdgym run -s HMA1.json -d world_v1 trains_v1 -v

the following error is produced:

Traceback (most recent call last):
  File "/home/work/Projects/SDV/SDGym/sdgym/benchmark.py", line 79, in _compute_scores
    score = metric.compute(*metric_args)
  File "/home/work/.virtualenvs/SDGym/lib/python3.8/site-packages/sdmetrics/multi_table/multi_single_table.py", line 102, in compute
    return cls._compute(cls, real_data, synthetic_data, metadata, **kwargs)
  File "/home/work/.virtualenvs/SDGym/lib/python3.8/site-packages/sdmetrics/multi_table/multi_single_table.py", line 62, in _compute
    raise ValueError('`real_data` and `synthetic_data` must have the same tables')

Collect cached results from s3 bucket

  • SDGym version: 0.3.0

Description

In issue #79 we added a way to collect results from a collection of intermediate cached results to a single scores CSV file.

It should also be possible to read the intermediate cached results from an S3 bucket, and to store the resulting scores CSV to an S3 bucket.

Python:

sdgym.collect.collect_results(input_path='s3://my-bucket/path/to/results', output_file='s3://my-bucket/path/to/my/results.csv')

CLI

sdgym collect ... -i s3://my-bucket/path/to/results -o s3://my-bucket/path/to/my/results.csv

If the bucket is private, the AWS key and secret introduced in PR #74 should be used.

Add references to Synthesizers

There are multiple synthesizers implemented based on articles that are difficult to find. Would it be possible to add a section with these references?

CTGANSynthesizer is not working

Hi, CTGANSynthesizer is not working for me.

https://github.com/DAI-Lab/SDGym/blob/master/sdgym/synthesizers/ctgan.py
https://github.com/DAI-Lab/SDGym/blob/master/sdgym/synthesizers/utils.py


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-36-afe47608990b> in <module>
      3 synthesizer.fit(train_data = data2.values ,
      4                 categorical_columns = obj_col2,
----> 5                 ordinal_columns = ord_col2 , )
      6 print(synthesizer.sample(100).shape)

~/anaconda3/envs/py36/lib/python3.6/site-packages/sdgym/synthesizers/ctgan.py in fit(self, train_data, categorical_columns, ordinal_columns)
    291         self.transformer = BGMTransformer()
    292         self.transformer.fit(train_data, categorical_columns, ordinal_columns)
--> 293         train_data = self.transformer.transform(train_data)
    294 
    295         data_sampler = Sampler(train_data, self.transformer.output_info)

~/anaconda3/envs/py36/lib/python3.6/site-packages/sdgym/synthesizers/utils.py in transform(self, data)
    351             else:
    352                 col_t = np.zeros([len(data), info['size']])
--> 353                 col_t[np.arange(len(data)), current.astype('int32')] = 1
    354                 values.append(col_t)
    355 

IndexError: index 31 is out of bounds for axis 1 with size 31

So I checked this:

## id 2 / info size 31
col_t = np.zeros([len( data2.values), 31])
print(col_t.shape)
current = data2.values[:, 2]
print(current.astype("int32").shape)
col_t[np.arange(len(data2.values)), current.astype('int32')] = 1

(10269, 31)
(10269,)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-54-e6b983069c50> in <module>
      3 current = data2.values[:, 2]
      4 print(current.astype("int32").shape)
----> 5 col_t[np.arange(len(data2.values)), current.astype('int32')] = 1

IndexError: index 31 is out of bounds for axis 1 with size 31

current = data[:, id_]

So current is a 1-d array of the raw column values; it does not contain indices into the unique values of data2[:, id]. That is why I get an IndexError.

How should I solve it?

segmentation fault

  • SDGym version:
  • Python version:
  • Operating System:

Description

I ran my synthesizer within the benchmark and all datasets ran fine except for census, covtype, credit, intrusion and mnist12, where I get a 'segmentation fault'.

This is my wrapper function:

def ReplicasSynthesizer(real_data, categorical_columns, ordinal_columns):
    print("categorical columns:")
    print(categorical_columns)
    print("ordinal columns:")
    print(ordinal_columns)
    print(real_data.shape)
    print(real_data[0])
    df = pd.DataFrame(real_data)
    print("the columns are:")
    print(df.columns)
    df.dropna(axis=0, how='any', inplace=True)
    syn_df = synthesis_lib.synthesize(df)
    syn_np = df.to_numpy()
    return syn_np

leaderboard = sdgym.run(synthesizers=ReplicasSynthesizer, datasets=['mnist28'])
print(leaderboard)

Running on mnist28 dataset:

sh-4.2$ python replicas_wrapper.py 
categorical columns:
[0, 1, 2, 3, ..., 783, 784]
ordinal columns:
[]
(60000, 785)
[0 0 0 ... 0 0 3]
the columns are:
RangeIndex(start=0, stop=785, step=1)
Segmentation fault

Benchmark MedganSynthesizer and VEEGANSynthesizer raise IndexError

MedganSynthesizer and VEEGANSynthesizer raise IndexError with the intrusion dataset.

What I did:

In [1]: from sdgym import benchmark                                                                  

In [2]: from sdgym.synthesizers import MedganSynthesizer, VEEGANSynthesizer                          

In [3]: medgan = MedganSynthesizer()                                                                 

In [4]: benchmark(medgan.fit_sample, datasets=['intrusion'])                                         
INFO - Evaluating dataset intrusion
INFO - Fitting MedganSynthesizer
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-fe34e4587459> in <module>
----> 1 benchmark(medgan.fit_sample, datasets=['intrusion'])

~/Projects/SDGym/sdgym/benchmark.py in benchmark(synthesizer, datasets, repeat)
     35 
     36         for iteration in range(repeat):
---> 37             synthesized = synthesizer(train, categoricals, ordinals)
     38             scores = evaluate(train, test, synthesized, meta)
     39             scores['dataset'] = name

~/Projects/SDGym/sdgym/synthesizers/base.py in fit_sample(self, data, categorical_columns, ordinal_columns)
     15     def fit_sample(self, data, categorical_columns=tuple(), ordinal_columns=tuple()):
     16         LOGGER.info("Fitting %s", self.__class__.__name__)
---> 17         self.fit(data, categorical_columns, ordinal_columns)
     18 
     19         LOGGER.info("Sampling %s", self.__class__.__name__)

~/Projects/SDGym/sdgym/synthesizers/medgan.py in fit(self, data, categorical_columns, ordinal_columns)
    155         self.transformer = GeneralTransformer()
    156         self.transformer.fit(data, categorical_columns, ordinal_columns)
--> 157         data = self.transformer.transform(data)
    158         dataset = TensorDataset(torch.from_numpy(data.astype('float32')).to(self.device))
    159         loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True, drop_last=True)

~/Projects/SDGym/sdgym/synthesizers/utils.py in transform(self, data)
    151             else:
    152                 col_t = np.zeros([len(data), info['size']])
--> 153                 col_t[np.arange(len(data)), col.astype('int32')] = 1
    154                 data_t.append(col_t)
    155                 self.output_info.append((info['size'], 'softmax'))

IndexError: index 65 is out of bounds for axis 1 with size 64

In [5]: veegan = VEEGANSynthesizer()                                                                 

In [6]: benchmark(veegan.fit_sample, datasets=['intrusion'])                                         
INFO - Evaluating dataset intrusion
INFO - Fitting VEEGANSynthesizer
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-2c2f4cbda073> in <module>
----> 1 benchmark(veegan.fit_sample, datasets=['intrusion'])

~/Projects/SDGym/sdgym/benchmark.py in benchmark(synthesizer, datasets, repeat)
     35 
     36         for iteration in range(repeat):
---> 37             synthesized = synthesizer(train, categoricals, ordinals)
     38             scores = evaluate(train, test, synthesized, meta)
     39             scores['dataset'] = name

~/Projects/SDGym/sdgym/synthesizers/base.py in fit_sample(self, data, categorical_columns, ordinal_columns)
     15     def fit_sample(self, data, categorical_columns=tuple(), ordinal_columns=tuple()):
     16         LOGGER.info("Fitting %s", self.__class__.__name__)
---> 17         self.fit(data, categorical_columns, ordinal_columns)
     18 
     19         LOGGER.info("Sampling %s", self.__class__.__name__)

~/Projects/SDGym/sdgym/synthesizers/veegan.py in fit(self, train_data, categorical_columns, ordinal_columns)
    106         self.transformer = GeneralTransformer(act='tanh')
    107         self.transformer.fit(train_data, categorical_columns, ordinal_columns)
--> 108         train_data = self.transformer.transform(train_data)
    109         dataset = TensorDataset(torch.from_numpy(train_data.astype('float32')).to(self.device))
    110         loader = DataLoader(dataset, batch_size=self.batch_size, shuffle=True, drop_last=True)

~/Projects/SDGym/sdgym/synthesizers/utils.py in transform(self, data)
    151             else:
    152                 col_t = np.zeros([len(data), info['size']])
--> 153                 col_t[np.arange(len(data)), col.astype('int32')] = 1
    154                 data_t.append(col_t)
    155                 self.output_info.append((info['size'], 'softmax'))

IndexError: index 65 is out of bounds for axis 1 with size 64

col_t is being indexed by the raw values of the column instead of their indices.
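
A minimal sketch of the kind of fix being suggested: map the raw category values to positional indices before one-hot encoding (illustrative, not the actual patch):

import numpy as np

values = np.unique(col)
value_to_index = {value: i for i, value in enumerate(values)}
col_idx = np.array([value_to_index[value] for value in col])

col_t = np.zeros([len(col), len(values)])
col_t[np.arange(len(col)), col_idx] = 1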

CTGAN failing for intrusion dataset

Hi @csala, thank you very much for open-sourcing the package. The baseline models are working on all the datasets except intrusion, where the following error is raised:

IndexError: index 65 is out of bounds for axis 1 with size 64.

Can you please help me out with this error. Thank you.

Add a way to collect cached results

  • SDGym version: 0.3.0

Description

Apart from producing a single dataframe or CSV file with the scores obtained by all the synthesizers, SDGym has the option to store intermediate results, scores and error logs as the different tasks run, kept inside the cache_dir. However, if the sdgym process is interrupted for some reason, there is no way to find all the intermediate results and put them back together as a single CSV file.

It would be interesting to have a collect_results function and an sdgym collect command that would do this job and allow producing a single scores CSV file from a collection of intermediate cached results.
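
A minimal sketch of what such a collect_results function could look like, assuming the intermediate scores are cached as individual CSV files inside cache_dir (the names are assumptions):

import glob
import os
import pandas as pd

def collect_results(cache_dir, output_file):
    # gather every cached score file and concatenate them into one CSV
    paths = glob.glob(os.path.join(cache_dir, '*.csv'))
    scores = pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)
    scores.to_csv(output_file, index=False)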

Automatically detect number of workers

  • SDGym version: 0.3.1

Description

SDGym can automatically determine how many workers to use, based on the available GPUs or CPUs on the current machine. If the machine has GPUs, use the number of GPUs as the workers value. Otherwise use the number of CPUs. The user can request to automatically detect the number of workers by passing in workers=-1.
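
A minimal sketch of the proposed detection, assuming torch is used for GPU counting (the fallback logic is an assumption):

import multiprocessing

def detect_workers(workers):
    # workers=-1 requests automatic detection
    if workers == -1:
        try:
            import torch
            if torch.cuda.is_available():
                return torch.cuda.device_count()
        except ImportError:
            pass
        return multiprocessing.cpu_count()
    return workers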

Problem reshaping the data

Hi, I'm having trouble running the code because of the call gm.fit(data[:, id_].reshape([-1, 1])) in the fit method of the BGMTransformer class, line 310.

It seems that the indexing expression data[:, id_].reshape([-1, 1]) leads to the error '(slice(None, None, None), 2)' is an invalid key. I tried to reshape by other methods, like using loc or iloc, which work from the terminal, but I get the same error when I change the function in the script. I don't know what to do now.

I loaded my own dataset and tried to run it, but I can't because of the error. This is what I am doing:

from sdgym.synthesizers.tvae import TVAESynthesizer
tvae = TVAESynthesizer()
tvae.fit(my_data, my_discrete_columns, my_ordinal_columns)

List Synthesizer names

  • SDGym version: 0.3.0
  • Python version:
  • Operating System:

Description

Add a function and CLI command to list the available synthesizer names.

Citation for the Code and Paper

Hi,

Thank you for making this repository available.

Can you please add a way to cite this repository, both code and paper, as the repo has moved on a bit from the Conditional Table GAN paper?
