diyago / gan-for-tabular-data

GANs are well known for their success in realistic image generation. However, they can also be applied to tabular data generation. We will review and examine some recent papers about tabular GANs in action.

Home Page: https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342

License: Apache License 2.0

Python 100.00%

adversarial-filtering deep-learning feature-engineering gan gans machine-learning python tabular-data train-dataframe

gan-for-tabular-data's Introduction

Hello, I'm Insaf Ashrapov, Lead Data Scientist, Sberbank 👋

Top rank on kaggle is 417.

My portfolio with some of my projects, which include Kaggle kernels, pet-projects, articles, etc.

gan-for-tabular-data's Issues

ImportError: cannot import name '_CTGANSynthesizer'

I encountered this problem. Could you give me some advice?
Traceback (most recent call last):
File "/home/zw/data/GAN-for-tabular-data-master/Research/run_experiment.py", line 10, in
from utils import save_exp_to_file, extend_gan_train, extend_from_original
File "/home/zw/data/GAN-for-tabular-data-master/Research/utils.py", line 8, in
from ctgan import _CTGANSynthesizer
ImportError: cannot import name '_CTGANSynthesizer'
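
Before changing the import, it can help to check which names the installed `ctgan` module actually exposes. The following is a minimal sketch using only the standard library; `names_matching` is a hypothetical helper, and the filter string is just an example.

```python
import importlib


def names_matching(module_name, contains):
    """Return sorted attribute names of a module that contain a substring.

    Dunder names are skipped; single-underscore names are kept, because the
    failing import above asks for a private name.
    """
    module = importlib.import_module(module_name)
    needle = contains.lower()
    return sorted(
        name
        for name in dir(module)
        if not (name.startswith("__") and name.endswith("__"))
        and needle in name.lower()
    )


# Demonstrated on a standard-library module so the snippet runs anywhere;
# for this issue one would call names_matching("ctgan", "synth") instead.
print(names_matching("json", "dump"))
```

Checking `importlib.import_module("ctgan").__file__` additionally shows whether Python picked up a pip-installed package or a local directory with the same name.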

LGBMClassifier.fit() got an unexpected keyword argument 'early_stopping_rounds'

Hello, I keep getting this error when running GANGenerator.
Is there a stable version I can install to avoid it?
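
This error is consistent with LightGBM 4.x, where `early_stopping_rounds` is no longer an argument of `fit()` and early stopping is configured through the `lightgbm.early_stopping` callback instead. A hedged sketch of a version-tolerant wrapper follows. `early_stopping_fit_kwargs` is a hypothetical helper, not part of tabgan or LightGBM, and the callback factory is passed in by the caller so the snippet itself needs only the standard library.

```python
import inspect


def early_stopping_fit_kwargs(fit_fn, rounds, make_callback):
    """Build keyword arguments for fit() that request early stopping.

    If fit_fn still accepts `early_stopping_rounds`, use it directly;
    otherwise fall back to a callback built by make_callback(rounds),
    for example lightgbm.early_stopping.
    """
    params = inspect.signature(fit_fn).parameters
    if "early_stopping_rounds" in params:
        return {"early_stopping_rounds": rounds}
    return {"callbacks": [make_callback(rounds)]}


# Stand-ins for the old and new fit() signatures (illustration only).
def old_fit(X, y, eval_set=None, early_stopping_rounds=None):
    pass


def new_fit(X, y, eval_set=None, callbacks=None):
    pass


print(early_stopping_fit_kwargs(old_fit, 50, lambda r: ("early_stopping", r)))
print(early_stopping_fit_kwargs(new_fit, 50, lambda r: ("early_stopping", r)))
```

With LightGBM installed, a call could look like `model.fit(X, y, eval_set=[(X_val, y_val)], **early_stopping_fit_kwargs(model.fit, 50, lightgbm.early_stopping))`. Pinning an older LightGBM release is the other common workaround; which versions tabgan supports is not stated in the issue.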

second args in generate_data_pipe cannot be left None

Hi,

I have tried your example on Timeseries CTGAN generation. However, I am getting an error saying

     80 
     81         if any(X.index != y.index):
---> 82             raise ValueError("`X` and `y` both have indexes, but they do not match.")
     83         if X.shape[0] != y.shape[0]:
     84             raise ValueError("The length of X is " + str(X.shape[0]) + " but length of y is " + str(y.shape[0]) + ".")

ValueError: `X` and `y` both have indexes, but they do not match.

So I am guessing the second argument cannot be left as None. Can you elaborate on what the target argument is?
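
The check in the traceback compares the index labels of `X` and `y`, so it can fail even when both have the same number of rows, for instance when the features keep the labels left over from an earlier split while the target has a fresh index. A minimal sketch of that situation, with made-up data, and of resetting both indexes:

```python
import pandas as pd

# Features keep the shuffled index left over from an earlier split/filter step.
X = pd.DataFrame({"a": [10, 20, 30]}, index=[7, 2, 5])
# The target was built separately and has a default 0..n-1 index.
y = pd.DataFrame({"target": [0, 1, 0]})

print((X.index != y.index).any())  # True: the check in the traceback would fail

# Give both objects the same positional index before passing them on.
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

print((X.index != y.index).any())  # False
```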

Parameter name

In models.py:
class Generator(Module):
    def __init__(self, embedding_dim, gen_dims, data_dim):

Shouldn't it be:

class Generator(Module):
    def __init__(self, embedding_dim, gen_dim, data_dim):

Is gen_dims the same as gen_dim?

Dependency issue with ForestDiffusion Generator

After installing the package, there are several import errors like
module ForestDiffusion not found (probably because the directory is named _ForestDiffusion)
module xgboost and catboost not found

about def extend_from_original

Hi Diyago,

Thanks for sharing the code! It was very well explained and very well written. I have a small comment. In utils.py, the docstring of the function def extend_from_original is the same as in def_gen_train (it looks like a copy-paste). We sample from the original set in the first one and use "CTGANSynthesizer()" in the second one.

def extend_from_original(x_train, y_train, x_test, cat_cols, gen_x_times=1.2):
    """
    Extends train by generating new data by GAN. (not true, right?)
    :param x_train: train dataframe
    :param y_train: target for train dataframe
    :param x_test: dataframe
    :param cat_cols: List of categorical columns
    :param gen_x_times: Factor for which initial dataframe should be increased
    :param cat_cols: List of categorical columns
    :return: extended train with target

Fail to install

!pip install tabgan

Error:

Requirement already satisfied: tabgan in /usr/local/lib/python3.7/dist-packages (1.2.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from tabgan) (1.21.5)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from tabgan) (4.63.0)
Requirement already satisfied: torch in /usr/local/lib/python3.7/dist-packages (from tabgan) (1.10.0+cu111)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from tabgan) (1.3.5)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from tabgan) (2.8.2)
Requirement already satisfied: torchvision in /usr/local/lib/python3.7/dist-packages (from tabgan) (0.11.1+cu111)
Requirement already satisfied: lightgbm in /usr/local/lib/python3.7/dist-packages (from tabgan) (2.2.3)
Requirement already satisfied: category-encoders in /usr/local/lib/python3.7/dist-packages (from tabgan) (2.4.0)
Collecting scikit-learn==0.23.2
  Using cached scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.23.2->tabgan) (1.1.0)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.23.2->tabgan) (1.4.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn==0.23.2->tabgan) (3.1.0)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.7/dist-packages (from category-encoders->tabgan) (0.5.2)
Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.7/dist-packages (from category-encoders->tabgan) (0.10.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->tabgan) (2018.9)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from patsy>=0.5.1->category-encoders->tabgan) (1.15.0)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch->tabgan) (3.10.0.2)
Requirement already satisfied: pillow!=8.3.0,>=5.3.0 in /usr/local/lib/python3.7/dist-packages (from torchvision->tabgan) (7.1.2)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.4 requires scikit-learn>=1.0.0, but you have scikit-learn 0.23.2 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.23.2 which is incompatible.
Successfully installed scikit-learn-0.23.2
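
The log shows pip replacing scikit-learn 1.0.2 with the pinned 0.23.2, which then conflicts with yellowbrick and imbalanced-learn. To see what ended up installed, a small sketch such as the following may help. `installed_version` is a hypothetical helper, and `importlib.metadata` needs Python 3.8 or newer (the log above is from Python 3.7).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages named in the pip log above.
for name in ("tabgan", "scikit-learn", "yellowbrick", "imbalanced-learn"):
    print(f"{name}: {installed_version(name)}")
```

Installing tabgan into a fresh virtual environment is one way to keep the pinned scikit-learn from affecting other packages.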

Difference between OriginalGenerator and GANGenerator

Hello,

First of all, I apologize for asking questions so frequently. I am getting a lot of use out of your code :). Thanks in advance!

My question: while reading your arXiv paper, I saw that you describe two different tabular GANs (TGAN and CTGAN).
Does OriginalGenerator correspond to TGAN, and
GANGenerator to CTGAN?

I am writing down notes to myself so that I can understand better.

Thank you!!

Trying to run how to use library part

Describe the bug
The how-to tutorial is not working for me.

To Reproduce
Steps to reproduce the behavior:

Install with pip and copy-paste the example code from the how-to section.

Expected behavior
I get an error saying that I passed too many positional arguments.

Screenshots

Desktop (please complete the following information):

OS: Windows

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

We do a preprocessing step before using the GAN.
First, we take the tabular data that contains strings and booleans and convert it all to floats.
Second, we split the data into train and test (to keep a test sample from the original data).
Third, we split the test set again for TabGAN.
Every time we do that, we get:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
This comes from sklearn -> validation.py.
The data we send to the model contains no NaNs and no infinite values.

Is there a way to solve this issue?
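
Since the message covers NaN and infinite values alike, it may be worth checking the exact frames passed to the generator (after conversion and splitting) for both. A minimal sketch with made-up data; `non_finite_report` is a hypothetical helper:

```python
import numpy as np
import pandas as pd


def non_finite_report(df):
    """Count NaN and +/-inf entries per numeric column; keep columns with any."""
    numeric = df.select_dtypes(include="number")
    values = numeric.to_numpy(dtype="float64")
    counts = pd.DataFrame(
        {
            "nan": np.isnan(values).sum(axis=0),
            "inf": np.isinf(values).sum(axis=0),
        },
        index=numeric.columns,
    )
    return counts[(counts["nan"] > 0) | (counts["inf"] > 0)]


df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [1.0, np.inf, -np.inf], "c": [1, 2, 3]}
)
print(non_finite_report(df))
```

If the report is empty for every input frame, the non-finite values are probably introduced later, for example by index misalignment between the train frame and the target.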

Sampler demo freezes at same point, multiple installs. Windows 10

I have tried several installs on two different machines.

Most recent, Win10, virtual env with py 3.8.11 and only your requirements.txt modules installed, plus Jupyter...

Most recent attempt:
created virtual environment with python 3.8
(gann) C:\Users\klaus>python -V
Python 3.8.11

ran a renamed requirements.txt file rqr0.txt - no torch 1.6
(gann) C:\Users\klaus>python -m pip install -r rqr0.txt
ERROR: Could not find a version that satisfies the requirement torch==1.6.0 (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2, 1.7.1, 1.8.0, 1.8.1, 1.9.0)
ERROR: No matching distribution found for torch==1.6.0

So I installed torch 1.6:
(gann) C:\Users\klaus>conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
Executing transaction: done

installed required modules
(gann) C:\Users\klaus>python -m pip install -r rqr0.txt

installed tabgan
(gann) C:\Users\klaus>pip install tabgan

ON ALL MACHINES, ALL INSTALLS, the sampler freezes at the same point with no error message.

Fitting CTGAN transformers for each column: 100% 5/5 [00:00<00:00, 9.31it/s]
Training CTGAN, epochs:: 33% 167/500 [00:04<00:06, 51.27it/s]
Fitting CTGAN transformers for each column: 100% 5/5 [00:00<00:00,10.67it/s]
Training CTGAN, epochs:: 23%

I have tried python 3.9 installs on two different computers.
What further info can I provide to give clues as to what is happening?
Thanks.

not running on GPU

Describe the bug
The library isn't using GPU when cuda is available.

To Reproduce
Run the sample script. Check GPU usage - no GPU used.

Expected behavior
The library should recognise when cuda is available, and use GPU for the calculations.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):
Problem occurs on Windows and Google Colab. May possibly occur on other platforms.

Additional context
Add any other context about the problem here.
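
A first diagnostic step is to confirm that the installed PyTorch build can see the GPU at all, since a CPU-only build reports no CUDA device regardless of the hardware. A minimal sketch; `pick_device` is a hypothetical helper, and whether tabgan exposes a device option is not stated in the issue.

```python
def pick_device():
    """Return "cuda" if PyTorch is installed and sees a CUDA device, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```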

check

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'

Describe the bug
A clear and concise description of what the bug is.

TypeError Traceback (most recent call last)
Input In [1], in <cell line: 11>()
8 test = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
10 # generate data
---> 11 new_train1, new_target1 = OriginalGenerator().generate_data_pipe(train, target, test, )
12 new_train2, new_target2 = GANGenerator().generate_data_pipe(train, target, test, )
14 # example with all params defined

File ~/.conda/envs/medicine/lib/python3.8/site-packages/tabgan/abc_sampler.py:77, in SampleData.generate_data_pipe(self, train_df, target, test_df, deep_copy, only_adversarial, use_adversarial, only_generated_data)
75 if use_adversarial:
76 logging.info("Applying adversarial filtering")
---> 77 new_train, new_target = generator.adversarial_filtering(
78 new_train, new_target, test_df
79 )
80 gc.collect()
82 logging.info("Total finishing, returning data")

File ~/.conda/envs/medicine/lib/python3.8/site-packages/tabgan/sampler.py:204, in SamplerOriginal.adversarial_filtering(self, train_df, target, test_df)
202 self._validate_data(train_df, target, test_df)
203 train_df[self.TEMP_TARGET] = target
--> 204 ad_model.adversarial_test(test_df, train_df.drop(self.TEMP_TARGET, axis=1))
206 train_df["test_similarity"] = ad_model.trained_model.predict(
207 train_df.drop(self.TEMP_TARGET, axis=1)
208 )
209 train_df.sort_values("test_similarity", ascending=False, inplace=True)

File ~/.conda/envs/medicine/lib/python3.8/site-packages/tabgan/adversarial_model.py:63, in AdversarialModel.adversarial_test(self, left_df, right_df)
55 concated = pd.concat([left_df, right_df])
56 lgb_model = Model(
57 cat_validation=self.cat_validation,
58 encoders_names=self.encoders_names,
(...)
61 model_params=self.model_params,
62 )
---> 63 train_score, val_score, avg_num_trees = lgb_model.fit(
64 concated.drop("gt", axis=1), concated["gt"]
65 )
66 self.metrics = {"train_score": train_score,
67 "val_score": val_score,
68 "avg_num_trees": avg_num_trees}
69 self.trained_model = lgb_model

File ~/.conda/envs/medicine/lib/python3.8/site-packages/tabgan/adversarial_model.py:176, in Model.fit(self, X, y)
174 mean_score_train = np.mean(self.scores_list_train)
175 mean_score_val = np.mean(self.scores_list_val)
--> 176 avg_num_trees = int(np.mean(self.models_trees))
178 return mean_score_train, mean_score_val, avg_num_trees

File <array_function internals>:180, in mean(*args, **kwargs)

File ~/.conda/envs/medicine/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3474, in mean(a, axis, dtype, out, keepdims, where)
3471 else:
3472 return mean(axis=axis, dtype=dtype, out=out, **kwargs)
-> 3474 return _methods._mean(a, axis=axis, dtype=dtype,
3475 out=out, **kwargs)

File ~/.conda/envs/medicine/lib/python3.8/site-packages/numpy/core/_methods.py:179, in _mean(a, axis, dtype, out, keepdims, where)
176 dtype = mu.dtype('f4')
177 is_float16_result = True
--> 179 ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
180 if isinstance(ret, mu.ndarray):
181 ret = um.true_divide(
182 ret, rcount, out=ret, casting='unsafe', subok=False)
TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'

To Reproduce
Steps to reproduce the behavior:

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: [e.g. Windows, Ubuntu]
Ubuntu/Linux server with python 3.8 and on jupyter notebook
Additional context
Add any other context about the problem here.

I was trying to run tabgan on your example with random data, but it does not seem to work.
I also tried my own tabular data (no categorical variables and no NaNs, though it does contain 0s),
but it fails in the same way and prints the same result.

Do you have any idea why it is not working?

Thank you in advance.
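
The last frames of the traceback show `np.mean(self.models_trees)` trying to add `None` values, which suggests that the entries of `models_trees` were `None` (for example, if the underlying model reported no best iteration). That reading is an inference from the traceback, not something confirmed in the issue. The failure and one defensive way of averaging can be sketched as follows; `mean_ignoring_none` is a hypothetical helper.

```python
import numpy as np


def mean_ignoring_none(values, default=0.0):
    """Average the entries that are not None; return `default` if none remain."""
    kept = [v for v in values if v is not None]
    return float(np.mean(kept)) if kept else default


# np.mean on a list of None values fails the same way as in the traceback.
try:
    np.mean([None, None])
except TypeError as exc:
    print("TypeError:", exc)

print(mean_ignoring_none([None, 120, 80]))  # 100.0
print(mean_ignoring_none([None, None]))     # 0.0
```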

TypeError w/ Boolean Data

I have a dataset with boolean columns that represent the bits of an IP address. At some point while training the CTGAN model, execution halts with the following error from NumPy:

"TypeError: numpy boolean subtract, the - operator, is not supported, use the bitwise_xor, the ^ operator, or the logical_xor function instead."

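NumPy refuses to subtract boolean arrays, so a possible workaround is to cast boolean columns to integers before passing the frame to the generator. A minimal sketch with made-up bit columns (the column names are illustrative):

```python
import pandas as pd

bits = pd.DataFrame({"bit0": [True, False, True], "bit1": [False, False, True]})

# Subtracting boolean NumPy arrays reproduces the reported error.
try:
    bits["bit0"].to_numpy() - bits["bit1"].to_numpy()
except TypeError as exc:
    print("TypeError:", exc)

# Cast boolean columns to small integers; other columns are left untouched.
bool_cols = bits.select_dtypes(include="bool").columns
bits = bits.astype({col: "int8" for col in bool_cols})
print(bits.dtypes)
```

Whether the resulting 0/1 columns are better treated as numeric or listed in `cat_cols` is a modelling choice the issue does not settle.
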
pip install scikit-learn version issue

I am currently having issues installing from the Mac terminal with the pip command. I have scikit-learn 1.0 installed.

Collecting scikit-learn==0.23.2
Using cached scikit-learn-0.23.2.tar.gz (7.2 MB)
Installing build dependencies ... error
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> [2515 lines of output]

Some issues arose when running Tab-GAN: 1) Managing categorical variables. 2) Batch size problem

Hi, I am new to Tab-GAN and I am trying to use it for my thesis to handle class imbalance in my dataset.
However, I have faced some problems while using it:

In my example I have tabular data containing both numerical and categorical features; specifically, the "Income_Category" column contains strings:

In the example shown for the TabGAN application the value of cat_cols is None; in my case, however, I should list the categorical columns (i.e. cat_cols = ["Income_Category"]).
However, applying TabGAN then raises an error: "Input X contains Nans"

This happens even though no NaNs are present in the input DataFrame:

Moreover, GANGenerator sometimes cannot handle string values at all. In that case I should encode the categorical values: what kind of preprocessing is needed, one-hot or ordinal encoding, and how should I modify the cat_cols parameter?

Trying to modify the batch_size parameter of the GAN raised this error:

What could be the cause?
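
One way to approach the string-column question is to ordinal-encode the column and keep it listed in `cat_cols`. This is a sketch under assumptions: the category labels and the `Credit_Limit` column below are invented for illustration, and whether tabgan expects integer-coded categoricals is not confirmed in the issue.

```python
import pandas as pd

# Hypothetical data; only the column name "Income_Category" comes from the issue.
df = pd.DataFrame(
    {
        "Income_Category": ["Less than $40K", "$40K - $60K", "Less than $40K", None],
        "Credit_Limit": [1200.0, 3400.0, 2100.0, 1800.0],
    }
)

# Make missing categories explicit so no NaN reaches the model.
df["Income_Category"] = df["Income_Category"].fillna("Unknown")

# Ordinal-encode: each distinct string becomes an integer code.
categories = df["Income_Category"].astype("category")
mapping = dict(enumerate(categories.cat.categories))  # code -> original label
df["Income_Category"] = categories.cat.codes

cat_cols = ["Income_Category"]  # still declared as categorical for the generator
print(df)
print(mapping)
```

Keeping `mapping` around makes it possible to translate generated integer codes back into the original labels afterwards.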

Is it OK for regression tasks?

Hi,
Thank you for this library, it's really useful and great for tabular data. I'm especially interested in using it for a regression task. I have adapted the initial code and it gave some results, but I'm not sure whether it's good for a prediction task. I also didn't quite understand the difference between OriginalGenerator and GANGenerator. Could you explain it? Thank you! :-)

Hello, when I run "python ./Research/run_experiment.py", I see this error. Please help me, thanks!

PS E:\GAN-for-tabular-data-master\GAN-for-tabular-data-master> python ./Research/run_experiment.py
Traceback (most recent call last):
File "./Research/run_experiment.py", line 10, in
from utils import save_exp_to_file, extend_gan_train, extend_from_original
File "E:\GAN-for-tabular-data-master\GAN-for-tabular-data-master\Research\utils.py", line 8, in
from ctgan import _CTGANSynthesizer
ImportError: cannot import name 'CTGANSynthesizer' from 'ctgan' (E:\GAN-for-tabular-data-master\GAN-for-tabular-data-master\Research\ctgan_init.py)

all sample codes not working till epoch end

I have tried all of the sample code on Linux, Mac and AWS SageMaker, and every run raises an error message during training. Please provide an update or bug fix for your code.

support Python>=3.7

The 1.0.3 release requires Python 3.6.10;
newer versions cannot install the wheel.
Also, there is no tar.gz file in the pip distribution, so it cannot be installed without the wheel file.

IntCastingNaNError Despite No NaN values

I made sure to filter out for NaN values with several different methods such as

def isNaN(num):
return num != num

Yet despite all methods assuring me there are no NaN values, I get the "IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer" error on line 182 of astype.py

My code is as follows in colab and I attached the dataset

Import

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
import pandas as pd
import io
import os
import requests
import numpy as np
from sklearn import metrics

Upload

df = pd.read_excel(io.BytesIO(uploaded['Nan_Error_Data.xlsx']))

COLS_USED = ['v', 'vw', 'o', 'c', 'h', 'l', 'n']
COLS_TRAIN = ['v', 'vw', 'o', 'h', 'l', 'n']

df = df[COLS_USED]

Split into training and test sets

df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(
df.drop("c", axis=1),
df["c"],
test_size=0.20,
#shuffle=False,
random_state=42,
)

Create dataframe versions for tabular GAN

df_x_test, df_y_test = df_x_test.reset_index(drop=True), df_y_test.reset_index(drop=True)
df_y_train = pd.DataFrame(df_y_train)
df_y_test = pd.DataFrame(df_y_test)

Pandas to Numpy

x_train = df_x_train.values
x_test = df_x_test.values
y_train = df_y_train.values
y_test = df_y_test.values

Import

from tabgan.sampler import GANGenerator
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

Parameters

gen_x, gen_y = GANGenerator(gen_x_times=1.1, cat_cols=None,
bot_filter_quantile=0.001, top_filter_quantile=0.999,
is_post_process=True,
adversarial_model_params={
"metrics": "rmse", "max_depth": 2, "max_bin": 100,
"learning_rate": 0.02, "random_state":
42, "n_estimators": 500,
}, pregeneration_frac=2, only_generated_data=False,
gan_params = {"batch_size": 500, "patience": 25,
"epochs" : 500,}).generate_data_pipe(df_x_train, df_y_train,
df_x_test, deep_copy=True, only_adversarial=False,
use_adversarial=True)

Nan_Error_Data.xlsx

Why is the data generated using epoch=10 similar to the data generated using epoch=5000 for classification?

training CTGAN stops in the middle (around 24%)

Describe the bug
A clear and concise description of what the bug is.

While training CTGAN, it stops at a certain percentage of the epochs (mostly 20~30%).
To Reproduce
Steps to reproduce the behavior:

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behavior
A clear and concise description of what you expected to happen.

Hello!
As you suggested in my previous issue, I tried running your example code and it works to a certain extent.
tqdm stops at 24% (or somewhere close to that number) while training CTGAN (screenshot added).
No other messages appear, but the code finishes running. Am I missing something here?

Thank you in advance!

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: [e.g. Windows, Ubuntu]

jupyter notebook on ubuntu/linux server
google colab

Additional context
Add any other context about the problem here.
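
A progress bar that ends partway through without any error is what patience-based early stopping looks like, and the examples elsewhere on this page pass `"patience": 25` in `gan_params`. Whether that is what happens here is an assumption; the toy loop below is not tabgan's training code and only illustrates how a patience rule ends a run before the epoch budget is used up.

```python
def train_with_patience(losses, epochs, patience):
    """Run up to `epochs` steps, stopping once the loss has not improved
    for `patience` consecutive epochs. Returns the number of epochs run."""
    best = float("inf")
    stale = 0
    for epoch in range(epochs):
        loss = losses[epoch]
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:
            return epoch + 1
    return epochs


# Synthetic loss curve: improves for 100 epochs, then plateaus.
losses = [1.0 / (e + 1) if e < 100 else 0.5 for e in range(500)]
print(train_with_patience(losses, epochs=500, patience=25))  # 125
```

In this toy run the loop stops after 125 of 500 epochs, i.e. at 25% of the progress bar, without raising anything.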

Dear Author, may I know the ctgan version for the installation? I am getting an error: from ctgan import _CTGANSynthesizer ImportError: cannot import name '_CTGANSynthesizer'


ValueError: Input train dataframe already have 0 column, consider removing it

Describe the bug
Traceback (most recent call last):
File "c:\Users\yrouis2\Desktop\KCA-codes\GANs\TestGANs.py", line 30, in
new_train3, new_target3 = GANGenerator(gen_x_times=1.1, cat_cols=None, bot_filter_quantile=0.001,
File "C:\Users\yrouis2\Desktop\KCA-codes\GANs\gans\lib\site-packages\tabgan\abc_sampler.py", line 41, in generate_data_pipe
new_train, new_target, test_df = generator.preprocess_data(train_df.copy(), target.copy(), test_df)
File "C:\Users\yrouis2\Desktop\KCA-codes\GANs\gans\lib\site-packages\tabgan\sampler.py", line 90, in preprocess_data
raise ValueError(
ValueError: Input train dataframe already have 0 column, consider removing it

Hello, first of all thank you for sharing the code!
I have an issue: I passed my dataset to the generator but get this error every time. Does that mean that my dataset contains a column of zeros in all rows? If so, why is that a problem?
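
The message most likely refers to a column named `0` rather than a column filled with zeros. pandas assigns integer column names when a DataFrame is built from an array without `columns=`, so train and target frames built that way both contain a column called `0`. Reading the error as a name clash between the two is an assumption based on the message text; the situation and one fix can be sketched with made-up data:

```python
import numpy as np
import pandas as pd

features = np.random.rand(5, 3)
labels = np.random.randint(0, 2, size=5)

# Without explicit names, pandas labels the columns 0, 1, 2, ...
train = pd.DataFrame(features)
target = pd.DataFrame(labels)
print(list(train.columns), list(target.columns))  # [0, 1, 2] [0]

# Explicit, distinct names remove the clash.
train = pd.DataFrame(features, columns=["f0", "f1", "f2"])
target = pd.DataFrame(labels, columns=["target"])
print(set(train.columns) & set(target.columns))  # set()
```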

Mistake in Readme

Hi,
I think I found a mistake in the readme file of this repository.
For the parameters passed to the function, the description of 'top_filter_quantile' should be "top quantile for postprocess filtering" instead of "bottom quantile for postprocess filtering".

It's not a big issue. I don't know the protocols of this repo. Should I change it and open a PR, or will you do it yourself?
Thanks.

Best Regards,
Ikram

Getting this error when trying to install load

Describe the bug
(scikit-learn 1.0.2 (/usr/local/lib/python3.7/dist-packages), Requirement.parse('scikit-learn==0.23.2'), {'tabgan'})

To Reproduce
Steps to reproduce the behavior:

Go to Google Colab or Kaggle.
Run "from tabgan.sampler import OriginalGenerator, GANGenerator" to load the library.
See error

Expected behavior
It should load without error because the requirements are already installed via pip.

Screenshots

Desktop (please complete the following information):

OS: [e.g. Windows, Ubuntu]

Generated data from GANGenerator() gives worse result in training for regression task

I used data generated by GANGenerator for training, and it gives worse results on my regression task. I tried both the default parameters and some custom parameters replicating the example in this repo. Neither works well. Here is the code I used to create the model:
new_train, new_target = GANGenerator(gen_x_times=0.5, cat_cols=None, bot_filter_quantile=0.001, top_filter_quantile=0.999, is_post_process=True, adversaial_model_params={ "metrics": "MSE", "max_depth": 15, "max_bin": 100, "n_estimators": 600, "learning_rate": 0.001, "random_state": 42, }, pregeneration_frac=2, epochs=300).generate_data_pipe(train, target, test, deep_copy=True, only_adversarial=False, use_adversarial=True, only_generated_data=True)
One thing I noticed is that training stops at around 90-120 epochs, even though I set it to 300. I'm not sure whether I set some parameters poorly, so I would like suggestions on how to define the parameters for this model to work with regression tasks. By the way, my data has already been preprocessed and cleaned, and contains one-hot encoded and scaled features.

Reproducibility issue

Describe the bug
The Tab-GAN application does not produce the same results when run multiple times with the same parameter settings.

To Reproduce
Steps to reproduce the behavior:

Go to '...'
Click on '....'
Scroll down to '....'
See error

Expected behavior
When run several times with the same parameters, the generated output should be the same. As it stands, it is not possible to tell whether a result is good or just a lucky run.
A random state is only present in the adversarial filtering, not in the GAN synthesizer.
Is there a way to always get the same result when the parameters are unchanged?
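
A common first attempt is to seed the global random number generators before calling the generator. The sketch below assumes the GAN draws from the global Python, NumPy and PyTorch generators; whether that makes tabgan fully deterministic (especially on GPU) is not confirmed in the issue. `seed_everything` is a hypothetical helper.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed Python, NumPy and (if installed) PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


seed_everything(42)
first = np.random.rand(3)
seed_everything(42)
second = np.random.rand(3)
print(np.array_equal(first, second))  # True
```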

Generating samples for only One class among two available

Hello,

After giving my data, which contains two balanced classes (50% of the data each), as input to GANGenerator, the generated data came out as:
5.665401393688743e-05 for one class and 0.9999433459860632 for the other.

Do you know the origin of this issue?

Best regards

Cannot run example code in readme without error

Error
Fitting CTGAN transformers for each column: 0%| | 0/5 [00:00<?, ?it/s] Traceback (most recent call last):
File "run_2.py", line 12, in
new_train2, new_target2 = GANGenerator().generate_data_pipe(train, target, test, )
File "C:\Users\User\Repos\tabgantest\lib\site-packages\tabgan\abc_sampler.py", line 69, in generate_data_pipe
new_train, new_target, test_df, only_generated_data
File "C:\Users\User\Repos\tabgantest\lib\site-packages\tabgan\sampler.py", line 247, in generate_data
ctgan.fit(train_df, [], epochs=self.gan_params["epochs"])
File "C:\Users\User\Repos\tabgantest\lib\site-packages_ctgan\synthesizer.py", line 163, in fit
self.transformer.fit(train_data, discrete_columns)
File "C:\Users\User\Repos\tabgantest\lib\site-packages_ctgan\transformer.py", line 77, in fit
meta = self._fit_continuous(column, column_data)
File "C:\Users\User\Repos\tabgantest\lib\site-packages\sklearn\utils_testing.py", line 313, in wrapper
return fn(*args, **kwargs)
File "C:\Users\User\Repos\tabgantest\lib\site-packages_ctgan\transformer.py", line 34, in _fit_continuous
n_init=1
TypeError: init() takes 1 positional argument but 2 positional arguments (and 3 keyword-only arguments) were given

Describe the bug
Following the instructions from the readme, I created a python venv and installed tabgan then placed the example code from the git readme and it was unable to complete without error.

To Reproduce
Steps to reproduce the behavior:
pip install tabgan
copy and paste first block of code given in readme into python file
attempt to run code

Expected behavior
For the given code to function without error

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: Windows

Additional context
Python 3.7.7

Cannot install tabgan

Describe the bug
A clear and concise description of what the bug is.
I installed tabgan on Colab using pip, but when I ran import tabgan I got the following error:
ContextualVersionConflict: (scikit-learn 1.0.1 (/usr/local/lib/python3.7/dist-packages), Requirement.parse('scikit-learn==0.23.2'), {'tabgan'})
On Windows 10 with Python 3.9, the tabgan installation fails because the scikit-learn wheel cannot be built.
To Reproduce
Steps to reproduce the behavior:

Go to colab
execute !pip install tabgan
execute import tabgan
See error

Expected behavior
A clear and concise description of what you expected to happen.
import tabgan should run without producing any errors
Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: [e.g. Windows, Ubuntu] Windows 10;colab

Additional context
Add any other context about the problem here.

Kindly ask for help in Google Colab

Dear Insaf Ashrapov,
Lead Data Scientist at Sberbank,

thank you for your fruitful ideas and codes regarding GAN for tabular data published at:
https://github.com/Diyago/GAN-for-tabular-data

Recently, we applied your example code from the section „How to use the library“ in Google Colab and it performed well on our data. However, for the last two weeks we have been getting an error in the Google Colab environment.
[Screenshot: colab_error]

The error obtained after running your example code in Google Colab is shown in the attached figure.

I would be very grateful if you could look at this issue.

Sincerely
Lubomir, Slovakia

generated Cov is not that close

This is maybe more of a discussion point. I have three variables with a covariance C, and I want the generated data to have the same covariance. At the moment the generated C is not that close to the true C.

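import numpy as np
import pandas as pd
from sklearn import linear_model
from tabgan.sampler import GANGenerator

n = 1000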
cov = np.array([[3, 2, 1], [2, 3, 2], [1, 2, 3]]) / 3
print(cov)

X = np.random.multivariate_normal([1, 2, 3], cov, n)

train = np.arange(n // 2)
test = np.arange(n // 2, n)

# predict x0 from x1 and x2
reg = linear_model.LinearRegression()
reg.fit(X[train, 1:], X[train, 0])
pred1 = reg.predict(X[test, 1:])
print(np.mean(abs(X[test, 0] - pred1)))

train_df = pd.DataFrame(X[train, 1:], columns=['a', 'b'])
target = pd.DataFrame(X[train, 0], columns=['y'])
test_df = pd.DataFrame(X[test, 1:], columns=list(train_df))

gen_x, gen_y = GANGenerator(gen_x_times=1.1).generate_data_pipe(train_df, target, test_df)

print(len(gen_x))

X2 = np.zeros((gen_x.shape[0], 3))
X2[:, 0] = gen_y
X2[:, 1:] = gen_x

print(X2.mean(axis=0))
print(np.cov(X2.T))

pred2 = reg.predict(X2[:, 1:])

print(np.mean(abs(pred2 - X2[:, 0])))

[1.24107277 2.19440913 2.56302274]
[[ 1.32843079  0.18899326 -0.05254613]
 [ 0.18899326  1.25402907  0.16901545]
 [-0.05254613  0.16901545  1.26216313]]
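
To put a number on "not that close", one option is to compare the two sample covariance matrices with a matrix norm. The sketch below uses the Frobenius norm and, as stand-ins for generated data, a second sample from the same distribution and a sample with independent columns; `covariance_gap` is a hypothetical helper and no GAN is involved.

```python
import numpy as np


def covariance_gap(real, generated):
    """Frobenius norm of the difference between two sample covariance matrices."""
    return float(np.linalg.norm(np.cov(real.T) - np.cov(generated.T), ord="fro"))


rng = np.random.default_rng(0)
cov = np.array([[3, 2, 1], [2, 3, 2], [1, 2, 3]]) / 3
real = rng.multivariate_normal([1, 2, 3], cov, size=1000)
resampled = rng.multivariate_normal([1, 2, 3], cov, size=1000)
independent = rng.normal([1, 2, 3], 1.0, size=(1000, 3))

print(covariance_gap(real, resampled))    # small: same distribution
print(covariance_gap(real, independent))  # larger: correlations are lost
```

The printed covariance in the issue has near-zero off-diagonal entries, so it resembles the independent case more than the resampled one.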

ContextualVersionConflict: (scikit-learn 1.0.2 (/usr/local/lib/python3.7/dist-packages), Requirement.parse('scikit-learn==0.23.2'), {'tabgan'})

I ran into a ContextualVersionConflict error when running this line of code: "from tabgan.sampler import GANGenerator". Any help on this, please?

_CTGANSynthesizer

Dear researchers,

When I run run_experiment.py, I get ImportError: cannot import name '_CTGANSynthesizer' from 'ctgan'. I find that '_CTGANSynthesizer' is not defined in ctgan. Am I right, or am I missing something? Thank you very much!

Regards,
Yufei

My pip-installed packages are as follows:
category-encoders 2.1.0
certifi 2021.10.8
charset-normalizer 2.0.12
colorama 0.4.4
idna 3.3
joblib 1.1.0
lightgbm 2.3.1
numpy 1.18.1
threadpoolctl 3.1.0
torch 1.8.0 (this is the higher version)
torchvision 0.12.0
tqdm 4.61.1
typing_extensions 4.2.0
urllib3 1.26.9
wheel 0.37.1

Only 160 samples can be used as input for generating new samples, and some generated columns contain negative values

I have tabular data with 200k samples, but running the code gave me a memory error. Therefore, I had to reduce the dataset to only 160 samples. Can we generate a new dataset from more than 100k or 200k samples?
Also, some columns of the generated dataset contain negative integer values, although they are supposed to hold positive integers.
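
For the negative values, one possible workaround is to post-process the generated frame. This is not a tabgan feature, and clipping changes the generated distribution near zero. A minimal sketch with made-up data; `enforce_non_negative_ints` is a hypothetical helper:

```python
import pandas as pd


def enforce_non_negative_ints(df, columns):
    """Return a copy where the given columns are clipped at 0 and rounded to ints."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].clip(lower=0).round().astype("int64")
    return out


generated = pd.DataFrame({"count": [3.2, -1.7, 0.4], "price": [9.5, -0.3, 2.0]})
print(enforce_non_negative_ints(generated, ["count"]))
```

Only the listed columns are touched, so columns where negative values are legitimate stay as generated.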

ValueError: Input X contains NaN although NaN filtered

Describe the bug
Although I filter all NaN from my df, I get a ValueError and don't understand why. This is what I am doing:

`
data = pd.read_excel(path_data)

start_index = data.columns.get_loc("start_column")
end_index = data.columns.get_loc("end_column")
columns_between = data.columns[start_index:end_index]

df = data[columns_between]
df = df.dropna()
train, test = train_test_split(df, test_size=0.2, random_state=42)
target = pd.DataFrame({'Y': [1.0] * train.shape[0]}) #as every line in the dataset is not generated, I suppose I just make a target df with ones only

new_train3, new_target3 = GANGenerator(gen_x_times=1.1, cat_cols=None,
bot_filter_quantile=0.001, top_filter_quantile=0.999, is_post_process=True,
adversarial_model_params={
"metrics": "AUC", "max_depth": 2, "max_bin": 100,
"learning_rate": 0.02, "random_state": 42, "n_estimators": 500,
}, pregeneration_frac=2, only_generated_data=False,
gan_params = {"batch_size": 500, "patience": 25, "epochs" : 500,}).generate_data_pipe(train, target,
test, deep_copy=True, only_adversarial=False, use_adversarial=True)
`

Result:

ValueError: Input X contains NaN.
BayesianGaussianMixture does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Desktop (please complete the following information):

OS: Windows 10

Additional context
The df consists of 3600 lines and 26 columns of numpy.float64 numbers.
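
A likely cause is index alignment rather than missing values in the input: `train` keeps the shuffled row labels from `train_test_split`, while `target` is created with a fresh 0..n-1 index, and a traceback elsewhere on this page shows the library assigning the target as a column of the train frame (`train_df[self.TEMP_TARGET] = target`). That this is what happens in this run is an assumption. The effect can be reproduced with a small, made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10, dtype="float64")})
# Mimic a shuffled split: the train rows keep their original labels.
train = df.iloc[[9, 3, 7, 1, 8, 5, 0, 6]]

# A target built from scratch gets a fresh 0..n-1 index.
target = pd.DataFrame({"Y": [1.0] * train.shape[0]})

# Column assignment aligns on index labels, so unmatched rows become NaN.
probe = train.copy()
probe["Y"] = target["Y"]
print(int(probe["Y"].isna().sum()))  # 2 (labels 8 and 9 have no match)

# Reusing the train index keeps the two frames aligned.
target = pd.DataFrame({"Y": [1.0] * train.shape[0]}, index=train.index)
probe = train.copy()
probe["Y"] = target["Y"]
print(int(probe["Y"].isna().sum()))  # 0
```

Calling `reset_index(drop=True)` on both `train` and `test` right after the split is an alternative way to get matching labels.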
