
sdv-dev / ctgan

1.1K stars · 21 watchers · 271 forks · 1.8 MB

Conditional GAN for generating synthetic tabular data.

License: Other

Languages: Python 94.56%, Makefile 5.44%
Topics: synthetic-data, generative-adversarial-network, tabular-data, data-generation, synthetic-data-generation

ctgan's People

Contributors

amontanez24, baukebrenninkmeijer, csala, deathn0t, fealho, frances-h, jdtheripperpc, katxiao, kevinykuo, leix28, lurosenb, matheusccouto, mfhbree, npatki, oregonpillow, pvk-developer, r-palazzo, sdv-team, tejuafonja, timvink


ctgan's Issues

Targeted sample generation

Instead of random sample generation, is it possible to create samples based on predefined inputs (discrete or continuous)? For example, with the adult dataset, if I wanted my sample to be of age X and from country Y, could the GAN generate the rest of the characteristics?
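One workaround while there is no built-in conditional sampling is rejection sampling: oversample from the fitted model and keep only the rows that match the desired values. A minimal sketch, assuming a fitted synthesizer named ctgan whose sample() returns a DataFrame; the condition values are hypothetical and this is not a CTGAN feature.

import pandas as pd

def sample_with_conditions(ctgan, conditions, n_rows, batch_size=10000, max_tries=50):
    """Oversample from the fitted synthesizer and keep rows matching `conditions`."""
    kept = []
    for _ in range(max_tries):
        batch = ctgan.sample(batch_size)
        mask = pd.Series(True, index=batch.index)
        for column, value in conditions.items():
            mask &= batch[column] == value
        kept.append(batch[mask])
        if sum(len(part) for part in kept) >= n_rows:
            break
    return pd.concat(kept).head(n_rows)

# Hypothetical usage on the adult dataset:
# subset = sample_with_conditions(ctgan, {'native-country': ' United-States'}, 1000)

Rejection sampling is wasteful when the conditioned values are rare, but it needs no change to the model itself.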

Without Discrete columns the Loss increase strongly

Description

I have fitted the synthesizer on the same dataset several times, making small changes.
In every run, all epochs had losses (Loss G & Loss D) below |4|; none of the maxima passed that threshold, and by epoch 300 I usually get less than 0.8.
But when I train the model without the only discrete column in my dataset, both losses increase sharply. Loss G peaked at -95 in epoch 43. On the other hand, Loss D peaked at -2333 in epoch 37 and scored -9 in epoch 300.

What I Did

I created a discrete column with only one label, and the losses decreased back to the normal range.
Do you know what could have happened? Why do the losses increase that way when I remove the only discrete column?


Check discrete_columns valid before fitting

As @csala mentioned in #24 it would be good to check that discrete_columns list is valid at the beginning of fitting instead of silently ignoring invalid columns then throwing an error later in the fitting process.

What I Did

Would something similar to this at the beginning of the fit function work?

import sys

# Initialize the flag once so a later valid column does not reset it.
col_error = False
for col in discrete_columns:
    if col not in data.columns:
        print("*discrete_columns error*")
        print(col + " not found in data")
        col_error = True
if col_error:
    sys.exit()
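An alternative sketch (not the maintainers' implementation) that collects all offending columns and raises instead of printing and exiting:

invalid_columns = [col for col in discrete_columns if col not in data.columns]
if invalid_columns:
    raise ValueError(f'discrete_columns not found in data: {invalid_columns}')

Raising a ValueError keeps the check usable from library code, where calling sys.exit() would be surprising.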

Reorganize the project structure

A few changes can be made to improve the overall code maintainability:

  • Modularize the code, separating code dedicated to specific topics into different modules.
  • Split some parts of the code into smaller, single-purpose methods.
  • Rename the Cond class to ConditionalGenerator.
  • Make random_choice_prob_index a method of the ConditionalGenerator class.
  • Make calc_gradient_penalty a method of the Discriminator class.
  • Make cond_loss a CTGANSynthesizer method.
  • Make a _build_models method where all the internal models are built. See #4
  • Separate the fitting process in two methods:
    • One with categorical variables, which uses the cond vector
    • Another one without categorical variables.
  • Rename some variables to be a bit more verbose
  • Replace plain asserts with more verbose, type-specific exceptions
  • Add docstrings to all the public methods
  • Remake cli.py as __main__.py and use argparse instead of absl
  • Move epochs arg to fit method. See #5

The new project structure should look as follows:

ctgan
├── __init__.py
├── __main__.py: New CLI module
├── conditional.py: ConditionalGenerator
├── data.py: Methods to write and read data
├── models.py: Discriminator, Generator and Residual classes
├── sampler.py: Sampler class
├── synthesizer.py: CTGANSynthesizer class
└── transformer.py: DataTransformer class

Differential privacy

Does this implementation guarantee differential privacy? This is mentioned in the paper as possible with CTGAN.

Is synthetic data always anonymous?

One of the major use cases for synthetic tabular data is creating anonymous data for others to work with.

I was wondering whether, by chance during sampling, a sampled DF could contain rows similar enough to some of the original DF rows to allow someone or something to be identified?

E.g. For a given DF containing columns: age , education, race, test-score, .....etc

Perhaps age, education, race, test-score are the only fields required to identify someone. It is therefore important that your sampled data does not contain a combination of age, education, race, test-score that is also present in the original DF.

I understand that this check can easily be performed outside of CTGAN, but in terms of how CTGAN works, is it possible that CTGAN generates personally identifiable information by chance during sampling?
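For reference, such an outside-of-CTGAN check could look like the following sketch, which counts sampled rows whose quasi-identifier combination also appears in the original DF. The column names and the frame names original_df / sampled_df are hypothetical.

import pandas as pd

quasi_identifiers = ['age', 'education', 'race', 'test-score']  # hypothetical columns

# Inner-join the sampled rows against the de-duplicated original combinations.
overlap = sampled_df.merge(
    original_df[quasi_identifiers].drop_duplicates(),
    on=quasi_identifiers,
    how='inner',
)
print(f'{len(overlap)} sampled rows share a quasi-identifier combination with the original data')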

Gaussian mixture params/training sample size

Any tips on optimizing performance/training time for the Bayesian Gaussian mixture training phase? Could we consider exposing the parameters and perhaps include sampling the training set? This piece doesn't seem to scale well to bigger datasets.
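One possible direction, sketched below assuming scikit-learn is used directly (this is not CTGAN's internal code): fit the Bayesian Gaussian mixture on a random subsample of each continuous column, which is often enough to locate the modes.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_bgm_on_sample(values, n_clusters=10, max_rows=10000, seed=0):
    """Fit a Bayesian GMM on at most `max_rows` randomly chosen values."""
    values = np.asarray(values).reshape(-1, 1)
    rng = np.random.default_rng(seed)
    if len(values) > max_rows:
        values = values[rng.choice(len(values), max_rows, replace=False)]
    bgm = BayesianGaussianMixture(
        n_components=n_clusters,
        weight_concentration_prior_type='dirichlet_process',
        weight_concentration_prior=0.001,
        n_init=1,
    )
    return bgm.fit(values)

Exposing max_rows (and n_clusters) as user-facing parameters would let large datasets trade a little mode-estimation accuracy for a much shorter preprocessing phase.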

Saving a CTGAN model

  • CTGAN version:
  • Python version:
  • Operating System:

Description

I would like to save a trained CTGAN model to disk. I did this in the past with TGAN, using the tgan.save command.

What I Did

It seems it's not possible to save models

Thanks for your help
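One workaround until a save/load API exists is to pickle the fitted synthesizer object. A minimal sketch (not an official CTGAN feature; models trained on CUDA may need extra care, e.g. moving tensors to CPU first):

import pickle

# `ctgan` is assumed to be an already fitted CTGANSynthesizer instance
with open('ctgan_model.pkl', 'wb') as f:
    pickle.dump(ctgan, f)

with open('ctgan_model.pkl', 'rb') as f:
    loaded = pickle.load(f)

samples = loaded.sample(1000)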

Doubt with Likelihood fitness metric

Hi,

I am trying to understand the evaluation metrics and framework. You have mentioned that L_test will detect mode collapse. How does that happen?
Also, it would be great if you could elaborate on: "But this metric introduces the prior knowledge of the structure of S_0 which is not necessarily encoded in T_syn."

Thanks

Implementing an early stopping parameter

  • CTGAN version: 0.2.1
  • Python version: Python 3.7
  • Operating System: Google Colab

Description

There doesn't seem to be any clear convergence with increasing epochs after a while. Past approximately 30 epochs in the example below, the generator consistently doesn't improve much, and the discriminator doesn't show any clear improvement trend either.

Would it be reasonable to implement an early stopping parameter?

What I Did

Using the ctgan demo data setup:

Epoch 1, Loss G: 1.7703, Loss D: -0.4140
Epoch 2, Loss G: 0.9132, Loss D: 0.1742
Epoch 3, Loss G: 0.3882, Loss D: 0.1333
Epoch 4, Loss G: -0.6055, Loss D: 0.0968
Epoch 5, Loss G: -0.6484, Loss D: 0.0696
Epoch 6, Loss G: -1.0512, Loss D: -0.1672
Epoch 7, Loss G: -1.2606, Loss D: 0.0917
Epoch 8, Loss G: -1.3633, Loss D: 0.1726
Epoch 9, Loss G: -1.5499, Loss D: -0.2989
Epoch 10, Loss G: -1.7539, Loss D: -0.0215
Epoch 11, Loss G: -1.9969, Loss D: 0.1057
Epoch 12, Loss G: -1.5544, Loss D: -0.1978
Epoch 13, Loss G: -1.7372, Loss D: 0.0144
Epoch 14, Loss G: -1.7781, Loss D: -0.0129
Epoch 15, Loss G: -1.7676, Loss D: -0.0315
Epoch 16, Loss G: -1.5205, Loss D: -0.0209
Epoch 17, Loss G: -1.4795, Loss D: -0.0970
Epoch 18, Loss G: -1.2830, Loss D: -0.1672
Epoch 19, Loss G: -1.4841, Loss D: -0.1042
Epoch 20, Loss G: -0.9479, Loss D: -0.0559
Epoch 21, Loss G: -1.1393, Loss D: -0.2292
Epoch 22, Loss G: -1.1153, Loss D: -0.0840
Epoch 23, Loss G: -1.1070, Loss D: -0.2443
Epoch 24, Loss G: -1.1734, Loss D: -0.1479
Epoch 25, Loss G: -1.0624, Loss D: -0.0729
Epoch 26, Loss G: -1.0356, Loss D: -0.0566
Epoch 27, Loss G: -0.8900, Loss D: -0.2267
Epoch 28, Loss G: -0.6987, Loss D: -0.1971
Epoch 29, Loss G: -0.7723, Loss D: -0.1828
Epoch 30, Loss G: -0.8304, Loss D: -0.0157
Epoch 31, Loss G: -0.8023, Loss D: -0.0770
Epoch 32, Loss G: -0.5623, Loss D: -0.1800
Epoch 33, Loss G: -0.4977, Loss D: -0.2467
Epoch 34, Loss G: -0.3344, Loss D: -0.2764
Epoch 35, Loss G: -0.4504, Loss D: -0.2840
Epoch 36, Loss G: -0.5648, Loss D: -0.0704
Epoch 37, Loss G: -0.4736, Loss D: -0.1817
Epoch 38, Loss G: -0.5624, Loss D: -0.3123
Epoch 39, Loss G: -0.4873, Loss D: -0.3349
Epoch 40, Loss G: -0.4981, Loss D: -0.3594
Epoch 41, Loss G: -0.5530, Loss D: -0.2552
Epoch 42, Loss G: -0.8578, Loss D: -0.0993
Epoch 43, Loss G: -0.8845, Loss D: -0.0118
Epoch 44, Loss G: -0.5242, Loss D: -0.5416
Epoch 45, Loss G: -0.7406, Loss D: -0.0499
Epoch 46, Loss G: -0.5033, Loss D: -0.1821
Epoch 47, Loss G: -0.5747, Loss D: -0.0890
Epoch 48, Loss G: -0.4117, Loss D: -0.1436
Epoch 49, Loss G: -0.3480, Loss D: -0.3295
Epoch 50, Loss G: -0.1915, Loss D: -0.3795
Epoch 51, Loss G: -0.1754, Loss D: -0.2486
Epoch 52, Loss G: -0.0958, Loss D: -0.5495
Epoch 53, Loss G: -0.1818, Loss D: -0.3954
Epoch 54, Loss G: -0.4569, Loss D: -0.3138
Epoch 55, Loss G: -0.5812, Loss D: -0.2214
Epoch 56, Loss G: -0.3854, Loss D: -0.3159
Epoch 57, Loss G: -0.5512, Loss D: -0.4838
Epoch 58, Loss G: -0.1160, Loss D: -0.3924
Epoch 59, Loss G: -0.5880, Loss D: -0.4459
Epoch 60, Loss G: -0.2418, Loss D: -0.2919
Epoch 61, Loss G: -0.3057, Loss D: 0.0781
Epoch 62, Loss G: -0.3516, Loss D: -0.1905
Epoch 63, Loss G: -0.4931, Loss D: -0.1141
Epoch 64, Loss G: -0.3498, Loss D: -0.0987
Epoch 65, Loss G: -0.3793, Loss D: -0.2559
Epoch 66, Loss G: -0.3495, Loss D: -0.3343
Epoch 67, Loss G: -0.4555, Loss D: -0.2773
Epoch 68, Loss G: -0.2690, Loss D: -0.2480
Epoch 69, Loss G: -0.3576, Loss D: -0.2565
Epoch 70, Loss G: -0.4245, Loss D: -0.2531
Epoch 71, Loss G: -0.3180, Loss D: -0.2390
Epoch 72, Loss G: -0.3671, Loss D: -0.2645
Epoch 73, Loss G: -0.4187, Loss D: 0.0591
Epoch 74, Loss G: -0.4713, Loss D: -0.0732
Epoch 75, Loss G: -0.2109, Loss D: 0.1250
Epoch 76, Loss G: -0.6413, Loss D: 0.0148
Epoch 77, Loss G: -0.3381, Loss D: -0.0976
Epoch 78, Loss G: -0.3368, Loss D: -0.1375
Epoch 79, Loss G: -0.2787, Loss D: -0.1055
Epoch 80, Loss G: -0.5503, Loss D: -0.1662
Epoch 81, Loss G: -0.3985, Loss D: -0.0212
Epoch 82, Loss G: -0.5057, Loss D: -0.3701
Epoch 83, Loss G: -0.4147, Loss D: -0.1540
Epoch 84, Loss G: -0.7346, Loss D: -0.3526
Epoch 85, Loss G: -0.6238, Loss D: -0.3375
Epoch 86, Loss G: -0.5685, Loss D: -0.3796
Epoch 87, Loss G: -0.3179, Loss D: -0.1281
Epoch 88, Loss G: -0.4464, Loss D: -0.3147
Epoch 89, Loss G: -0.3798, Loss D: -0.1411
Epoch 90, Loss G: -0.4874, Loss D: -0.1330
Epoch 91, Loss G: -0.3701, Loss D: -0.2761
Epoch 92, Loss G: -0.4438, Loss D: -0.2787
Epoch 93, Loss G: -0.3701, Loss D: -0.3833
Epoch 94, Loss G: -0.3261, Loss D: -0.2285
Epoch 95, Loss G: -0.4223, Loss D: -0.1745
Epoch 96, Loss G: -0.0369, Loss D: -0.4372
Epoch 97, Loss G: -0.2421, Loss D: -0.2245
Epoch 98, Loss G: -0.4262, Loss D: -0.3120
Epoch 99, Loss G: -0.1879, Loss D: -0.2658
Epoch 100, Loss G: 0.1434, Loss D: -0.2949
Epoch 101, Loss G: 0.0634, Loss D: -0.2145
Epoch 102, Loss G: -0.3514, Loss D: -0.2244
Epoch 103, Loss G: -0.1506, Loss D: -0.1028
Epoch 104, Loss G: -0.1472, Loss D: -0.2615
Epoch 105, Loss G: -0.6987, Loss D: -0.2609
Epoch 106, Loss G: -0.3679, Loss D: -0.3388
Epoch 107, Loss G: -0.3039, Loss D: -0.1347
Epoch 108, Loss G: -0.1479, Loss D: -0.3870
Epoch 109, Loss G: -0.5099, Loss D: -0.3273
Epoch 110, Loss G: -0.7379, Loss D: 0.0718
Epoch 111, Loss G: -0.3503, Loss D: -0.1897
Epoch 112, Loss G: -0.8023, Loss D: -0.2082
Epoch 113, Loss G: -1.1229, Loss D: -0.2555
Epoch 114, Loss G: -0.7507, Loss D: -0.3576
Epoch 115, Loss G: -0.5787, Loss D: -0.1911
Epoch 116, Loss G: -0.5933, Loss D: -0.2165
Epoch 117, Loss G: -0.4494, Loss D: -0.2001
Epoch 118, Loss G: -0.0438, Loss D: -0.0300
Epoch 119, Loss G: -0.5589, Loss D: -0.3625
Epoch 120, Loss G: -0.5553, Loss D: -0.1573
Epoch 121, Loss G: -0.4438, Loss D: -0.2237
Epoch 122, Loss G: -0.5670, Loss D: -0.2057
Epoch 123, Loss G: -0.5754, Loss D: -0.4514
Epoch 124, Loss G: -0.3274, Loss D: 0.0209
Epoch 125, Loss G: -0.3591, Loss D: -0.1435
Epoch 126, Loss G: -0.2459, Loss D: -0.2918
Epoch 127, Loss G: -0.3238, Loss D: -0.6921
Epoch 128, Loss G: -0.2784, Loss D: -0.4857
Epoch 129, Loss G: 0.0843, Loss D: -0.1080
Epoch 130, Loss G: -0.2665, Loss D: -0.1162
Epoch 131, Loss G: -0.2545, Loss D: -0.0934
Epoch 132, Loss G: -0.0762, Loss D: -0.1961
Epoch 133, Loss G: -0.1379, Loss D: -0.3160
Epoch 134, Loss G: -0.0483, Loss D: -0.1415
Epoch 135, Loss G: -0.0881, Loss D: -0.2957
Epoch 136, Loss G: -0.0513, Loss D: -0.3575
Epoch 137, Loss G: -0.5542, Loss D: 0.0639
Epoch 138, Loss G: -0.2930, Loss D: 0.1514
Epoch 139, Loss G: -0.3267, Loss D: -0.4442
Epoch 140, Loss G: -0.4462, Loss D: -0.2140
Epoch 141, Loss G: -0.7850, Loss D: 0.1541
Epoch 142, Loss G: -0.6869, Loss D: -0.1997
Epoch 143, Loss G: -0.5455, Loss D: 0.0698
Epoch 144, Loss G: -0.9430, Loss D: -0.2578
Epoch 145, Loss G: -1.0881, Loss D: -0.0794
Epoch 146, Loss G: -1.0721, Loss D: -0.1857
Epoch 147, Loss G: -0.9823, Loss D: -0.1852
Epoch 148, Loss G: -0.6183, Loss D: -0.1901
Epoch 149, Loss G: -0.5895, Loss D: -0.4376
Epoch 150, Loss G: -0.2892, Loss D: -0.4731
Epoch 151, Loss G: -0.4931, Loss D: -0.1163
Epoch 152, Loss G: -0.6272, Loss D: -0.2558
Epoch 153, Loss G: -0.6970, Loss D: -0.1470
Epoch 154, Loss G: -0.4659, Loss D: -0.0424
Epoch 155, Loss G: -0.8373, Loss D: -0.1084
Epoch 156, Loss G: -0.8139, Loss D: -0.1349
Epoch 157, Loss G: -0.6395, Loss D: 0.0300
Epoch 158, Loss G: -0.6846, Loss D: -0.0564
Epoch 159, Loss G: -0.7239, Loss D: -0.2507
Epoch 160, Loss G: -0.6853, Loss D: -0.1488
Epoch 161, Loss G: -0.4689, Loss D: -0.0547
Epoch 162, Loss G: -0.4398, Loss D: -0.4672
Epoch 163, Loss G: -0.3415, Loss D: -0.2588
Epoch 164, Loss G: -0.6197, Loss D: -0.4692
Epoch 165, Loss G: -0.7124, Loss D: 0.3233
Epoch 166, Loss G: -0.3810, Loss D: -0.2331
Epoch 167, Loss G: -0.2038, Loss D: -0.6168
Epoch 168, Loss G: -0.4305, Loss D: -0.4648
Epoch 169, Loss G: 0.2388, Loss D: 0.0948
Epoch 170, Loss G: -0.5747, Loss D: -0.3976
Epoch 171, Loss G: -0.6325, Loss D: -0.2826
Epoch 172, Loss G: -0.6744, Loss D: 0.0090
Epoch 173, Loss G: -0.5990, Loss D: -0.3130
Epoch 174, Loss G: -0.0799, Loss D: -0.0647
Epoch 175, Loss G: -0.1088, Loss D: -0.3034
Epoch 176, Loss G: -0.5236, Loss D: -0.1425
Epoch 177, Loss G: -0.8875, Loss D: -0.1507
Epoch 178, Loss G: -0.8301, Loss D: -0.4523
Epoch 179, Loss G: -0.9683, Loss D: -0.1435
Epoch 180, Loss G: -0.5027, Loss D: -0.1874
Epoch 181, Loss G: 0.0402, Loss D: -0.5144
Epoch 182, Loss G: -0.3731, Loss D: 0.0271
Epoch 183, Loss G: -0.4859, Loss D: 0.2857
Epoch 184, Loss G: -0.6029, Loss D: 0.1006
Epoch 185, Loss G: -0.4714, Loss D: -0.0931
Epoch 186, Loss G: -0.4526, Loss D: -0.3047
Epoch 187, Loss G: -0.6337, Loss D: -0.5780
Epoch 188, Loss G: -0.4979, Loss D: -0.3884
Epoch 189, Loss G: -0.4898, Loss D: -0.1129
Epoch 190, Loss G: -0.3288, Loss D: -0.4173
Epoch 191, Loss G: -0.5815, Loss D: 0.0365
Epoch 192, Loss G: -0.5618, Loss D: -0.2749
Epoch 193, Loss G: -0.4344, Loss D: -0.2513
Epoch 194, Loss G: -0.4296, Loss D: -0.1379
Epoch 195, Loss G: -0.5141, Loss D: -0.1683
Epoch 196, Loss G: -0.3886, Loss D: 0.0902
Epoch 197, Loss G: -0.1185, Loss D: -0.3041
Epoch 198, Loss G: -0.4368, Loss D: -0.2042
Epoch 199, Loss G: -0.8212, Loss D: -0.2131
Epoch 200, Loss G: -0.7284, Loss D: -0.1206
Epoch 201, Loss G: -0.3844, Loss D: -0.2710
Epoch 202, Loss G: -0.2199, Loss D: -0.2340
Epoch 203, Loss G: -0.4231, Loss D: -0.3787
Epoch 204, Loss G: -0.0590, Loss D: -0.2787
Epoch 205, Loss G: -0.4943, Loss D: -0.2964
Epoch 206, Loss G: -0.5960, Loss D: -0.2904
Epoch 207, Loss G: -0.3736, Loss D: -0.4725
Epoch 208, Loss G: -0.5367, Loss D: -0.0355
Epoch 209, Loss G: -0.5343, Loss D: -0.2065
Epoch 210, Loss G: -0.2958, Loss D: -0.2376
Epoch 211, Loss G: -0.3486, Loss D: -0.2305
Epoch 212, Loss G: -0.0698, Loss D: -0.0447
Epoch 213, Loss G: -0.4837, Loss D: -0.1511
Epoch 214, Loss G: -0.3648, Loss D: -0.2404
Epoch 215, Loss G: -0.3385, Loss D: -0.4328
Epoch 216, Loss G: -0.6249, Loss D: 0.0591
Epoch 217, Loss G: -0.4153, Loss D: -0.1000
Epoch 218, Loss G: -0.1442, Loss D: -0.3652
Epoch 219, Loss G: -0.2455, Loss D: 0.0505
Epoch 220, Loss G: -0.5413, Loss D: -0.2170
Epoch 221, Loss G: -0.6011, Loss D: 0.2106
Epoch 222, Loss G: -0.3802, Loss D: -0.2623
Epoch 223, Loss G: -0.4969, Loss D: -0.1041
Epoch 224, Loss G: -0.6534, Loss D: -0.0594
Epoch 225, Loss G: -0.5426, Loss D: -0.4582
Epoch 226, Loss G: -0.2616, Loss D: -0.1595
Epoch 227, Loss G: -0.3934, Loss D: 0.0174
Epoch 228, Loss G: -0.2554, Loss D: 0.0515
Epoch 229, Loss G: -0.3462, Loss D: -0.2309
Epoch 230, Loss G: -0.6162, Loss D: 0.0820
Epoch 231, Loss G: -0.7277, Loss D: -0.0866
Epoch 232, Loss G: -0.5345, Loss D: -0.1886
Epoch 233, Loss G: -0.3045, Loss D: 0.0544
Epoch 234, Loss G: -0.3265, Loss D: 0.0773
Epoch 235, Loss G: -0.4100, Loss D: 0.0844
Epoch 236, Loss G: -0.4308, Loss D: 0.0168
Epoch 237, Loss G: -0.6521, Loss D: -0.0632
Epoch 238, Loss G: -0.5340, Loss D: -0.0240
Epoch 239, Loss G: -0.4905, Loss D: -0.2091
Epoch 240, Loss G: -0.4170, Loss D: -0.0334
Epoch 241, Loss G: -0.5199, Loss D: 0.1028
Epoch 242, Loss G: -0.4939, Loss D: -0.0611
Epoch 243, Loss G: -0.8483, Loss D: -0.0376
Epoch 244, Loss G: -0.7946, Loss D: 0.0255
Epoch 245, Loss G: -0.5445, Loss D: 0.1557
Epoch 246, Loss G: -0.2708, Loss D: -0.1850
Epoch 247, Loss G: -0.4394, Loss D: -0.1037
Epoch 248, Loss G: -0.3529, Loss D: -0.1705
Epoch 249, Loss G: -0.3365, Loss D: 0.0226
Epoch 250, Loss G: -0.4843, Loss D: 0.0969
Epoch 251, Loss G: -0.4460, Loss D: 0.1088
Epoch 252, Loss G: -0.5106, Loss D: -0.1222
Epoch 253, Loss G: -0.6710, Loss D: 0.1131
Epoch 254, Loss G: -0.6829, Loss D: -0.1389
Epoch 255, Loss G: -0.3559, Loss D: -0.2418
Epoch 256, Loss G: -0.6636, Loss D: -0.1503
Epoch 257, Loss G: -0.5845, Loss D: -0.0170
Epoch 258, Loss G: -0.9466, Loss D: -0.0344
Epoch 259, Loss G: -0.7826, Loss D: 0.0345
Epoch 260, Loss G: -0.8233, Loss D: 0.0200
Epoch 261, Loss G: -0.8138, Loss D: -0.1103
Epoch 262, Loss G: -0.7675, Loss D: -0.1771
Epoch 263, Loss G: -0.6528, Loss D: 0.0845
Epoch 264, Loss G: -0.7947, Loss D: -0.0701
Epoch 265, Loss G: -1.0287, Loss D: -0.0283
Epoch 266, Loss G: -0.5619, Loss D: -0.1113
Epoch 267, Loss G: -0.4039, Loss D: -0.0434
Epoch 268, Loss G: -0.6310, Loss D: -0.1573
Epoch 269, Loss G: -0.8943, Loss D: -0.3958
Epoch 270, Loss G: -0.8277, Loss D: -0.1480
Epoch 271, Loss G: -0.9839, Loss D: -0.1470
Epoch 272, Loss G: -0.4073, Loss D: -0.3034
Epoch 273, Loss G: -0.3445, Loss D: -0.0324
Epoch 274, Loss G: -0.2810, Loss D: -0.2098
Epoch 275, Loss G: -0.2327, Loss D: 0.0027
Epoch 276, Loss G: -0.3266, Loss D: -0.1317
Epoch 277, Loss G: -0.7164, Loss D: -0.0576
Epoch 278, Loss G: -0.9443, Loss D: -0.0400
Epoch 279, Loss G: -0.9377, Loss D: -0.0443
Epoch 280, Loss G: -0.6678, Loss D: -0.2004
Epoch 281, Loss G: -0.8338, Loss D: 0.1530
Epoch 282, Loss G: -0.6346, Loss D: 0.0073
Epoch 283, Loss G: -0.5432, Loss D: -0.2862
Epoch 284, Loss G: -0.4596, Loss D: -0.1362
Epoch 285, Loss G: -0.4565, Loss D: -0.0109
Epoch 286, Loss G: -0.5484, Loss D: -0.0134
Epoch 287, Loss G: -0.9109, Loss D: -0.3872
Epoch 288, Loss G: -0.5126, Loss D: -0.0519
Epoch 289, Loss G: -0.5753, Loss D: -0.2742
Epoch 290, Loss G: -0.4334, Loss D: -0.0615
Epoch 291, Loss G: -0.3641, Loss D: -0.1875
Epoch 292, Loss G: -0.1422, Loss D: -0.2141
Epoch 293, Loss G: -0.7433, Loss D: -0.2379
Epoch 294, Loss G: -0.6039, Loss D: -0.2734
Epoch 295, Loss G: -0.6171, Loss D: -0.0339
Epoch 296, Loss G: -0.8910, Loss D: -0.1048
Epoch 297, Loss G: -1.0306, Loss D: -0.2785
Epoch 298, Loss G: -0.8741, Loss D: 0.0340
Epoch 299, Loss G: -0.6015, Loss D: -0.2684
Epoch 300, Loss G: -0.7981, Loss D: -0.0788
time: 13min 36s
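For what it's worth, a generic patience-based tracker like the sketch below is what such a parameter might wrap; note that GAN losses are noisy and not a reliable convergence signal, so this is only a mechanical illustration, not part of CTGAN:

class EarlyStopping:
    """Stop when the tracked loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.counter = 0

    def step(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience  # True means "stop training"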

What is the number of training epochs?

  • CTGAN version: latest
  • Python version: 3.7.7
  • Operating System: Windows 10

Description

Not so much an issue but more of a question. What is the default number of training epochs if I don't specify the number?

What I Did


import pandas as pd

from ctgan import CTGANSynthesizer

# STEP 1: Load data
data = pd.read_csv('D:/test/Machine Learning/FULLDATA.csv')

discrete_columns = list(data.columns)  # marks every column as discrete

# STEP 2: Fit CTGAN to your data
# Create an instance of the CTGANSynthesizer class and fit it, passing your data
# and the list of discrete columns.
ctgan = CTGANSynthesizer()
ctgan.fit(data, discrete_columns)

# Create synthetic data for x number of rows
samples = ctgan.sample(1000)

# Save the synthetic database to csv
samples.to_csv(r'D:/test/Machine Learning/syntheticdatabase.csv')


Gaussian approximation of continuous variables clearly visible in non-Gaussian/non-multimodal data

Columns where the continuous data is distributed in a way that is hard to approximate with Gaussians (e.g. dates that increase in frequency) and follows a line are not well approximated by the GMM. I've not used the BGMT that much, because it is much slower, but if this does not occur there, please correct me. However, using a GMM, the following pattern occurs. The plots show the cumulative distribution.

You can clearly see the several Gaussians that are fit to the curve, resulting in a fit that is not horrible but definitely not great. Do you have any thoughts on how this could be improved?

In TGAN, this problem was much less pronounced, and the curves looked as follows. In preprocessing, I think the only difference is using 4 x std instead of 2 x std. Apart from the architecture being different, I can't immediately think of a reason for this behaviour.

Unstable Output

  • Python version: 3.7.3
  • TensorFlow version: 1.14.0

Hi! I am intrigued to work with CTGAN as the purpose of CTGAN exactly matches my goal.

My dataset has 1846 columns (all continuous), with each column following a different distribution and the columns exhibiting correlations with each other.

I want to expand my dataset (3000 -> 20,000+ samples) so that the newly generated dataset follows the same distributions as the original and in turn also preserves the correlations among the columns.

I am facing the following issues, implementing CTGAN on my dataset:

  1. It only runs with fewer than 50 columns; with more columns than that, it generates output containing only NaNs.
  2. The outputs are not reproducible/stable. By this I mean I get a result for 50 or fewer columns in one run, and the next time I run the model with the same data I get output that is only NaNs.
  3. In the documentation there is a flag for the model path, but the model doesn't get saved. This also makes sense, because I couldn't find any source code using that flag.

I will be grateful if I am given some leads regarding the issues I am facing.

Thanks and kind regards,
Nabaruna

Need help

  • CTGAN version: v0.2.1 - 2020-01-27
  • Python version: 3.7
  • Operating System: Windows 8.1

Description

I need to generate tabular data similar to the input. The input data has two columns, "Company" and "Dept".
CTGAN is generating data randomly, but I need it to respect the structure: for example, company1 has 4 unique departments.
Whenever CTGAN generates a row with company1, the Department column should be filled with one of those 4 unique departments.

Please help me if there is a way to solve this.


Easy solution for restoring original dtypes

  • CTGAN version: 2.0.1
  • Python version: 3.7
  • Operating System: MacOS

Description

After having sampled a dataset, we (@oregonpillow and I) encountered the fact that all numerical columns are converted to floats. However, we can simply restore the original dtype after sampling.

What I Did

data_dtype = original_df.dtypes.values
for i in range(len(sampled_df.columns)):
    sampled_df[sampled_df.columns[i]] = sampled_df[sampled_df.columns[i]].astype(data_dtype[i])
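A more concise equivalent sketch, assuming the sampled frame has the same columns as the original:

# Cast every sampled column back to the dtype of the matching original column.
sampled_df = sampled_df.astype(original_df.dtypes.to_dict())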

Question

Is this something we could consider implementing?

Validity of single samples

I'm using CTGAN to generate a synthetic population of travel records. Some of my columns are deterministically correlated, for example, column A + 60 x column B = column C.

After training, the model does not capture these correlations within single samples. This means for a single sample: column A + 60 x column B =/= column C. However, the generated population consisting of many samples captures the correlation as avg(column A) + 60 x avg(column B) = avg(column C).

As I need single samples to be valid, I was wondering if there are parameters in the code that allow accounting for more correlation within single samples. Of course, for deterministic correlations this does not make sense, as they are easy to generate manually afterwards, but there are other correlations in my data for which it does make sense (e.g. age and years of driving experience).

Negative losses

  • CTGAN version: 0.2.1
  • Python version: 3.8
  • Operating System: MacOS

Description

While fitting and training, I get negative losses for Generator & Discriminator. What do negative losses imply? Shouldn't they be >=0?

What I Did

Running a simple fit and train on the adult dataset

Applying CTGAN to a single column?

Hi there

I’m wondering if there is a way to use CTGAN on a single column? For example, if I have a dataframe with 100 columns but only want to generate data for one column by leveraging the relationships in the other columns?

Not working with Discrete_columns containing integers

  • CTGAN version: 0.2
  • Python version: 3.7
  • Operating System: Mac Catalina 10.15

Description

The definition of discrete columns on the homepage states that discrete columns can indeed be integers or strings. However, in practice I have not found the CTGANSynthesizer to work with discrete_columns that contain integers.

What I Did

Using the Census demo dataset, I looked at how many unique values there are for each column.

age 73
workclass 9
fnlwgt 21648
education 16
education-num 16
marital-status 7
occupation 15
relationship 6
race 5
sex 2
capital-gain 119
capital-loss 92
hours-per-week 94
native-country 42
income 2

With the exception of 'fnlwgt', which is clearly continuous, it seems odd to me that integer columns like education-num, hours-per-week, capital-loss, capital-gain and even age are not added to discrete_columns too. As a very general rule, if a column contains less than, say, 5% unique values, I'd say it's pretty likely to be discrete in most cases.

Regardless, if I list any integer column within discrete_columns I get errors.
For example, if I add 'education-num' to the discrete_columns list I get this error:

ValueError: could not convert string to float: ' Never-married'

This is strange, since the error is not associated with 'education-num', which I just added, but with the 'marital-status' column.

Are there any examples of CTGAN working with discrete integer columns?
It seems that the demo definition of discrete is any column containing strings.
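One workaround to try (an untested sketch, not a confirmed fix): cast integer-coded categorical columns to strings before fitting, then cast the sampled values back afterwards.

# Hypothetical list of integer-coded categorical columns from the census data;
# `discrete_columns` is assumed to be a list of string-typed categorical columns.
int_discrete = ['education-num']

data[int_discrete] = data[int_discrete].astype(str)

ctgan = CTGANSynthesizer()
ctgan.fit(data, discrete_columns + int_discrete)

samples = ctgan.sample(1000)
samples[int_discrete] = samples[int_discrete].astype(int)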


Feature request: Joint/3D tabular data

Hi there,

On the way to shaping my own CGAN I stumbled upon your application, which is indeed quite meaty and impressive. As you're somewhat ahead of my application, I just wanted to ask whether it could be extended with a certain feature:

Right now the generation is 2D, so each line is regarded as a single, independent output. In my dataset, the data is structured as (object, year, features). It describes the development of an object's features during a certain time frame (e.g. 20 years); some of them are static (like size=1.89, 1.89, 1.89, 1.89, ...), some subject to change over time (like age: 1, 2, 3, 4, ... or weight=80, 78.4, 77.2, 74.2, ...). A GAN would thus ideally (and maybe quite similarly to image generation) produce (1, 20, features) outputs that reflect the connectedness of the data within the year frame. I'm not certain how to implement this neatly within your code right now, but it would be highly appreciated ;-).

Regards,
Tobias

Reproducibility

  • CTGAN version: 0.2.1
  • Python version: 3.5
  • Operating System: Linux

Description

If I run CTGAN twice with every setting the same and fit it on the same dataset, I get different points when sampling and also different losses. How can I make it consistent? I tried setting the torch random seed, but that didn't work.

What I Did

from ctgan import CTGANSynthesizer
from sklearn.datasets import make_blobs

X = make_blobs()
ft = X[0]

ctgan = CTGANSynthesizer()
ctgan.fit(ft, epochs=10)
s1 = ctgan.sample(100)

ctgan1 = CTGANSynthesizer()
ctgan1.fit(ft, epochs=10)
s2 = ctgan1.sample(100)

I want s1 and s2 to be the same.
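For reference, a full seeding sketch (no guarantee that it makes CTGAN deterministic, especially on GPU) would seed Python, NumPy and PyTorch before each run:

import random

import numpy as np
import torch

def set_seed(seed=0):
    """Seed every RNG that CTGAN's stack may draw from."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(0)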

simple script for deploying ctgan onto a server - looking for feedback

Hey guys,
huge fan of the project. I recently deployed ctgan onto a server and wrote some simple scripts to make the deployment easier. It features a simple CLI to prompt the user through the process of creating synthetic data.

if you're interested please check out: https://github.com/oregonpillow/ctgan-server-cli

I'm very new to programming and data science and would really welcome any advice and feedback. So please have mercy on my noob programming / methods of implementation :)
Any advice / constructive feedback would be really appreciated.

-Tim

Any way to fix one or more categorical variable value during the data generating stage?

Thank you for the great works!

In the case where I want to fix one or two categorical columns' values before generating the data, what would be the ideal way to do it? If it is not a current feature, would you consider supporting it in the future?

For example, instead of generating 1000 random samples for the Adult Census Dataset, I want to generate 1000 samples with Income <= 50k.

Any suggestion will be much appreciated!

PR for modular transformer

Description

I refactored the transformer class to be more modular for my own work. Would you guys be interested in a PR?
It now has 1 main Transformer (DataTransformer) that uses other more specific transformers for continuous values, discrete values and possibly other values later on (I added dates, for example). These transformers all have abstract methods for fit, transform and inverse_transform.

It is a bit like what you had in TGAN, but more modular. You can swap out every component with custom transformers.

Let me know. It's quite a bit of work to rewrite it against your latest dev branch, so I wanted to check first. :)

To give you some idea, my DataTransformer is currently like this:


from typing import Tuple, Union

import numpy as np
import pandas as pd

# NOTE: Transformer, OneHotTransformer, GMMTransformer, DateTransformer and the
# CONTINUOUS / DISCRETE / DATE constants are custom classes and constants from this fork.


class DataTransformer(object):
    """Data Transformer.

    Flexible transformer class, that uses specific classes for transforming discrete,
    continuous and date data.
    """

    def __init__(self,
                 n_clusters: int = 10,
                 epsilon: float = 0.005,
                 continuous_transformer: Transformer = None,
                 discrete_transformer: Transformer = None,
                 date_transformer: Transformer = None
                 ):
        """ DataTransformer Init

        Args:
            n_clusters (int, optional): Number of modes. Defaults to 10.
            epsilon (float, optional): Epsilon value for Bayesian Gaussian Mixture Model. Will be ignored if weight of a mode is < epsilon.
                Defaults to 0.005.
            continuous_transformer (Transformer, optional): The continuous transformer that will be used. Defaults to None.
            discrete_transformer (Transformer, optional): The discrete transformer that will be used. Defaults to None.
            date_transformer (Transformer, optional): The date transformer that will be used. Defaults to None.
        """
        self.n_clusters = n_clusters
        self.epsilon = epsilon
        self.discrete_transformer = OneHotTransformer() if discrete_transformer is None else discrete_transformer
        self.continuous_transformer = GMMTransformer(self.n_clusters, self.epsilon, model='gmm') if continuous_transformer is None \
            else continuous_transformer
        self.date_transformer = DateTransformer() if date_transformer is None else date_transformer
        self.output_info = []
        self.output_dimensions = 0

    def fit(self, data: pd.DataFrame, discrete_columns: Tuple = tuple(), date_columns: Tuple = tuple()):
        # Remember whether the input was a DataFrame (used by inverse_transform).
        self.dataframe = isinstance(data, pd.DataFrame)
        if not self.dataframe:
            data = pd.DataFrame(data)

        self.dtypes = data.infer_objects().dtypes
        self.meta = []
        for idx, column in enumerate(data.columns):
            column_data = data[[column]].values
            if column in discrete_columns:
                meta = self.discrete_transformer.fit(column, column_data)
            elif column in date_columns:
                meta = self.date_transformer.fit(column, column_data)
            else:
                meta = self.continuous_transformer.fit(column, column_data)

            self.output_info += meta['output_info']
            self.output_dimensions += meta['output_dimensions']
            self.meta.append(meta)

    def transform(self, data: pd.DataFrame) -> np.ndarray:
        if not isinstance(data, pd.DataFrame):
            data = pd.DataFrame(data)

        values = []
        for idx, meta in enumerate(self.meta):
            column_data = data[[meta['name']]].values

            if meta['datatype'] == CONTINUOUS:
                values += self.continuous_transformer.transform(meta, column_data)
            elif meta['datatype'] == DISCRETE:
                values.append(self.discrete_transformer.transform(meta, column_data))
            elif meta['datatype'] == DATE:
                values.append(self.date_transformer.transform(meta, column_data))
            else:
                raise ValueError(f'datatype must be continuous, date or discrete, but is `{meta["datatype"]}`')
        return np.concatenate(values, axis=1).astype(float)

    def fit_transform(self, data, categorical_columns=tuple(), date_columns=tuple()):
        self.fit(data, discrete_columns=categorical_columns, date_columns=date_columns)
        return self.transform(data)

    def inverse_transform(self, data: np.ndarray, sigmas: np.ndarray = None) -> Union[np.ndarray, pd.DataFrame]:
        start = 0
        output = []
        column_names = []
        for meta in self.meta:
            dimensions = meta['output_dimensions']
            columns_data = data[:, start:start + dimensions]

            if meta['datatype'] == CONTINUOUS:
                sigma = sigmas[start] if sigmas is not None else None
                inverted = self.continuous_transformer.inverse_transform(meta, columns_data, sigma)
            elif meta['datatype'] == DISCRETE:
                inverted = self.discrete_transformer.inverse_transform(meta, columns_data)
            elif meta['datatype'] == DATE:
                inverted = self.date_transformer.inverse_transform(meta, columns_data)
            else:
                raise ValueError(f'datatype must be continuous or discrete, but is `{meta["datatype"]}`')

            output.append(inverted)
            column_names.append(meta['name'])
            start += dimensions

        output = {colname: values.reshape(-1) for colname, values in zip(column_names, output)}
        output = pd.DataFrame(output, columns=column_names).astype(self.dtypes)

        if not self.dataframe:
            output = output.values

        return output
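For context, a transformer of this shape would be used roughly like this (a sketch that assumes the custom transformer classes referenced above are available; df, 'workclass' and 'signup_date' are hypothetical):

# Fit, transform into the numeric representation, then invert back to a DataFrame.
transformer = DataTransformer()
matrix = transformer.fit_transform(df, categorical_columns=('workclass',), date_columns=('signup_date',))
restored = transformer.inverse_transform(matrix)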

Implement model_dir parameter use

  • CTGAN version: stable
  • Python version: 3.7.5
  • Operating System: Ubuntu 18.04

Description

When specifying the model_dir parameter, no model is saved. This is because the parameter is never checked within the CLI to perform a save operation. It's a pretty straightforward fix to write a method that does this when the CLI is run with the param. I think I will try to create an MR for it this week.

What I Did

Ran the CLI with example data and the model_dir parameter.

mkdir examples/test_model
python3 -m ctgan.cli --data examples/adult.dat --meta examples/adult.meta --model_dir examples/test_model

Always running into nan losses

  • CTGAN version: 0.2.1
  • Python version: 3.7.7
  • Operating System: Ubuntu 14.04

Description

I'm using the sample code provided in the README to generate sample adult census data. When training the GAN, I run into nan losses every time after the 1st~5th epoch.


CUDA out of memory

I have been trying to run a dataset that is ~100k rows with ~40 columns through the synthesizer but am getting

CUDA out of memory. Tried to allocate 2.52 GiB (GPU 0; 11.17 GiB total capacity; 9.79 GiB already allocated; 706.81 MiB free; 10.19 GiB reserved in total by PyTorch)

when I run

ctgan = CTGANSynthesizer(batch_size=50)
ctgan.fit(data, discrete_cols, epochs=3, log_frequency=True)

I have reduced the batch size from a default of 500 to 50 and am still getting the above. Is it required to use a very very small batch size here with a large dataset?

I am able to run 10-20k rows just fine but would like to synthesize all available data. Any pointers around running large datasets?

Time Series

Hi, do you perhaps know of, or have you yourselves developed, a generative model for multivariate time series? Thanks for the package!

Documentation: Guidance about picking hyperparameters

  • CTGAN version: v0.2.1
  • Python version: 3.7.7
  • Operating System: OSX 10.14.6

Description

I am working with a new tabular dataset where it is difficult to evaluate the results.

It has about 2500 rows and 40 columns.

It is difficult to know how to evaluate the model or what hyperparameters to use. Can you provide some guidance in the documentation? For example, is minimizing the loss of G and D the goal? Or what evaluation methods should be used?

Handling NaNs and datatypes not preserved

  • CTGAN version: '0.2.2.dev0'
  • Python version: 3.6
  • Operating System: macOS

Description

  1. fit throws an error if a categorical column has any NaN
  2. In sampling, the returned datatypes are not preserved; on the Adult data, ints are returned as categorical

What I Did

ctgan = CTGANSynthesizer()
ctgan.fit(data, discrete_columns, epochs=50)

Code implementations of the VGM that aims to estimate the number of modes of the continuous column

It's said in the CTGAN paper that the number of the modes of the continuous column is estimated with variational Gaussian mixtures model (VGM).

However, after going through the code, I could not find the corresponding implementation of the variational Gaussian mixture model that estimates the number of modes. It seems that the number of modes of continuous columns is set to 10 by default, according to the following snippet.

    def __init__(self, n_clusters=10, epsilon=0.005):
        self.n_clusters = n_clusters
        self.epsilon = epsilon

    @ignore_warnings(category=ConvergenceWarning)
    def _fit_continuous(self, column, data):
        gm = BayesianGaussianMixture(
            self.n_clusters,
            weight_concentration_prior_type='dirichlet_process',
            weight_concentration_prior=0.001,
            n_init=1
        )

Or have I missed some other lines that estimate the number of modes?

It would be appreciated if you could resolve my doubts.

Link to R package

I put together an R package to help broaden the reach of y'all's excellent work at https://github.com/kasaai/ctgan, which also powers a short (insurance) industry-specific application paper. Wanted to see if you'd like to link to the package in the README for the useRs that stumble upon the repo :)

@leix28 @csala

How to treat missing continuous data in data set

  • CTGAN version:0.2.2
  • Python version: 3.6.8
  • Operating System: Tensorflow Docker

Description

Hi, I'm curious how to treat missing continuous values in a training dataset. Using a placeholder like the one used in the demo file ('?') for missing discrete values won't work. Any suggestions on how to deal with this issue?
Kind regards!

Can't install CTGAN

  • CTGAN version: latest
  • Python version: 3.7.7
  • Operating System: Windows 10 home

Description

When trying to install ctgan, I get the following error:

ERROR: Could not find a version that satisfies the requirement torch<2,>=1.0 (from ctgan) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch<2,>=1.0 (from ctgan)

What I Did

pip install ctgan

I also banged my fist on the desk...

Not able to fit() CTGANSynthesizer() if NaNs are present in the dataset

  • CTGAN version:
  • Python version:
  • Operating System:

