
sdv-dev / ctgan

1.1K stars · 21 watchers · 271 forks · 1.8 MB

Conditional GAN for generating synthetic tabular data.

License: Other

Languages: Python 94.56%, Makefile 5.44%
Topics: synthetic-data, generative-adversarial-network, tabular-data, data-generation, synthetic-data-generation

ctgan's People

Contributors

amontanez24, baukebrenninkmeijer, csala, deathn0t, fealho, frances-h, jdtheripperpc, katxiao, kevinykuo, leix28, lurosenb, matheusccouto, mfhbree, npatki, oregonpillow, pvk-developer, r-palazzo, sdv-team, tejuafonja, timvink


ctgan's Issues

Targeted sample generation

Instead of random sample generation, is it possible to create samples based on predefined inputs (discrete or continuous)? For example, with the adult dataset, if I wanted my sample to be of age X and from country Y, could the GAN generate the rest of the characteristics?
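One workaround while there is no built-in conditional sampling is rejection sampling: oversample from the fitted model and keep only the rows that match the desired values. A minimal sketch, assuming a fitted synthesizer named ctgan whose sample() returns a DataFrame; the condition values are hypothetical and this is not a CTGAN feature.

import pandas as pd

def sample_with_conditions(ctgan, conditions, n_rows, batch_size=10000, max_tries=50):
    """Oversample from the fitted synthesizer and keep rows matching `conditions`."""
    kept = []
    for _ in range(max_tries):
        batch = ctgan.sample(batch_size)
        mask = pd.Series(True, index=batch.index)
        for column, value in conditions.items():
            mask &= batch[column] == value
        kept.append(batch[mask])
        if sum(len(part) for part in kept) >= n_rows:
            break
    return pd.concat(kept).head(n_rows)

# Hypothetical usage on the adult dataset:
# subset = sample_with_conditions(ctgan, {'native-country': ' United-States'}, 1000)

Rejection sampling is wasteful when the conditioned values are rare, but it needs no change to the model itself.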

Without Discrete columns the Loss increase strongly

Description

I have fitted the synthesizer on the same dataset several times, making small changes.
In every run, all epochs had losses (Loss G & Loss D) below |4|; none of the maxima passed that threshold, and by epoch 300 I usually get less than 0.8.
But when I train the model without the only discrete column in my dataset, both losses increase sharply. Loss G peaked at -95 in epoch 43. On the other hand, Loss D peaked at -2333 in epoch 37 and scored -9 in epoch 300.

What I Did

I created a discrete column with only one label, and the losses decreased back to the normal range.
Do you know what could have happened? Why do the losses increase that way when I remove the only discrete column?


Check discrete_columns valid before fitting

As @csala mentioned in #24 it would be good to check that discrete_columns list is valid at the beginning of fitting instead of silently ignoring invalid columns then throwing an error later in the fitting process.

What I Did

Would something similar to this at the beginning of the fit function work?

import sys

# Initialize the flag once so a later valid column does not reset it.
col_error = False
for col in discrete_columns:
    if col not in data.columns:
        print("*discrete_columns error*")
        print(col + " not found in data")
        col_error = True
if col_error:
    sys.exit()
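An alternative sketch (not the maintainers' implementation) that collects all offending columns and raises instead of printing and exiting:

invalid_columns = [col for col in discrete_columns if col not in data.columns]
if invalid_columns:
    raise ValueError(f'discrete_columns not found in data: {invalid_columns}')

Raising a ValueError keeps the check usable from library code, where calling sys.exit() would be surprising.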

Reorganize the project structure

A few changes can be made to improve the overall code maintainability:

  • Modularize the code, separating code dedicated to specific topics into different modules.
  • Split some parts of the code into smaller, single-purpose methods.
  • Rename the Cond class to ConditionalGenerator.
  • Make random_choice_prob_index a method of the ConditionalGenerator class.
  • Make calc_gradient_penalty a method of the Discriminator class.
  • Make cond_loss a CTGANSynthesizer method.
  • Make a _build_models method where all the internal models are built. See #4
  • Separate the fitting process in two methods:
    • One with categorical variables, which uses the cond vector
    • Another one without categorical variables.
  • Rename some variables to be a bit more verbose
  • Replace plain asserts with more verbose, type-specific exceptions
  • Add docstrings to all the public methods
  • Remake cli.py as __main__.py and use argparse instead of absl
  • Move epochs arg to fit method. See #5

The new project structure should look as follows:

ctgan
├── __init__.py
├── __main__.py: New CLI module
├── conditional.py: ConditionalGenerator
├── data.py: Methods to write and read data
├── models.py: Discriminator, Generator and Residual classes
├── sampler.py: Sampler class
├── synthesizer.py: CTGANSynthesizer class
└── transformer.py: DataTransformer class

Differential privacy

Does this implementation guarantee differential privacy? This is mentioned in the paper as possible with CTGAN.

Is synthetic data always anonymous?

One of the major use cases for synthetic tabular data is creating anonymous data for others to work with.

I was wondering whether, by chance during sampling, a sampled DF could contain rows similar enough to some of the original DF rows to allow someone or something to be identified?

E.g. For a given DF containing columns: age , education, race, test-score, .....etc

Perhaps age, education, race, test-score are the only fields required to identify someone. It is therefore important that your sampled data does not contain a combination of age, education, race, test-score that is also present in the original DF.

I understand that this check can easily be performed outside of CTGAN, but in terms of how CTGAN works, is it possible that CTGAN generates personally identifiable information by chance during sampling?
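For reference, such an outside-of-CTGAN check could look like the following sketch, which counts sampled rows whose quasi-identifier combination also appears in the original DF. The column names and the frame names original_df / sampled_df are hypothetical.

import pandas as pd

quasi_identifiers = ['age', 'education', 'race', 'test-score']  # hypothetical columns

# Inner-join the sampled rows against the de-duplicated original combinations.
overlap = sampled_df.merge(
    original_df[quasi_identifiers].drop_duplicates(),
    on=quasi_identifiers,
    how='inner',
)
print(f'{len(overlap)} sampled rows share a quasi-identifier combination with the original data')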

Gaussian mixture params/training sample size

Any tips on optimizing performance/training time for the Bayesian Gaussian mixture training phase? Could we consider exposing the parameters and perhaps include sampling the training set? This piece doesn't seem to scale well to bigger datasets.
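One possible direction, sketched below assuming scikit-learn is used directly (this is not CTGAN's internal code): fit the Bayesian Gaussian mixture on a random subsample of each continuous column, which is often enough to locate the modes.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_bgm_on_sample(values, n_clusters=10, max_rows=10000, seed=0):
    """Fit a Bayesian GMM on at most `max_rows` randomly chosen values."""
    values = np.asarray(values).reshape(-1, 1)
    rng = np.random.default_rng(seed)
    if len(values) > max_rows:
        values = values[rng.choice(len(values), max_rows, replace=False)]
    bgm = BayesianGaussianMixture(
        n_components=n_clusters,
        weight_concentration_prior_type='dirichlet_process',
        weight_concentration_prior=0.001,
        n_init=1,
    )
    return bgm.fit(values)

Exposing max_rows (and n_clusters) as user-facing parameters would let large datasets trade a little mode-estimation accuracy for a much shorter preprocessing phase.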

Saving a CTGAN model

  • CTGAN version:
  • Python version:
  • Operating System:

Description

I would like to save a trained CTGAN model to disk. I did this in the past with TGAN, using the tgan.save command.

What I Did

It seems it's not possible to save models

Thanks for your help
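One workaround until a save/load API exists is to pickle the fitted synthesizer object. A minimal sketch (not an official CTGAN feature; models trained on CUDA may need extra care, e.g. moving tensors to CPU first):

import pickle

# `ctgan` is assumed to be an already fitted CTGANSynthesizer instance
with open('ctgan_model.pkl', 'wb') as f:
    pickle.dump(ctgan, f)

with open('ctgan_model.pkl', 'rb') as f:
    loaded = pickle.load(f)

samples = loaded.sample(1000)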

Doubt with Likelihood fitness metric

Hi,

I am trying to understand the evaluation metrics and framework. You have mentioned that L_test will detect mode collapse. How does that happen?
Also, it would be great if you could elaborate on: "But this metric introduces the prior knowledge of the structure of S_0 which is not necessarily encoded in T_syn."

Thanks

Implementing an early stopping parameter

  • CTGAN version: 0.2.1
  • Python version: Python 3.7
  • Operating System: Google Colab

Description

There doesn't seem to be any clear convergence with increasing epochs after a while. Past approximately 30 epochs in the example below, the generator consistently doesn't improve much, and the discriminator doesn't show any clear improvement trend either.

Would it be reasonable to implement an early stopping parameter?

What I Did

Using the ctgan demo data setup:

Epoch 1, Loss G: 1.7703, Loss D: -0.4140
Epoch 2, Loss G: 0.9132, Loss D: 0.1742
Epoch 3, Loss G: 0.3882, Loss D: 0.1333
Epoch 4, Loss G: -0.6055, Loss D: 0.0968
Epoch 5, Loss G: -0.6484, Loss D: 0.0696
Epoch 6, Loss G: -1.0512, Loss D: -0.1672
Epoch 7, Loss G: -1.2606, Loss D: 0.0917
Epoch 8, Loss G: -1.3633, Loss D: 0.1726
Epoch 9, Loss G: -1.5499, Loss D: -0.2989
Epoch 10, Loss G: -1.7539, Loss D: -0.0215
Epoch 11, Loss G: -1.9969, Loss D: 0.1057
Epoch 12, Loss G: -1.5544, Loss D: -0.1978
Epoch 13, Loss G: -1.7372, Loss D: 0.0144
Epoch 14, Loss G: -1.7781, Loss D: -0.0129
Epoch 15, Loss G: -1.7676, Loss D: -0.0315
Epoch 16, Loss G: -1.5205, Loss D: -0.0209
Epoch 17, Loss G: -1.4795, Loss D: -0.0970
Epoch 18, Loss G: -1.2830, Loss D: -0.1672
Epoch 19, Loss G: -1.4841, Loss D: -0.1042
Epoch 20, Loss G: -0.9479, Loss D: -0.0559
Epoch 21, Loss G: -1.1393, Loss D: -0.2292
Epoch 22, Loss G: -1.1153, Loss D: -0.0840
Epoch 23, Loss G: -1.1070, Loss D: -0.2443
Epoch 24, Loss G: -1.1734, Loss D: -0.1479
Epoch 25, Loss G: -1.0624, Loss D: -0.0729
Epoch 26, Loss G: -1.0356, Loss D: -0.0566
Epoch 27, Loss G: -0.8900, Loss D: -0.2267
Epoch 28, Loss G: -0.6987, Loss D: -0.1971
Epoch 29, Loss G: -0.7723, Loss D: -0.1828
Epoch 30, Loss G: -0.8304, Loss D: -0.0157
Epoch 31, Loss G: -0.8023, Loss D: -0.0770
Epoch 32, Loss G: -0.5623, Loss D: -0.1800
Epoch 33, Loss G: -0.4977, Loss D: -0.2467
Epoch 34, Loss G: -0.3344, Loss D: -0.2764
Epoch 35, Loss G: -0.4504, Loss D: -0.2840
Epoch 36, Loss G: -0.5648, Loss D: -0.0704
Epoch 37, Loss G: -0.4736, Loss D: -0.1817
Epoch 38, Loss G: -0.5624, Loss D: -0.3123
Epoch 39, Loss G: -0.4873, Loss D: -0.3349
Epoch 40, Loss G: -0.4981, Loss D: -0.3594
Epoch 41, Loss G: -0.5530, Loss D: -0.2552
Epoch 42, Loss G: -0.8578, Loss D: -0.0993
Epoch 43, Loss G: -0.8845, Loss D: -0.0118
Epoch 44, Loss G: -0.5242, Loss D: -0.5416
Epoch 45, Loss G: -0.7406, Loss D: -0.0499
Epoch 46, Loss G: -0.5033, Loss D: -0.1821
Epoch 47, Loss G: -0.5747, Loss D: -0.0890
Epoch 48, Loss G: -0.4117, Loss D: -0.1436
Epoch 49, Loss G: -0.3480, Loss D: -0.3295
Epoch 50, Loss G: -0.1915, Loss D: -0.3795
Epoch 51, Loss G: -0.1754, Loss D: -0.2486
Epoch 52, Loss G: -0.0958, Loss D: -0.5495
Epoch 53, Loss G: -0.1818, Loss D: -0.3954
Epoch 54, Loss G: -0.4569, Loss D: -0.3138
Epoch 55, Loss G: -0.5812, Loss D: -0.2214
Epoch 56, Loss G: -0.3854, Loss D: -0.3159
Epoch 57, Loss G: -0.5512, Loss D: -0.4838
Epoch 58, Loss G: -0.1160, Loss D: -0.3924
Epoch 59, Loss G: -0.5880, Loss D: -0.4459
Epoch 60, Loss G: -0.2418, Loss D: -0.2919
Epoch 61, Loss G: -0.3057, Loss D: 0.0781
Epoch 62, Loss G: -0.3516, Loss D: -0.1905
Epoch 63, Loss G: -0.4931, Loss D: -0.1141
Epoch 64, Loss G: -0.3498, Loss D: -0.0987
Epoch 65, Loss G: -0.3793, Loss D: -0.2559
Epoch 66, Loss G: -0.3495, Loss D: -0.3343
Epoch 67, Loss G: -0.4555, Loss D: -0.2773
Epoch 68, Loss G: -0.2690, Loss D: -0.2480
Epoch 69, Loss G: -0.3576, Loss D: -0.2565
Epoch 70, Loss G: -0.4245, Loss D: -0.2531
Epoch 71, Loss G: -0.3180, Loss D: -0.2390
Epoch 72, Loss G: -0.3671, Loss D: -0.2645
Epoch 73, Loss G: -0.4187, Loss D: 0.0591
Epoch 74, Loss G: -0.4713, Loss D: -0.0732
Epoch 75, Loss G: -0.2109, Loss D: 0.1250
Epoch 76, Loss G: -0.6413, Loss D: 0.0148
Epoch 77, Loss G: -0.3381, Loss D: -0.0976
Epoch 78, Loss G: -0.3368, Loss D: -0.1375
Epoch 79, Loss G: -0.2787, Loss D: -0.1055
Epoch 80, Loss G: -0.5503, Loss D: -0.1662
Epoch 81, Loss G: -0.3985, Loss D: -0.0212
Epoch 82, Loss G: -0.5057, Loss D: -0.3701
Epoch 83, Loss G: -0.4147, Loss D: -0.1540
Epoch 84, Loss G: -0.7346, Loss D: -0.3526
Epoch 85, Loss G: -0.6238, Loss D: -0.3375
Epoch 86, Loss G: -0.5685, Loss D: -0.3796
Epoch 87, Loss G: -0.3179, Loss D: -0.1281
Epoch 88, Loss G: -0.4464, Loss D: -0.3147
Epoch 89, Loss G: -0.3798, Loss D: -0.1411
Epoch 90, Loss G: -0.4874, Loss D: -0.1330
Epoch 91, Loss G: -0.3701, Loss D: -0.2761
Epoch 92, Loss G: -0.4438, Loss D: -0.2787
Epoch 93, Loss G: -0.3701, Loss D: -0.3833
Epoch 94, Loss G: -0.3261, Loss D: -0.2285
Epoch 95, Loss G: -0.4223, Loss D: -0.1745
Epoch 96, Loss G: -0.0369, Loss D: -0.4372
Epoch 97, Loss G: -0.2421, Loss D: -0.2245
Epoch 98, Loss G: -0.4262, Loss D: -0.3120
Epoch 99, Loss G: -0.1879, Loss D: -0.2658
Epoch 100, Loss G: 0.1434, Loss D: -0.2949
Epoch 101, Loss G: 0.0634, Loss D: -0.2145
Epoch 102, Loss G: -0.3514, Loss D: -0.2244
Epoch 103, Loss G: -0.1506, Loss D: -0.1028
Epoch 104, Loss G: -0.1472, Loss D: -0.2615
Epoch 105, Loss G: -0.6987, Loss D: -0.2609
Epoch 106, Loss G: -0.3679, Loss D: -0.3388
Epoch 107, Loss G: -0.3039, Loss D: -0.1347
Epoch 108, Loss G: -0.1479, Loss D: -0.3870
Epoch 109, Loss G: -0.5099, Loss D: -0.3273
Epoch 110, Loss G: -0.7379, Loss D: 0.0718
Epoch 111, Loss G: -0.3503, Loss D: -0.1897
Epoch 112, Loss G: -0.8023, Loss D: -0.2082
Epoch 113, Loss G: -1.1229, Loss D: -0.2555
Epoch 114, Loss G: -0.7507, Loss D: -0.3576
Epoch 115, Loss G: -0.5787, Loss D: -0.1911
Epoch 116, Loss G: -0.5933, Loss D: -0.2165
Epoch 117, Loss G: -0.4494, Loss D: -0.2001
Epoch 118, Loss G: -0.0438, Loss D: -0.0300
Epoch 119, Loss G: -0.5589, Loss D: -0.3625
Epoch 120, Loss G: -0.5553, Loss D: -0.1573
Epoch 121, Loss G: -0.4438, Loss D: -0.2237
Epoch 122, Loss G: -0.5670, Loss D: -0.2057
Epoch 123, Loss G: -0.5754, Loss D: -0.4514
Epoch 124, Loss G: -0.3274, Loss D: 0.0209
Epoch 125, Loss G: -0.3591, Loss D: -0.1435
Epoch 126, Loss G: -0.2459, Loss D: -0.2918
Epoch 127, Loss G: -0.3238, Loss D: -0.6921
Epoch 128, Loss G: -0.2784, Loss D: -0.4857
Epoch 129, Loss G: 0.0843, Loss D: -0.1080
Epoch 130, Loss G: -0.2665, Loss D: -0.1162
Epoch 131, Loss G: -0.2545, Loss D: -0.0934
Epoch 132, Loss G: -0.0762, Loss D: -0.1961
Epoch 133, Loss G: -0.1379, Loss D: -0.3160
Epoch 134, Loss G: -0.0483, Loss D: -0.1415
Epoch 135, Loss G: -0.0881, Loss D: -0.2957
Epoch 136, Loss G: -0.0513, Loss D: -0.3575
Epoch 137, Loss G: -0.5542, Loss D: 0.0639
Epoch 138, Loss G: -0.2930, Loss D: 0.1514
Epoch 139, Loss G: -0.3267, Loss D: -0.4442
Epoch 140, Loss G: -0.4462, Loss D: -0.2140
Epoch 141, Loss G: -0.7850, Loss D: 0.1541
Epoch 142, Loss G: -0.6869, Loss D: -0.1997
Epoch 143, Loss G: -0.5455, Loss D: 0.0698
Epoch 144, Loss G: -0.9430, Loss D: -0.2578
Epoch 145, Loss G: -1.0881, Loss D: -0.0794
Epoch 146, Loss G: -1.0721, Loss D: -0.1857
Epoch 147, Loss G: -0.9823, Loss D: -0.1852
Epoch 148, Loss G: -0.6183, Loss D: -0.1901
Epoch 149, Loss G: -0.5895, Loss D: -0.4376
Epoch 150, Loss G: -0.2892, Loss D: -0.4731
Epoch 151, Loss G: -0.4931, Loss D: -0.1163
Epoch 152, Loss G: -0.6272, Loss D: -0.2558
Epoch 153, Loss G: -0.6970, Loss D: -0.1470
Epoch 154, Loss G: -0.4659, Loss D: -0.0424
Epoch 155, Loss G: -0.8373, Loss D: -0.1084
Epoch 156, Loss G: -0.8139, Loss D: -0.1349
Epoch 157, Loss G: -0.6395, Loss D: 0.0300
Epoch 158, Loss G: -0.6846, Loss D: -0.0564
Epoch 159, Loss G: -0.7239, Loss D: -0.2507
Epoch 160, Loss G: -0.6853, Loss D: -0.1488
Epoch 161, Loss G: -0.4689, Loss D: -0.0547
Epoch 162, Loss G: -0.4398, Loss D: -0.4672
Epoch 163, Loss G: -0.3415, Loss D: -0.2588
Epoch 164, Loss G: -0.6197, Loss D: -0.4692
Epoch 165, Loss G: -0.7124, Loss D: 0.3233
Epoch 166, Loss G: -0.3810, Loss D: -0.2331
Epoch 167, Loss G: -0.2038, Loss D: -0.6168
Epoch 168, Loss G: -0.4305, Loss D: -0.4648
Epoch 169, Loss G: 0.2388, Loss D: 0.0948
Epoch 170, Loss G: -0.5747, Loss D: -0.3976
Epoch 171, Loss G: -0.6325, Loss D: -0.2826
Epoch 172, Loss G: -0.6744, Loss D: 0.0090
Epoch 173, Loss G: -0.5990, Loss D: -0.3130
Epoch 174, Loss G: -0.0799, Loss D: -0.0647
Epoch 175, Loss G: -0.1088, Loss D: -0.3034
Epoch 176, Loss G: -0.5236, Loss D: -0.1425
Epoch 177, Loss G: -0.8875, Loss D: -0.1507
Epoch 178, Loss G: -0.8301, Loss D: -0.4523
Epoch 179, Loss G: -0.9683, Loss D: -0.1435
Epoch 180, Loss G: -0.5027, Loss D: -0.1874
Epoch 181, Loss G: 0.0402, Loss D: -0.5144
Epoch 182, Loss G: -0.3731, Loss D: 0.0271
Epoch 183, Loss G: -0.4859, Loss D: 0.2857
Epoch 184, Loss G: -0.6029, Loss D: 0.1006
Epoch 185, Loss G: -0.4714, Loss D: -0.0931
Epoch 186, Loss G: -0.4526, Loss D: -0.3047
Epoch 187, Loss G: -0.6337, Loss D: -0.5780
Epoch 188, Loss G: -0.4979, Loss D: -0.3884
Epoch 189, Loss G: -0.4898, Loss D: -0.1129
Epoch 190, Loss G: -0.3288, Loss D: -0.4173
Epoch 191, Loss G: -0.5815, Loss D: 0.0365
Epoch 192, Loss G: -0.5618, Loss D: -0.2749
Epoch 193, Loss G: -0.4344, Loss D: -0.2513
Epoch 194, Loss G: -0.4296, Loss D: -0.1379
Epoch 195, Loss G: -0.5141, Loss D: -0.1683
Epoch 196, Loss G: -0.3886, Loss D: 0.0902
Epoch 197, Loss G: -0.1185, Loss D: -0.3041
Epoch 198, Loss G: -0.4368, Loss D: -0.2042
Epoch 199, Loss G: -0.8212, Loss D: -0.2131
Epoch 200, Loss G: -0.7284, Loss D: -0.1206
Epoch 201, Loss G: -0.3844, Loss D: -0.2710
Epoch 202, Loss G: -0.2199, Loss D: -0.2340
Epoch 203, Loss G: -0.4231, Loss D: -0.3787
Epoch 204, Loss G: -0.0590, Loss D: -0.2787
Epoch 205, Loss G: -0.4943, Loss D: -0.2964
Epoch 206, Loss G: -0.5960, Loss D: -0.2904
Epoch 207, Loss G: -0.3736, Loss D: -0.4725
Epoch 208, Loss G: -0.5367, Loss D: -0.0355
Epoch 209, Loss G: -0.5343, Loss D: -0.2065
Epoch 210, Loss G: -0.2958, Loss D: -0.2376
Epoch 211, Loss G: -0.3486, Loss D: -0.2305
Epoch 212, Loss G: -0.0698, Loss D: -0.0447
Epoch 213, Loss G: -0.4837, Loss D: -0.1511
Epoch 214, Loss G: -0.3648, Loss D: -0.2404
Epoch 215, Loss G: -0.3385, Loss D: -0.4328
Epoch 216, Loss G: -0.6249, Loss D: 0.0591
Epoch 217, Loss G: -0.4153, Loss D: -0.1000
Epoch 218, Loss G: -0.1442, Loss D: -0.3652
Epoch 219, Loss G: -0.2455, Loss D: 0.0505
Epoch 220, Loss G: -0.5413, Loss D: -0.2170
Epoch 221, Loss G: -0.6011, Loss D: 0.2106
Epoch 222, Loss G: -0.3802, Loss D: -0.2623
Epoch 223, Loss G: -0.4969, Loss D: -0.1041
Epoch 224, Loss G: -0.6534, Loss D: -0.0594
Epoch 225, Loss G: -0.5426, Loss D: -0.4582
Epoch 226, Loss G: -0.2616, Loss D: -0.1595
Epoch 227, Loss G: -0.3934, Loss D: 0.0174
Epoch 228, Loss G: -0.2554, Loss D: 0.0515
Epoch 229, Loss G: -0.3462, Loss D: -0.2309
Epoch 230, Loss G: -0.6162, Loss D: 0.0820
Epoch 231, Loss G: -0.7277, Loss D: -0.0866
Epoch 232, Loss G: -0.5345, Loss D: -0.1886
Epoch 233, Loss G: -0.3045, Loss D: 0.0544
Epoch 234, Loss G: -0.3265, Loss D: 0.0773
Epoch 235, Loss G: -0.4100, Loss D: 0.0844
Epoch 236, Loss G: -0.4308, Loss D: 0.0168
Epoch 237, Loss G: -0.6521, Loss D: -0.0632
Epoch 238, Loss G: -0.5340, Loss D: -0.0240
Epoch 239, Loss G: -0.4905, Loss D: -0.2091
Epoch 240, Loss G: -0.4170, Loss D: -0.0334
Epoch 241, Loss G: -0.5199, Loss D: 0.1028
Epoch 242, Loss G: -0.4939, Loss D: -0.0611
Epoch 243, Loss G: -0.8483, Loss D: -0.0376
Epoch 244, Loss G: -0.7946, Loss D: 0.0255
Epoch 245, Loss G: -0.5445, Loss D: 0.1557
Epoch 246, Loss G: -0.2708, Loss D: -0.1850
Epoch 247, Loss G: -0.4394, Loss D: -0.1037
Epoch 248, Loss G: -0.3529, Loss D: -0.1705
Epoch 249, Loss G: -0.3365, Loss D: 0.0226
Epoch 250, Loss G: -0.4843, Loss D: 0.0969
Epoch 251, Loss G: -0.4460, Loss D: 0.1088
Epoch 252, Loss G: -0.5106, Loss D: -0.1222
Epoch 253, Loss G: -0.6710, Loss D: 0.1131
Epoch 254, Loss G: -0.6829, Loss D: -0.1389
Epoch 255, Loss G: -0.3559, Loss D: -0.2418
Epoch 256, Loss G: -0.6636, Loss D: -0.1503
Epoch 257, Loss G: -0.5845, Loss D: -0.0170
Epoch 258, Loss G: -0.9466, Loss D: -0.0344
Epoch 259, Loss G: -0.7826, Loss D: 0.0345
Epoch 260, Loss G: -0.8233, Loss D: 0.0200
Epoch 261, Loss G: -0.8138, Loss D: -0.1103
Epoch 262, Loss G: -0.7675, Loss D: -0.1771
Epoch 263, Loss G: -0.6528, Loss D: 0.0845
Epoch 264, Loss G: -0.7947, Loss D: -0.0701
Epoch 265, Loss G: -1.0287, Loss D: -0.0283
Epoch 266, Loss G: -0.5619, Loss D: -0.1113
Epoch 267, Loss G: -0.4039, Loss D: -0.0434
Epoch 268, Loss G: -0.6310, Loss D: -0.1573
Epoch 269, Loss G: -0.8943, Loss D: -0.3958
Epoch 270, Loss G: -0.8277, Loss D: -0.1480
Epoch 271, Loss G: -0.9839, Loss D: -0.1470
Epoch 272, Loss G: -0.4073, Loss D: -0.3034
Epoch 273, Loss G: -0.3445, Loss D: -0.0324
Epoch 274, Loss G: -0.2810, Loss D: -0.2098
Epoch 275, Loss G: -0.2327, Loss D: 0.0027
Epoch 276, Loss G: -0.3266, Loss D: -0.1317
Epoch 277, Loss G: -0.7164, Loss D: -0.0576
Epoch 278, Loss G: -0.9443, Loss D: -0.0400
Epoch 279, Loss G: -0.9377, Loss D: -0.0443
Epoch 280, Loss G: -0.6678, Loss D: -0.2004
Epoch 281, Loss G: -0.8338, Loss D: 0.1530
Epoch 282, Loss G: -0.6346, Loss D: 0.0073
Epoch 283, Loss G: -0.5432, Loss D: -0.2862
Epoch 284, Loss G: -0.4596, Loss D: -0.1362
Epoch 285, Loss G: -0.4565, Loss D: -0.0109
Epoch 286, Loss G: -0.5484, Loss D: -0.0134
Epoch 287, Loss G: -0.9109, Loss D: -0.3872
Epoch 288, Loss G: -0.5126, Loss D: -0.0519
Epoch 289, Loss G: -0.5753, Loss D: -0.2742
Epoch 290, Loss G: -0.4334, Loss D: -0.0615
Epoch 291, Loss G: -0.3641, Loss D: -0.1875
Epoch 292, Loss G: -0.1422, Loss D: -0.2141
Epoch 293, Loss G: -0.7433, Loss D: -0.2379
Epoch 294, Loss G: -0.6039, Loss D: -0.2734
Epoch 295, Loss G: -0.6171, Loss D: -0.0339
Epoch 296, Loss G: -0.8910, Loss D: -0.1048
Epoch 297, Loss G: -1.0306, Loss D: -0.2785
Epoch 298, Loss G: -0.8741, Loss D: 0.0340
Epoch 299, Loss G: -0.6015, Loss D: -0.2684
Epoch 300, Loss G: -0.7981, Loss D: -0.0788
time: 13min 36s
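For what it's worth, a generic patience-based tracker like the sketch below is what such a parameter might wrap; note that GAN losses are noisy and not a reliable convergence signal, so this is only a mechanical illustration, not part of CTGAN:

class EarlyStopping:
    """Stop when the tracked loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.counter = 0

    def step(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience  # True means "stop training"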

What is the number of training epochs?

  • CTGAN version: latest
  • Python version: 3.7.7
  • Operating System: Windows 10

Description

Not so much an issue but more of a question. What is the default number of training epochs if I don't specify the number?

What I Did


import pandas as pd

from ctgan import CTGANSynthesizer

# STEP 1: Load data
data = pd.read_csv('D:/test/Machine Learning/FULLDATA.csv')

discrete_columns = list(data.columns)  # marks every column as discrete

# STEP 2: Fit CTGAN to your data
# Create an instance of the CTGANSynthesizer class and fit it, passing your data
# and the list of discrete columns.
ctgan = CTGANSynthesizer()
ctgan.fit(data, discrete_columns)

# Create synthetic data for x number of rows
samples = ctgan.sample(1000)

# Save the synthetic database to csv
samples.to_csv(r'D:/test/Machine Learning/syntheticdatabase.csv')


Gaussian approximation of continuous variables clearly visible in non-Gaussian/non-multimodal data

Columns where the continuous data is distributed in a way that is hard to approximate with Gaussians (e.g. dates that increase in frequency) and follows a line are not well approximated by the GMM. I've not used the BGMT that much, because it is much slower, but if this does not occur there, please correct me. However, using a GMM, the following pattern occurs. The plots show the cumulative distribution.

You can clearly see the several Gaussians that are fit to the curve, resulting in a fit that is not horrible but definitely not great. Do you have any thoughts on how this could be improved?

In TGAN, this problem was much less pronounced, and the curves looked as follows. In preprocessing, I think the only difference is using 4 x std instead of 2 x std. Apart from the architecture being different, I can't immediately think of a reason for this behaviour.

Unstable Output

  • Python version: 3.7.3
  • TensorFlow version: 1.14.0

Hi! I am intrigued to work with CTGAN as the purpose of CTGAN exactly matches my goal.

My dataset has 1846 columns (all continuous), with each column following a different distribution and the columns exhibiting correlations with each other.

I want to expand my dataset (3000 -> 20,000+ samples) so that the newly generated dataset follows the same distributions as the original and in turn also preserves the correlations among the columns.

I am facing the following issues, implementing CTGAN on my dataset:

  1. It only runs with fewer than 50 columns; with more columns than that, it generates output containing only NaNs.
  2. The outputs are not reproducible/stable. By this I mean I get a result for 50 or fewer columns in one run, and the next time I run the model with the same data I get output that is only NaNs.
  3. In the documentation there is a flag for the model path, but the model doesn't get saved. This also makes sense, because I couldn't find any source code using that flag.

I will be grateful if I am given some leads regarding the issues I am facing.

Thanks and kind regards,
Nabaruna

Need help

  • CTGAN version: v0.2.1 - 2020-01-27
  • Python version: 3.7
  • Operating System: Windows 8.1

Description

I need to generate tabular data similar to the input. The input data has two columns, "Company" and "Dept".
CTGAN is generating data randomly, but I need it to respect the structure: for example, company1 has 4 unique departments.
Whenever CTGAN generates a row with company1, the Department column should be filled with one of those 4 unique departments.

Please help me if there is a way to solve this.


Easy solution for restoring original dtypes

  • CTGAN version: 2.0.1
  • Python version: 3.7
  • Operating System: MacOS

Description

After having sampled a dataset, we (@oregonpillow and I) encountered the fact that all numerical columns are converted to floats. However, we can simply restore the original dtype after sampling.

What I Did

data_dtype = original_df.dtypes.values
for i in range(len(sampled_df.columns)):
    sampled_df[sampled_df.columns[i]] = sampled_df[sampled_df.columns[i]].astype(data_dtype[i])
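A more concise equivalent sketch, assuming the sampled frame has the same columns as the original:

# Cast every sampled column back to the dtype of the matching original column.
sampled_df = sampled_df.astype(original_df.dtypes.to_dict())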

Question

Is this something we could consider implementing?

Validity of single samples

I'm using CTGAN to generate a synthetic population of travel records. Some of my columns are deterministically correlated, for example, column A + 60 x column B = column C.

After training, the model does not capture these correlations within single samples. This means for a single sample: column A + 60 x column B =/= column C. However, the generated population consisting of many samples captures the correlation as avg(column A) + 60 x avg(column B) = avg(column C).

As I need single samples to be valid, I was wondering if there are parameters in the code that allow accounting for more correlation within single samples. Of course, for deterministic correlations this does not make sense, as they are easy to generate manually afterwards, but there are other correlations in my data for which it does make sense (e.g. age and years of driving experience).

Negative losses

  • CTGAN version: 0.2.1
  • Python version: 3.8
  • Operating System: MacOS

Description

While fitting and training, I get negative losses for Generator & Discriminator. What do negative losses imply? Shouldn't they be >=0?

What I Did

Running a simple fit and train on the adult dataset

Applying CTGAN to a single column?

Hi there

I’m wondering if there is a way to use CTGAN on a single column? For example, if I have a dataframe with 100 columns but only want to generate data for one column by leveraging the relationships in the other columns?

Not working with Discrete_columns containing integers

  • CTGAN version: 0.2
  • Python version: 3.7
  • Operating System: Mac Catalina 10.15

Description

The definition of discrete columns on the homepage states that discrete columns can indeed be integers or strings. However, in practice I have not found the CTGANSynthesizer to work with discrete_columns that contain integers.

What I Did

Using the Census demo dataset, I looked at how many unique values there are for each column.

age 73
workclass 9
fnlwgt 21648
education 16
education-num 16
marital-status 7
occupation 15
relationship 6
race 5
sex 2
capital-gain 119
capital-loss 92
hours-per-week 94
native-country 42
income 2

With the exception of 'fnlwgt', which is clearly continuous, it seems odd to me that integer columns like education-num, hours-per-week, capital-loss, capital-gain and even age are not added to discrete_columns too. As a very general rule, if a column contains less than, say, 5% unique values, I'd say it's pretty likely to be discrete in most cases.

Regardless, if I list any integer column within discrete_columns I get errors.
For example, if I add 'education-num' to the discrete_columns list I get this error:

ValueError: could not convert string to float: ' Never-married'

This is strange, since the error is not associated with 'education-num', which I just added, but with the 'marital-status' column.

Are there any examples of CTGAN working with discrete integer columns?
It seems that the demo definition of discrete is any column containing strings.
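One workaround to try (an untested sketch, not a confirmed fix): cast integer-coded categorical columns to strings before fitting, then cast the sampled values back afterwards.

# Hypothetical list of integer-coded categorical columns from the census data;
# `discrete_columns` is assumed to be a list of string-typed categorical columns.
int_discrete = ['education-num']

data[int_discrete] = data[int_discrete].astype(str)

ctgan = CTGANSynthesizer()
ctgan.fit(data, discrete_columns + int_discrete)

samples = ctgan.sample(1000)
samples[int_discrete] = samples[int_discrete].astype(int)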


Feature request: Joint/3D tabular data

Hi there,

On the way to shaping my own CGAN I stumbled upon your application, which is indeed quite meaty and impressive. As you're somewhat ahead of my application, I just wanted to ask whether it could be extended with a certain feature:

Right now the generation is 2D, so each line is regarded as a single, independent output. In my dataset, the data is structured as (object, year, features). It describes the development of an object's features during a certain time frame (e.g. 20 years); some of them are static (like size=1.89, 1.89, 1.89, 1.89, ...), some subject to change over time (like age: 1, 2, 3, 4, ... or weight=80, 78.4, 77.2, 74.2, ...). A GAN would thus ideally (and maybe quite similarly to image generation) produce (1, 20, features) outputs that reflect the connectedness of the data within the year frame. I'm not certain how to implement this neatly within your code right now, but it would be highly appreciated ;-).

Regards,
Tobias

Reproducibility

  • CTGAN version: 0.2.1
  • Python version: 3.5
  • Operating System: Linux

Description

If I run CTGAN twice with every setting the same and fit it on the same dataset, I get different points when sampling and also different losses. How can I make it consistent? I tried setting the torch random seed, but that didn't work.

What I Did

from ctgan import CTGANSynthesizer
from sklearn.datasets import make_blobs

X = make_blobs()
ft = X[0]

ctgan = CTGANSynthesizer()
ctgan.fit(ft, epochs=10)
s1 = ctgan.sample(100)

ctgan1 = CTGANSynthesizer()
ctgan1.fit(ft, epochs=10)
s2 = ctgan1.sample(100)

I want s1 and s2 to be the same.
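For reference, a full seeding sketch (no guarantee that it makes CTGAN deterministic, especially on GPU) would seed Python, NumPy and PyTorch before each run:

import random

import numpy as np
import torch

def set_seed(seed=0):
    """Seed every RNG that CTGAN's stack may draw from."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(0)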

simple script for deploying ctgan onto a server - looking for feedback

Hey guys,
huge fan of the project. I recently deployed ctgan onto a server and wrote some simple scripts to make the deployment easier. It features a simple CLI to prompt the user through the process of creating synthetic data.

if you're interested please check out: https://github.com/oregonpillow/ctgan-server-cli

I'm very new to programming and data science and would really welcome any advice and feedback. So please have mercy on my noob programming / methods of implementation :)
Any advice / constructive feedback would be really appreciated.

-Tim

Any way to fix one or more categorical variable value during the data generating stage?

Thank you for the great works!

In the case where I want to fix one or two categorical columns' values before generating the data, what would be the ideal way to do it? If it is not a current feature, would you consider supporting it in the future?

For example, instead of generating 1000 random samples for the Adult Census Dataset, I want to generate 1000 samples with Income <= 50k.

Any suggestion will be much appreciated!

PR for modular transformer

Description

I refactored the transformer class to be more modular for my own work. Would you guys be interested in a PR?
It now has 1 main Transformer (DataTransformer) that uses other more specific transformers for continuous values, discrete values and possibly other values later on (I added dates, for example). These transformers all have abstract methods for fit, transform and inverse_transform.

It is a bit like what you had in TGAN, but more modular. You can swap out every component with custom transformers.

Let me know. It's quite a bit of work to rewrite it against your latest dev branch, so I wanted to check first. :)

To give you some idea, my DataTransformer is currently like this:


from typing import Tuple, Union

import numpy as np
import pandas as pd

# NOTE: Transformer, OneHotTransformer, GMMTransformer, DateTransformer and the
# CONTINUOUS / DISCRETE / DATE constants are custom classes and constants from this fork.


class DataTransformer(object):
    """Data Transformer.

    Flexible transformer class, that uses specific classes for transforming discrete,
    continuous and date data.
    """

    def __init__(self,
                 n_clusters: int = 10,
                 epsilon: float = 0.005,
                 continuous_transformer: Transformer = None,
                 discrete_transformer: Transformer = None,
                 date_transformer: Transformer = None
                 ):
        """ DataTransformer Init

        Args:
            n_clusters (int, optional): Number of modes. Defaults to 10.
            epsilon (float, optional): Epsilon value for Bayesian Gaussian Mixture Model. Will be ignored if weight of a mode is < epsilon.
                Defaults to 0.005.
            continuous_transformer (Transformer, optional): The continuous transformer that will be used. Defaults to None.
            discrete_transformer (Transformer, optional): The discrete transformer that will be used. Defaults to None.
            date_transformer (Transformer, optional): The date transformer that will be used. Defaults to None.
        """
        self.n_clusters = n_clusters
        self.epsilon = epsilon
        self.discrete_transformer = OneHotTransformer() if discrete_transformer is None else discrete_transformer
        self.continuous_transformer = GMMTransformer(self.n_clusters, self.epsilon, model='gmm') if continuous_transformer is None \
            else continuous_transformer
        self.date_transformer = DateTransformer() if date_transformer is None else date_transformer
        self.output_info = []
        self.output_dimensions = 0

    def fit(self, data: pd.DataFrame, discrete_columns: Tuple = tuple(), date_columns: Tuple = tuple()):
        # Remember whether the input was a DataFrame (used by inverse_transform).
        self.dataframe = isinstance(data, pd.DataFrame)
        if not self.dataframe:
            data = pd.DataFrame(data)

        self.dtypes = data.infer_objects().dtypes
        self.meta = []
        for idx, column in enumerate(data.columns):
            column_data = data[[column]].values
            if column in discrete_columns:
                meta = self.discrete_transformer.fit(column, column_data)
            elif column in date_columns:
                meta = self.date_transformer.fit(column, column_data)
            else:
                meta = self.continuous_transformer.fit(column, column_data)

            self.output_info += meta['output_info']
            self.output_dimensions += meta['output_dimensions']
            self.meta.append(meta)

    def transform(self, data: pd.DataFrame) -> np.ndarray:
        if not isinstance(data, pd.DataFrame):
            data = pd.DataFrame(data)

        values = []
        for idx, meta in enumerate(self.meta):
            column_data = data[[meta['name']]].values

            if meta['datatype'] == CONTINUOUS:
                values += self.continuous_transformer.transform(meta, column_data)
            elif meta['datatype'] == DISCRETE:
                values.append(self.discrete_transformer.transform(meta, column_data))
            elif meta['datatype'] == DATE:
                values.append(self.date_transformer.transform(meta, column_data))
            else:
                raise ValueError(f'datatype must be continuous, date or discrete, but is `{meta["datatype"]}`')
        return np.concatenate(values, axis=1).astype(float)

    def fit_transform(self, data, categorical_columns=tuple(), date_columns=tuple()):
        self.fit(data, discrete_columns=categorical_columns, date_columns=date_columns)
        return self.transform(data)

    def inverse_transform(self, data: np.ndarray, sigmas: np.ndarray = None) -> Union[np.ndarray, pd.DataFrame]:
        start = 0
        output = []
        column_names = []
        for meta in self.meta:
            dimensions = meta['output_dimensions']
            columns_data = data[:, start:start + dimensions]

            if meta['datatype'] == CONTINUOUS:
                sigma = sigmas[start] if sigmas is not None else None
                inverted = self.continuous_transformer.inverse_transform(meta, columns_data, sigma)
            elif meta['datatype'] == DISCRETE:
                inverted = self.discrete_transformer.inverse_transform(meta, columns_data)
            elif meta['datatype'] == DATE:
                inverted = self.date_transformer.inverse_transform(meta, columns_data)
            else:
                raise ValueError(f'datatype must be continuous or discrete, but is `{meta["datatype"]}`')

            output.append(inverted)
            column_names.append(meta['name'])
            start += dimensions

        output = {colname: values.reshape(-1) for colname, values in zip(column_names, output)}
        output = pd.DataFrame(output, columns=column_names).astype(self.dtypes)

        if not self.dataframe:
            output = output.values

        return output
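For context, a transformer of this shape would be used roughly like this (a sketch that assumes the custom transformer classes referenced above are available; df, 'workclass' and 'signup_date' are hypothetical):

# Fit, transform into the numeric representation, then invert back to a DataFrame.
transformer = DataTransformer()
matrix = transformer.fit_transform(df, categorical_columns=('workclass',), date_columns=('signup_date',))
restored = transformer.inverse_transform(matrix)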

Implement model_dir parameter use

  • CTGAN version: stable
  • Python version: 3.7.5
  • Operating System: Ubuntu 18.04

Description

When specifying the model_dir parameter, no model is saved. This is because the parameter is never checked within the CLI to perform a save operation. It's a pretty straightforward fix to write a method that does this when the CLI is run with the param. I think I will try to create an MR for it this week.

What I Did

Ran the CLI with example data and the model_dir parameter.

mkdir examples/test_model
python3 -m ctgan.cli --data examples/adult.dat --meta examples/adult.meta --model_dir examples/test_model

Always running into nan losses

  • CTGAN version: 0.2.1
  • Python version: 3.7.7
  • Operating System: Ubuntu 14.04

Description

I'm using the sample code provided in the README to generate sample adult census data. When training the GAN, I run into nan losses every time after the 1st~5th epoch.


CUDA out of memory

I have been trying to run a dataset that is ~100k rows with ~40 columns through the synthesizer but am getting

CUDA out of memory. Tried to allocate 2.52 GiB (GPU 0; 11.17 GiB total capacity; 9.79 GiB already allocated; 706.81 MiB free; 10.19 GiB reserved in total by PyTorch)

when I run

ctgan = CTGANSynthesizer(batch_size=50)
ctgan.fit(data, discrete_cols, epochs=3, log_frequency=True)

I have reduced the batch size from a default of 500 to 50 and am still getting the above. Is it required to use a very very small batch size here with a large dataset?

I am able to run 10-20k rows just fine but would like to synthesize all available data. Any pointers around running large datasets?

Time Series

Hi, do you perhaps know of, or have you yourselves developed, a generative model for multivariate time series? Thanks for the package!

Documentation: Guidance about picking hyperparameters

  • CTGAN version: v0.2.1
  • Python version: 3.7.7
  • Operating System: OSX 10.14.6

Description

I am working with a new tabular dataset where it is difficult to evaluate the results.

It has about 2500 rows and 40 columns.

It is difficult to know how to evaluate the model or what hyperparameters to use. Can you provide some guidance in the documentation? For example, is minimizing the loss of G and D the goal? Or what evaluation methods should be used?

Handling NaNs and datatypes not preserved

  • CTGAN version: '0.2.2.dev0'
  • Python version: 3.6
  • Operating System: macOS

Description

  1. fit throws an error if a categorical column has any NaN
  2. In sampling, the returned datatypes are not preserved; on the Adult data, ints are returned as categorical

What I Did

ctgan = CTGANSynthesizer()
ctgan.fit(data, discrete_columns, epochs=50)

Code implementations of the VGM that aims to estimate the number of modes of the continuous column

It's said in the CTGAN paper that the number of the modes of the continuous column is estimated with variational Gaussian mixtures model (VGM).

However, after going through the code, I could not find the corresponding implementation of the variational Gaussian mixture model that estimates the number of modes. It seems that the number of modes of continuous columns is set to 10 by default, according to the following snippet.

    def __init__(self, n_clusters=10, epsilon=0.005):
        self.n_clusters = n_clusters
        self.epsilon = epsilon

    @ignore_warnings(category=ConvergenceWarning)
    def _fit_continuous(self, column, data):
        gm = BayesianGaussianMixture(
            self.n_clusters,
            weight_concentration_prior_type='dirichlet_process',
            weight_concentration_prior=0.001,
            n_init=1
        )

Or have I missed some other lines that estimate the number of modes?

It would be appreciated if you could resolve my doubts.

Link to R package

I put together an R package to help broaden the reach of y'all's excellent work at https://github.com/kasaai/ctgan, which also powers a short (insurance) industry-specific application paper. Wanted to see if you'd like to link to the package in the README for the useRs that stumble upon the repo :)

@leix28 @csala

How to treat missing continuous data in data set

  • CTGAN version:0.2.2
  • Python version: 3.6.8
  • Operating System: Tensorflow Docker

Description

Hi, I'm curious how to treat missing continuous values in a training dataset. Using a placeholder like the one used in the demo file ('?') for missing discrete values won't work. Any suggestions on how to deal with this issue?
Kind regards!

Can't install CTGAN

  • CTGAN version: latest
  • Python version: 3.7.7
  • Operating System: Windows 10 home

Description

When trying to install ctgan, I get the following error:

ERROR: Could not find a version that satisfies the requirement torch<2,>=1.0 (from ctgan) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch<2,>=1.0 (from ctgan)

What I Did

pip install ctgan

I also banged my fist on the desk...

Not able to fit() CTGANSynthesizer() if NaNs are present in the dataset

  • CTGAN version:
  • Python version:
  • Operating System:

