worldbank / realtabformer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.

Home Page: https://worldbank.github.io/REaLTabFormer/

License: MIT License

Makefile 0.27% Shell 0.16% Python 42.27% Jupyter Notebook 57.30%

data-generation deep-learning seq2seq-model synthetic-data synthetic-dataset-generation tabular-data transformers gpt gpt-2

realtabformer's Introduction

REaLTabFormer

The REaLTabFormer (Realistic Relational and Tabular Data using Transformers) offers a unified framework for synthesizing tabular data of different types. A sequence-to-sequence (Seq2Seq) model is used for generating synthetic relational datasets. The REaLTabFormer model for non-relational tabular data uses GPT-2 and can be used out of the box to model any tabular data with independent observations.

REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers
Paper on ArXiv

Installation

REaLTabFormer is available on PyPI and can be installed with pip (Python version >= 3.7):

pip install realtabformer

Usage

We show examples of using the REaLTabFormer for modeling and generating synthetic data from a trained model.

REaLTabFormer for regular tabular data

# pip install realtabformer
import pandas as pd
from realtabformer import REaLTabFormer

df = pd.read_csv("foo.csv")

# NOTE: Remove any unique identifiers in the
# data that you don't want to be modeled.

# Non-relational or parent table.
rtf_model = REaLTabFormer(
    model_type="tabular",
    gradient_accumulation_steps=4,
    logging_steps=100)

# Fit the model on the dataset.
# Additional parameters can be
# passed to the `.fit` method.
rtf_model.fit(df)

# Save the model to the current directory.
# A new directory `rtf_model/` will be created.
# In it, a directory with the model's
# experiment id `idXXXX` will also be created
# where the artefacts of the model will be stored.
rtf_model.save("rtf_model/")

# Generate synthetic data with the same
# number of observations as the real dataset.
samples = rtf_model.sample(n_samples=len(df))

# Load the saved model. The directory to the
# experiment must be provided.
rtf_model2 = REaLTabFormer.load_from_dir(
    path="rtf_model/idXXXX")

REaLTabFormer for relational data

# pip install realtabformer
import os
import pandas as pd
from pathlib import Path
from realtabformer import REaLTabFormer

parent_df = pd.read_csv("foo.csv")
child_df = pd.read_csv("bar.csv")
join_on = "unique_id"

# Make sure that the key columns in both the
# parent and the child table have the same name.
assert ((join_on in parent_df.columns) and
        (join_on in child_df.columns))

# Non-relational or parent table. Don't include the
# unique_id field.
parent_model = REaLTabFormer(model_type="tabular")
parent_model.fit(parent_df.drop(join_on, axis=1))

pdir = Path("rtf_parent/")
parent_model.save(pdir)

# # Get the most recently saved parent model,
# # or specify some other saved model.
# parent_model_path = pdir / "idXXX"
parent_model_path = sorted([
    p for p in pdir.glob("id*") if p.is_dir()],
    key=os.path.getmtime)[-1]

child_model = REaLTabFormer(
    model_type="relational",
    parent_realtabformer_path=parent_model_path,
    output_max_length=None,
    train_size=0.8)

child_model.fit(
    df=child_df,
    in_df=parent_df,
    join_on=join_on)

# Generate parent samples.
parent_samples = parent_model.sample(len(parent_df))

# Create the unique ids based on the index.
parent_samples.index.name = join_on
parent_samples = parent_samples.reset_index()

# Generate the relational observations.
child_samples = child_model.sample(
    input_unique_ids=parent_samples[join_on],
    input_df=parent_samples.drop(join_on, axis=1),
    gen_batch=64)

Validators for synthetic samples

The REaLTabFormer framework provides an interface to easily build observation validators for filtering invalid synthetic samples. We show an example of using the GeoValidator below. The chart on the left shows the distribution of the generated latitude and longitude without validation. The chart on the right shows the synthetic samples with observations that have been validated using the GeoValidator with the California boundary. Even though the model was not trained optimally for this task, invalid samples (those falling outside the boundary) are scarce in the data generated without a validator.

# !pip install geopandas &> /dev/null
# !pip install realtabformer &> /dev/null
# !git clone https://github.com/joncutrer/geopandas-tutorial.git &> /dev/null
import geopandas
import seaborn as sns
import matplotlib.pyplot as plt
from realtabformer import REaLTabFormer
from realtabformer import rtf_validators as rtf_val
from shapely.geometry import Polygon, LineString, Point, MultiPolygon
from sklearn.datasets import fetch_california_housing


def plot_sf(data, samples, title=None):
    xlims = (-126, -113.5)
    ylims = (31, 43)
    bins = (50, 50)

    dd = samples.copy()
    pp = dd.loc[
        dd["Longitude"].between(data["Longitude"].min(), data["Longitude"].max()) &
        dd["Latitude"].between(data["Latitude"].min(), data["Latitude"].max())
    ]

    g = sns.JointGrid(data=pp, x="Longitude", y="Latitude", marginal_ticks=True)
    g.plot_joint(
        sns.histplot,
        bins=bins,
    )

    states[states['NAME'] == 'California'].boundary.plot(ax=g.ax_joint)
    g.ax_joint.set_xlim(*xlims)
    g.ax_joint.set_ylim(*ylims)

    g.plot_marginals(sns.histplot, element="step", color="#03012d")

    if title:
        g.ax_joint.set_title(title)

    plt.tight_layout()

# Get geographic files
states = geopandas.read_file('geopandas-tutorial/data/usa-states-census-2014.shp')
states = states.to_crs("EPSG:4326")  # GPS Projection

# Get the California housing dataset
data = fetch_california_housing(as_frame=True).frame

# We create a model with small epochs for the demo, default is 200.
rtf_model = REaLTabFormer(
    model_type="tabular",
    batch_size=64,
    epochs=10,
    gradient_accumulation_steps=4,
    logging_steps=100)

# Fit the specified model. We also reduce the num_bootstrap, default is 500.
rtf_model.fit(data, num_bootstrap=10)

# Save the trained model
rtf_model.save("rtf_model/")

# Sample raw data without validator
samples_raw = rtf_model.sample(n_samples=10240, gen_batch=512)

# Sample data with the geographic validator
obs_validator = rtf_val.ObservationValidator()
obs_validator.add_validator(
    "geo_validator",
    rtf_val.GeoValidator(
        MultiPolygon(states[states['NAME'] == 'California'].geometry[0])),
    ("Longitude", "Latitude")
)

samples_validated = rtf_model.sample(
    n_samples=10240, gen_batch=512,
    validator=obs_validator,
)

# Visualize the samples
plot_sf(data, samples_raw, title="Raw samples")
plot_sf(data, samples_validated, title="Validated samples")

Citation

Please cite our work if you use the REaLTabFormer in your projects or research.

@article{solatorio2023realtabformer,
  title={REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers},
  author={Solatorio, Aivin V. and Dupriez, Olivier},
  journal={arXiv preprint arXiv:2302.02041},
  year={2023}
}

Acknowledgments

We thank the World Bank-UNHCR Joint Data Center on Forced Displacement (JDC) for funding the project "Enhancing Responsible Microdata Access to Improve Policy and Response in Forced Displacement Situations" (KP-P174174-GINP-TF0B5124). A part of the fund went into supporting the development of the REaLTabFormer framework which was used to generate the synthetic population for research on disclosure risk and the mosaic effect.

We also send 🤗 to the HuggingFace 🤗 for all the open-sourced software they release. And to all open-sourced projects, thank you!

realtabformer's Issues

Relational Training: CUDA OOM Out of Memory Error

Hi,

Thanks for developing and releasing this codebase. I'm using it to train on tabular data. I tried both the tabular and the relational formats, but in the relational format I get a CUDA OOM (out of memory) error.

Original Table (Raw data):
The original table has only 8 cols x 10,000 rows (which I have subsampled for testing).
The model in Tabular mode trains perfectly fine and I am able to generate synthetic samples.

Relational Table Format (Parent / Child):
In the relational format the tables have the following statistics:

Parent: 4 cols x 5359 rows
Child: 6 cols x 10,000 rows, where one parent row has ~150 corresponding child rows (at most).

However, in this case:

Parent model in Tabular format trains well, but
Child model in Relational format, with the parent model, fails with error CUDA OOM (Out of Memory Error).

I have tried this on GCP with

NVIDIA T4 x2 (16GB)
NVIDIA L4 x2 (24GB)
NVIDIA A100 (40GB)

I suspect the Relational format Child model fails because it requires both the Parent & Child tables to be loaded into GPU memory. But the dataset is tiny. How can I overcome the OOM error?

Do you have any suggestions?

Problem when fitting dataframe with only categorical features.

@avsolatorio Hi!

This line of code fails when fitting a dataframe that contains only categorical features. Because no numerical or datetime features exist, the list is empty and pd.concat fails.

https://github.com/avsolatorio/REaLTabFormer/blob/311470accc400e4c7fae6fb2d8a7f9c3988b7b19/src/realtabformer/data_utils.py#L495

CPU OOM during tokenization - Tabular format

I have the following issue in training the model in tabular format (30 million samples).

I used REalTabFormer for generating synthetic data in tabular format. The following configuration helped us in training and generating 1M samples. It's working extremely well.

CPU: 16 CORE - 60 GB RAM
GPU: 16 GB Nvidia-T4

However, when I want to use it for large datasets, for example 30M samples, on the same machine, the RAM crashes with OOM error during the tokenization stage itself.

Docker image to run REalTabFormer to support NVIDIA CUDA and GPU

Hi @avsolatorio,

I was wondering if you have an up-to-date example of an NVIDIA Docker image for running a Python application with REaLTabFormer that uses the GPU of the host machine.

Thanks!

How to model a child table with multiple foreign keys?

Hi @avsolatorio !

I have a child table with multiple foreign keys to different parent tables.

Something like Car<-CarStatus, Car<-CarManufacturer relationships.

How can I generate the child table Car? It seems that I have to build many relational models for each relationship but then how should I merge the generated sampled data into one table?

Thanks!

Is it possible to run REalTabFormer on AWS Inferentia and Trainium VM instances?

Hi @avsolatorio,

Recently, I came across AWS Inferentia and AWS Trainium, which AWS claims are 70% cheaper and offer 2.3x higher throughput than GPU-based VMs. I was wondering whether REaLTabFormer can run on those instances as-is, without code changes, and whether it is worth exploring further.

Could you please suggest?

Thanks!

Control transformers verbosity

Add a flag that will set the verbosity of transformers only for errors to reduce clutter in the output.

from transformers import logging as hf_logging
hf_logging.set_verbosity_error()

Parallelization of inference/generation in both tabular and child models.

Hi @avsolatorio,

I want to generate hundreds of thousands of rows using both tabular and relational models. However, it is a bit slow because of the auto-regressive nature of transformers. Currently I generate e.g. 100k rows with the tabular model and then let the child model generate the child rows.

Is there anything I can do to reduce generation time for either the tabular or the child model? We touched on this briefly in #15.
From my understanding, multi-GPU cannot be used for inference the way it can for training. However, instead of running a single parent_model.sample(n_samples=100_000), could I run multiple parent_model.sample() calls in parallel, each on a different CUDA device? For example, split the work into 10 x 10_000 and run 10 rtf_model.sample() calls, each on a different GPU, or at least have a pool of GPUs to draw from.
Can I use a single GPU and run multiple parent_model.sample() calls in parallel threads? This will probably fail because GPU memory will blow up, right?
Is there a special argument I need to pass to parent_model.sample() to support this? Is it the device argument, to specify the CUDA device for each batch?
How can I partition and batch the rows of a child model? Currently the relationship cardinality is learned by the model, so I can't specify it. I can estimate it outside the library (e.g. the number of orders per customer or products per order), and maybe build a model for it. However, is it possible to tell the child model how many child rows to create for each parent? If so, I could pass in that number from my external estimate.
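
One way to frame the chunked approach described above is to plan the work first and then hand each chunk to its own process. The sketch below only computes that plan. `plan_sampling_chunks` and the device names are hypothetical, and whether each chunk can then be passed to `sample(n_samples=..., device=...)` in a separate process is an assumption to check against the library.

```python
def plan_sampling_chunks(n_samples, devices):
    """Split n_samples as evenly as possible across the given devices.

    Returns a list of (device, chunk_size) pairs; devices that would
    receive zero rows are omitted.
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if not devices:
        raise ValueError("at least one device is required")

    base, extra = divmod(n_samples, len(devices))
    plan = []
    for i, device in enumerate(devices):
        # The first `extra` devices take one additional row each.
        size = base + (1 if i < extra else 0)
        if size > 0:
            plan.append((device, size))
    return plan


# Hypothetical device names for a machine with four GPUs.
plan = plan_sampling_chunks(100_000, ["cuda:0", "cuda:1", "cuda:2", "cuda:3"])
for device, size in plan:
    print(device, size)
```

Each (device, size) pair could then drive one worker process that loads the saved model and samples its share, with the resulting frames concatenated afterwards.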

Thanks!

Support for pre-computation, saving, and loading of the sensitivity threshold

One of the current bottlenecks in fitting the non-relational model is the pre-computation of the sensitivity threshold.

A solution is to allow the sensitivity threshold to be pre-computed outside the fit function. One can then specify a file containing the pre-computed value when fitting with the data. The file can be a JSON file that contains the parameters used for computing the sensitivity threshold, together with the result itself.

When the file is passed to the fit function, the function must first check if the parameters used in the pre-computation are consistent with the parameters passed in the fit function. Then, simply load it and skip the computation.
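
A minimal sketch of the save-and-check logic proposed here, assuming the parameters are JSON-serializable. The function names, the file layout, and the `sensitivity_threshold` key are hypothetical and are not part of the library.

```python
import json
from pathlib import Path


def save_threshold(path, params, threshold):
    """Write the parameters and the resulting threshold to a JSON file."""
    payload = {"params": params, "sensitivity_threshold": threshold}
    Path(path).write_text(json.dumps(payload, indent=2, sort_keys=True))


def load_threshold(path, params):
    """Return the cached threshold if the stored parameters match, else None."""
    path = Path(path)
    if not path.exists():
        return None
    payload = json.loads(path.read_text())
    if payload.get("params") != params:
        # Parameters differ, so the cached value cannot be reused.
        return None
    return payload.get("sensitivity_threshold")
```

A fit routine could call `load_threshold` first and fall back to the full computation only when it returns `None`. Note that JSON turns tuples into lists, so parameters should be stored as plain lists, numbers, and strings for the comparison to behave as expected.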

This is an excellent first issue if anyone is interested in contributing! :)

Transaction datetime in the child table is not sequential

The transactions in the generated child table belonging to the same parent join key do not seem to be correctly sorted. The original data used for training is sorted though. So the model is not capturing the time sequential structure of different transactions of the same user in the raw dataset?

pan transactionDateTime
0 2020-07-11 03:05:47
0 2020-06-27 14:31:24
0 2020-06-07 01:06:45
0 2020-06-05 19:23:52
0 2020-02-25 12:20:27
0 2020-05-22 18:58:40
0 2020-06-06 10:15:20
0 2020-04-14 17:35:39
0 2020-03-03 08:51:58
0 2020-05-23 13:47:57
0 2020-04-12 05:12:54
0 2020-04-19 02:23:20

No "unique_id" in the child table

Thank you for the great work!

I am trying the tool on the Airbnb dataset and I managed to generate both parent_samples and child_samples.

However, I noticed that user_id only exists in parent_samples and is missing from child_samples. Is there a way to identify which user_id each row in child_samples belongs to?

Inquiries on fitting parent and child tables

Hi there. I am trying to sample data from two different transaction-record datasets keyed on the user ID. Both datasets contain repeated IDs, so I created a third dataset containing the unique user IDs and set it as the parent data. How can I link the two transaction-record datasets to the parent data while also accounting for the correlation between the two child tables? Should I build a sub-child table that links to the child table, or link both child tables to the parent table? Thanks.

AssertionError: The target length 10 of the data doesn't include the numeric precision at 20. Increase max_len to at least 22.

How to resolve this?

cannot import name 'is_fairscale_available' from 'transformers.integrations'

It seems the huggingface transformers does not support is_fairscale_available() function any more. Which specific version shall we use?

Conditional generation?

Hi I found your work today I think from googling about overfitting and data copying in these kinds of models. There are some very interesting ideas re: DCR and the Q metric that I think are pretty interesting. I have a need to generate data conditionally. I have some labels for medical imaging data and I need to sample using the timestamp and a label as the conditional information.

Could I add the conditional data to the text of the input? Is this a use case which anyone has explored?

thanks

Possible Improvements for CPU inference

Hi, I am currently trying to improve the inference time. However, when generating a batch of 512 samples, inference on the GPU takes twice as long as on the CPU. Any idea why?

child_samples = model.sample(n_samples=512, input_unique_ids=query[self.join_on], input_df=query.drop(self.join_on, axis=1), gen_batch=512,device=self.device)

Note that the model is relational and no frozen encoder is given. Also, if there are any general tips for CPU inference with REaLTabFormer, I am eager to learn them. Thanks for the neat repo. Cheers

Can we treat the method as one of data augmentation?

First, this is an interesting method, and thanks for sharing the code.

Just a question about a detail of the paper.
We all know that data augmentation for tabular regression is hard to implement.

I am wondering whether I can use this method for data augmentation and compare it to SMOGN or other methods that augment tabular regression data. Would that be appropriate? Why or why not?

I am not sure whether this is the right place to talk about the paper, if not I will delete the issue.

Thanks.

Multi-GPU training

I was trying to run training on multiple GPU servers in AWS, but it is not training as expected. Is there a way to enable this?

Possible mix-up of token columns

Hello. During parent model training, in the validation phase where the model generates synthetic data to compare with the real raw data, training is terminated partway through (usually around 45-50 epochs) by a ValueError due to an invalid string representation of a float. For example, I see something like "ValueError: could not convert string to float: '-0.-0.2'". I checked the model vocabulary and everything seems to be fine. For example, column_00 has the tokens "-0.", "01", "-3.", and column_01 has the tokens "123", "532", "324". If, for each processed column, the model only generates tokens specific to that column, then why would we see '-0.-0.2'? It seems that tokens from column_00 are being sampled at the position of column_01. Do you have any idea where in the source code things could go wrong so that this could happen?

ERROR: multiprocess 0.70.15 has requirement dill>=0.3.7, but you'll have dill 0.3.6 which is incompatible.

Hi there!

Currently, the error appears when installing v0.1.2 with pip install.

Is the calculation of num_train_epochs correct?

Hi @avsolatorio,

I am looking at the code of _train_with_sensitivity() and I can't understand why we calculate num_train_epochs in that way

https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/realtabformer.py#L692

Assuming, we run for 100 epochs and n_critic is 5 we are going to have the following pairs of [p_epoch, num_train_epochs]

p_epoch, num_train_epochs
0, 5
5, 10
10, 15
15, 20
...
80, 85
85, 90
90, 95
95, 100

In the following two lines we set the num_train_epochs:

https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/realtabformer.py#L698

https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/realtabformer.py#L705

Is that correct? On the first iteration, where p_epoch=0 and num_train_epochs=5, it is OK to train the model for 5 epochs. But in the next iteration, where p_epoch=5 and num_train_epochs=10, why should we continue training the model for 10 epochs? Shouldn't we just continue training it for 5 more epochs? At the extreme, in the last iteration where p_epoch=95, do we train the model for num_train_epochs=100 epochs?
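
One possible reading, offered as an assumption and not as a statement about the library's internals: if each round resumes from the previous checkpoint, `num_train_epochs` acts as a cumulative target, so the epochs actually added per round are the difference from `p_epoch`. The sketch below only reproduces the arithmetic of the table above under that assumption; `epoch_schedule` is a hypothetical helper.

```python
def epoch_schedule(epochs, n_critic):
    """Yield (p_epoch, num_train_epochs, additional_epochs) per round.

    `num_train_epochs` is treated as a cumulative target, so the number of
    epochs added in a round is its difference from `p_epoch`.
    """
    for p_epoch in range(0, epochs, n_critic):
        num_train_epochs = min(p_epoch + n_critic, epochs)
        yield p_epoch, num_train_epochs, num_train_epochs - p_epoch


rounds = list(epoch_schedule(epochs=100, n_critic=5))
total_epochs = sum(extra for _, _, extra in rounds)
print(rounds[0], rounds[1], rounds[-1])
print(total_epochs)
```

Under this reading the rounds add 5 epochs each and the total stays at 100. If training instead restarted from scratch in every round, the total would be far larger, which is the case the question is worried about.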

Thanks.

Speeding the training on mixed data set - categorical data, numerical and text.

I am trying to train the model on custom data that has several very high-cardinality categorical features (such as City), text features, and numerical features. The data size is small: 380K.

But the training never starts! It has been stuck at this stage for a few hours.

How to improve the training?

Model save only works for Tabular model type.

@avsolatorio Hi!

I have tried to use checkpoints_dir parameter for a relational model and it seems that no checkpoints are saved for this type. Also, in the code I see that only Tabular model type is handled. Is this a bug?

https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/realtabformer.py#L1450

How to model N-M association tables?

Hi @avsolatorio !

I was wondering how someone could model N-M table relationships.

Currently, we can model 1-1 and 1-N relationships with parent (tabular) and child (relational) models. However, is it possible to model N-M association tables?

An example association table could be student_courses_registrations that unites students and courses. A student can attend many courses and a course could be taken by multiple students:

courses ---- 1-N ---> students_courses_registrations <----- M-1 ----- students

Should we create parent tabular models for courses and students tables and then somehow use a child relational model for students_courses_registrations? Is it possible?

I can't find a way to do it. I was wondering if there is way to convert a N-M table to multiple 1-N relationships and then model the relationships separately. Of course later on I will have to revert the process to convert them to a N-M single table when sampling data.

What do you think? Is there a way to do it?

Track the column size and the number of digits in numerical fields for the transformation of `seed_input`

Note the leading zero indicating that the total number of columns is more than 9. But since we are not tracking this, the moment we use the seed_input argument, the transformation only infers from the given data and not from the data used during training.

The same is true in this case. The transformation should note that the hhsize variable has values that may exceed 10, so it should truncate the leading 0 as shown in the second image.

Out of memory exception on tabular model with 25k rows and 37 columns

Hi @avsolatorio,

I have a case with ~25k rows and 37 columns. Mixed data types with categorical, numerical and also some have high cardinality while other low cardinality. Also some columns have large number of NAs.

When I train the tabular model I get a memory error as it needs more than 50GB, also bootstrap threshold estimation is very slow.

Do you have any insights on why this happens or how this can be solved, is there any hyperparameter I can use to solve this?

Thanks!

Is it possible to do iterative training? Load the weight and retrain on new data.

Instead of loading entire training data, can I train the model on part of it, load the weights and retrain the data?

Logistic detection metric

Greetings,

I am a student and I have a strong interest in your work. Currently, I am engrossed in a research endeavor focused on the Anonymization of extensive relational databases through the utilization of synthetic data generation. What particularly captivated me is how seamlessly your work aligns with the objectives of my research.

My current endeavor involves attempting to replicate the outcomes outlined in your published paper. During this process, I observed that you employed Logistic detection as a metric and as depicted in the following image:

However, I encountered difficulty in locating an implementation of this metric, even within SDV (Synthetic Data Vault). Consequently, I find myself uncertain about the efficacy of my manual attempts in reproducing the same results.


import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

def compute_logistic_detection_score(real_data, synthetic_data, n_folds=3):
    # Combine the real and synthetic data
    real_data["orig"] = 0
    synthetic_data["orig"] = 1
    data = pd.concat([real_data.fillna(0), synthetic_data.fillna(0)])
    data = data.reset_index(drop=True).fillna(0)

    # Split the data into features and target
    X = data.drop("orig", axis=1)
    y = data["orig"]

    # Detect categorical variables
    categorical_columns = X.select_dtypes(include="object").columns.tolist()
    X = X.astype({col: "string" for col in categorical_columns})

    # Encode categorical variables
    label_encoder = LabelEncoder()
    X[categorical_columns] = X[categorical_columns].apply(label_encoder.fit_transform)

    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

    # Compute ROC-AUC scores using cross-validation
    roc_auc_scores = cross_val_predict(
        rf_classifier, X, y, cv=n_folds, method="predict_proba"
    )[:, 1]

    # Transform the ROC-AUC scores using the given formula
    transformed_scores = np.maximum(0.5, roc_auc_scores) * 2 - 1

    # Calculate the average transformed ROC-AUC score
    avg_transformed_score = np.mean(transformed_scores)

    # Calculate the logistic detection (LD) score
    ld_score = 100 * (1 - avg_transformed_score)

    return ld_score
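
The comments in the snippet above describe a transform of ROC-AUC values, but `cross_val_predict(..., method="predict_proba")` returns per-row probabilities, not AUC scores. Below is a minimal sketch of the arithmetic, assuming the metric is the fold-averaged `max(0.5, AUC) * 2 - 1` transform stated in those comments, scaled to 0-100. The helper names are hypothetical, and the AUC is computed in plain Python only to keep the example self-contained.

```python
from statistics import mean


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count pairwise wins; ties count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def ld_score_from_fold_aucs(fold_aucs):
    """Apply the transform to one AUC per fold, then average."""
    transformed = [max(0.5, auc) * 2 - 1 for auc in fold_aucs]
    return 100 * (1 - mean(transformed))
```

With this shape, each cross-validation fold contributes one AUC computed from its held-out labels and predicted probabilities, and the transform is applied to those per-fold values. A score near 100 means the classifier cannot tell real from synthetic rows.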

I also have another question. I'm eager to understand the necessary specifications to replicate the results presented in the table above, both for the AirBnB and Rossman datasets. In your publication, I noted the hardware configuration: 2x AMD EPYC 7H12 64-Core Processor, 2x RTX 3090 GPU, and 1TB RAM, all running on Ubuntu 20.04 LTS.

However, I am inclined to believe that this configuration might be somewhat excessive, and I wonder if it's possible to achieve the same outcomes with a more modest setup, specifically tailored to reproducing only the results displayed in the aforementioned table. If this is indeed the case, I am genuinely interested in discovering the minimal configuration necessary for this task.

Thank you immensely for your assistance.

Is it possible to utilize Distributed Training and/or Parallel Sampling with the library?

Hi @avsolatorio,

I would like to know if it is possible to distribute the training on environment with multiple GPUs or even to multiple machines?

Also, is it possible to parallelizing the sampling operation with the library?

Currently, I have a single environment with one GPU and I run the training on Google Colab.

Thanks.

No train with sensitivity for Relational model?

Hi @avsolatorio,

I see that train with sensitivity happens only in Tabular model type? Don't we use sensitivity when training a relational model? Why is that?

https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/realtabformer.py#L456

rtf_checkpoints bug when fitting the GeoValidator example model

Running the REaLTabFormer_GeoValidator_Example.ipynb on Google Colab results in the following error during the rtf_model.fit(data, num_bootstrap=10):

/usr/local/lib/python3.10/dist-packages/realtabformer/realtabformer.py:834: UserWarning: No best model was saved. Loading the closest model to the sensitivity_threshold.
  warnings.warn(
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-9-d11ec56195b9> in <cell line: 2>()
      1 hf_logging.set_verbosity_error()
----> 2 rtf_model.fit(data, num_bootstrap=10)

6 frames
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash, **deprecated_kwargs)
    399         if not os.path.isfile(resolved_file):
    400             if _raise_exceptions_for_missing_entries:
--> 401                 raise EnvironmentError(
    402                     f"{path_or_repo_id} does not appear to have a file named {full_filename}. Checkout "
    403                     f"'https://huggingface.co/{path_or_repo_id}/{revision}' for available files."

OSError: rtf_checkpoints/not-best-disc-model does not appear to have a file named config.json. Checkout 'https://huggingface.co/rtf_checkpoints/not-best-disc-model/main' for available files.

I get the same error when fitting a different dataset on my local machine (macOS, Python 3.10.13).

The problem seems to be the latest version of the transformers library, since reverting to transformers==4.24.0 (from pyproject.toml) fixes the problem.

Running Realtab on Macs

Has anyone tried running this on a Mac M2 Ultra or Max?

I tried it on an M1 Pro. It chokes the memory. How can I check whether the training is using the GPU cores?
https://wandb.ai/capecape/pytorch-M1Pro/reports/PyTorch-Runs-On-the-GPU-of-Apple-M1-Macs-Now-Announcement-With-Code-Samples---VmlldzoyMDMyNzMz?galleryTag=ml-news

This article describes running PyTorch on the M1. I tried the same settings on an M1 Pro, but the memory was pretty much choked and I am not sure whether the training is using the GPU.

Early stopping with sensitivity vs validation loss metric and the effects on synthetic data quality.

Hi @avsolatorio,

Hope everything is well!

I have noticed that specifying n_critic=0 when training a tabular model is a way to disable training with sensitivity. In my use case and dataset, the threshold estimation for the sensitivity metric was too slow and needed large amounts of memory (more than 200GB of RAM). So I have tried to replace it with classic early stopping based on validation loss. Currently I use the following code:

parent_model = REaLTabFormer(model_type="tabular",
                             batch_size=8,
                             epochs=30,
                             gradient_accumulation_steps=1,
                             logging_steps=25,
                             save_strategy="epoch",          # CLASSIC EARLY STOPPING
                             evaluation_strategy="epoch",    # CLASSIC EARLY STOPPING
                             train_size=0.8,                 # CLASSIC EARLY STOPPING
                             early_stopping_patience=5,      # CLASSIC EARLY STOPPING
                             early_stopping_threshold=0,     # CLASSIC EARLY STOPPING
                             checkpoints_dir = MODEL_RUN_DIRECTORY_PATH / f'{table_name}_checkpoints')
							 
trainer = parent_model.fit(df=table_data,
                           n_critic=0,    # CLASSIC EARLY STOPPING
                           device='cuda')

trainer.state.save_to_json(MODEL_RUN_DIRECTORY_PATH / f'{table_name}_checkpoints' / "trainer_state.json")

parent_model.save(MODEL_RUN_DIRECTORY_PATH / f"{table_name}_model")

My question is: how much do we lose in quality if we use classic early stopping instead of sensitivity-based stopping criteria? I am not asking for an exact number, just to have an idea from your experience. Is training with sensitivity very different from classic early stopping in terms of the quality of synthetic data? I am asking this because, for example, the relational model doesn't use training with sensitivity. Furthermore, I have noticed a boost in performance, since sensitivity threshold estimation at the beginning of training is very slow in my data (even with many CPU cores when parallelization is used). I am thinking of using classic early stopping and looking for some validation that this will not significantly decrease the quality of the synthetic data. Of course, I will check it, but here I am asking for your insight first to get validation.

Lastly, here is a plot of training and validation loss from a run that used early stopping on validation loss (just an example):

Thanks!

Future warnings for AdamW and encoder-decoder loss (v4.12.0).

Hi everyone!

When training models with model.fit() we get the following two future warnings:

transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning

transformers/models/encoder_decoder/modeling_encoder_decoder.py:634: FutureWarning: Version v4.12.0 introduces a better way to train encoder-decoder models by computing the loss inside the encoder-decoder framework rather than in the decoder itself. You may observe training discrepancies if fine-tuning a model trained with versions anterior to 4.12.0. The decoder_input_ids are now created based on the labels, no need to pass them yourself anymore.

Are these two warnings safe to ignore for now? Thanks.
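If the warnings turn out to be harmless for a given setup, they can be hidden with Python's standard `warnings` module. The sketch below does not settle whether they are safe; it only shows how to filter them. The helper name `silence_known_future_warnings` is hypothetical, and the message prefixes are copied from the warning text quoted above.

```python
import warnings


def silence_known_future_warnings() -> None:
    """Ignore the two FutureWarnings quoted above, and nothing else."""
    # `message` is a regular expression matched against the start of the text.
    warnings.filterwarnings(
        "ignore",
        message=r"This implementation of AdamW is deprecated",
        category=FutureWarning,
    )
    warnings.filterwarnings(
        "ignore",
        message=r"Version v4\.12\.0 introduces a better way to train encoder-decoder models",
        category=FutureWarning,
    )
```

Calling the helper once before `model.fit()` should be enough; other FutureWarnings remain visible.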

See "IndexError: index out of range in self" when related_num parameter is specified in child model sampler

I observed that with default parameters, the generated child table has far fewer rows than the child training data, even with the same number of parent table rows. So I added a count column to the parent table for the parent model to learn, and then passed this column name to the child model sampler, expecting the generated child table to have the specified number of rows for each parent record. However, the child sampling ran for a while and then torch threw an "index out of range" exception. I checked that the count column contains only integer values >= 1.

Also, if I just set related_num to a large number instead of a column name, I see the same error.

Maximum number of columns limitation in tabular GPT-2 model?

Hi @avsolatorio,

I would like to ask if there is a limitation on the maximum number of columns that can be passed to a tabular model? Is there an upper limit? Is it going to fail in case there are many columns?

Of course, I am talking about the case of using the classic early stopping mechanism and not the critic one, because in the past we have seen that having the critic metric with high-dimensional data might lead to large memory consumption and many errors can occur.

So, my use case is fitting a tabular model with simple early stopping (no critic-sensitivity metric). Is GPT-2 going to fail during training or generation when given many input columns? My dataset has mixed types such as text, float, datetime, and int. The text columns are not lengthy; they are just categorical values.

Lastly, is it possible that the tabular model has such a limitation while the relational model does not? If that's the case, maybe I could fit a relational model instead, providing no parent, as I see here https://github.com/worldbank/REaLTabFormer/issues/22#issuecomment-1598082977. Also, in the past (#11) I remember you told me that the relational model has no limitation like the pre-trained GPT-2 does, but perhaps that applied only to the generation part?

Thanks!

Is there a way to overcome output max length limitation?

Hi @avsolatorio,

I see there is a limit on the maximum output length, and whenever a training example exceeds it, the example is skipped. Can we somehow overcome this and learn to generate examples with many child rows?

https://github.com/avsolatorio/REaLTabFormer/blob/bf1a38ef8f202372956ac57a363289c505967982/src/realtabformer/data_utils.py#L755
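To gauge how many training examples such a limit would drop, a rough estimate can be sketched as follows. It assumes a fixed number of tokens per child row and a flat token budget; the real tokenization in data_utils.py may count differently (for example, special tokens), and the helper name `parents_over_budget` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def parents_over_budget(
    child_parent_ids: Iterable[Hashable],
    tokens_per_child_row: int,
    output_max_length: int,
) -> List[Hashable]:
    """List parent ids whose concatenated child rows would exceed the budget."""
    rows_per_parent = Counter(child_parent_ids)
    return [
        parent_id
        for parent_id, n_rows in rows_per_parent.items()
        if n_rows * tokens_per_child_row > output_max_length
    ]
```

Passing the child table's foreign-key column as `child_parent_ids` gives the parents that would be skipped under these assumptions.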

Bug in model.sample() when column contains integer values while column type is string.

Hi @avsolatorio,

I think I have found a possible bug in tabular.sample() which might also be present in relational.sample().

I have trained a tabular model with a dataframe containing the columns below. When I sample from the model, the integer_as_str column of the sampled data contains new, out-of-sample values (values that cannot be found in the training data). I expected to see no new values there, because the column type is object and the underlying type is Python str. For the integer, float, and datetime columns I can see that new values are generated, which is fine for me.

Below, you will find a sample of the train dataframe:

| integer_as_str | integer | float | boolean | datetime            | string |
|----------------|---------|-------|---------|---------------------|--------|
| 3              | 6214    | 54.09 | false   | 2002-10-15 03:07:53 | qyjib  |
| 31             | 2997    | 39.15 | false   | 1999-05-18 01:09:18 | mjuvv  |
| 38             | 3362    | 52.91 | true    | 1999-08-27 10:44:03 | ffskd  |
| 47             | 2286    | 50.68 | false   | 1999-02-02 05:48:06 | evqml  |
| 24             | 14482   | 77.8  | true    | 2001-09-08 13:56:20 | wieai  |

What do you think of this? Is it a bug? Could you help us fix this?
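A quick way to confirm the behaviour is to compare the sampled column against the training column. This is a minimal sketch; the helper name `novel_values` is hypothetical, and the columns are passed as plain iterables of strings (for example `df["integer_as_str"]`).

```python
from typing import Iterable, Set


def novel_values(train_values: Iterable[str], synthetic_values: Iterable[str]) -> Set[str]:
    """Return synthetic values that never occur in the training column."""
    return set(synthetic_values) - set(train_values)
```

An empty result means every sampled value was seen during training.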

Bug when running tabular.fit() and tabular.sample() with CPU

Hello @avsolatorio,

There might be a bug when running tabular.fit() and tabular.sample() with device='cpu' (might also be a case in relational models, haven't tested).

I have trained a tabular model with CPU with a dataframe containing the columns in the following example. Their original data types were {integer_as_str: object[str], integer: int64, float: float64, boolean: bool, datetime: datetime64[ns], string: object[str]}.

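| integer_as_str | integer | float | boolean | datetime            | string |
|----------------|---------|-------|---------|---------------------|--------|
| 03             | 6214    | 54.09 | false   | 2002-10-15 03:07:53 | qyjib  |
| 31             | 2997    | 39.15 | false   | 1999-05-18 01:09:18 | mjuvv  |
| 38             | 3362    | 52.91 | true    | 1999-08-27 10:44:03 | ffskd  |
| 47             | 2286    | 50.68 | false   | 1999-02-02 05:48:06 | evqml  |
| 24             | 14482   | 77.8  | true    | 2001-09-08 13:56:20 | wieai  |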

In my case, I want to generate only values that are present in the training data, independently of their type. In other words, I don't want to generate new values that do not exist in the training data.

To achieve that, I experimented with adding a letter prefix to each value (see the transformation example below). I expected to see no new values in any of the columns. Instead, I got values belonging to another column's data type (ignoring the a_, b_, etc. prefixes). For example, in the datetime column I got the value b_2997 (a valid value, but for another column!), and in the float column I got e_1999-02-02 05:48:06 (again a valid value, but for another column!).

| integer_as_str | integer | float   | boolean | datetime              | string  |
|----------------|---------|---------|---------|-----------------------|---------|
| a_03           | b_6214  | c_54.09 | d_false | e_2002-10-15 03:07:53 | f_qyjib |
| a_31           | b_2997  | c_39.15 | d_false | e_1999-05-18 01:09:18 | f_mjuvv |
| a_38           | b_3362  | c_52.91 | d_true  | e_1999-08-27 10:44:03 | f_ffskd |
| a_47           | b_2286  | c_50.68 | d_false | e_1999-02-02 05:48:06 | f_evqml |
| a_24           | b_14482 | c_77.8  | d_true  | e_2001-09-08 13:56:20 | f_wieai |

Let me note here that everything works as expected when both tabular.fit() and tabular.sample() run with device='cuda'. What do you think of this? Maybe this is a bug that happens only on CPU?
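The prefix transformation described above, and a check for values that landed in the wrong column, can be sketched like this. Rows are plain dicts of strings standing in for dataframe rows; the helper names `add_prefixes` and `misplaced_values` are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def add_prefixes(rows: List[Row], prefixes: Dict[str, str]) -> List[Row]:
    """Prepend each column's prefix (e.g. "a_") to every value in that column."""
    return [{col: prefixes[col] + val for col, val in row.items()} for row in rows]


def misplaced_values(rows: List[Row], prefixes: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return (row_index, column, value) for values lacking their column's prefix."""
    return [
        (i, col, val)
        for i, row in enumerate(rows)
        for col, val in row.items()
        if not val.startswith(prefixes[col])
    ]
```

Running `misplaced_values` on the sampled rows lists every cell such as b_2997 appearing under datetime.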

Is there a rule of thumb for NUM_BOOTSTRAP?

Hi @avsolatorio,

In my experiments I have the default value (500) for the bootstrap rounds when estimating the sensitivity threshold. I see in the implementation that this process is very CPU-bound and utilizes multicore if possible.

In my environment I have 8 CPU cores, and on large tables it usually takes 1-2 hours to complete before training starts. All this time the GPU in my runtime environment sits idle, waiting for the sensitivity threshold estimation to complete. (Also, Colab sometimes disconnects the runtime because it notices that the runtime is mainly using the CPU.)

I know that setting this to a smaller value will make it run faster, but I wonder if there is a rule of thumb or if it is just a matter of trial and error. I understand that it is important to estimate this threshold correctly, as it is used for early stopping of the training.

Thanks!
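One empirical way to pick the number of rounds is to check how much the estimate moves across repeated runs. The sketch below uses a generic bootstrap quantile, not REaLTabFormer's sensitivity statistic; `bootstrap_threshold`, `threshold_spread`, the 0.95 quantile, and the use of the mean as the statistic are all assumptions for illustration.

```python
import random
import statistics
from typing import Callable, List, Sequence


def bootstrap_threshold(
    data: Sequence[float],
    statistic_fn: Callable[[Sequence[float]], float],
    num_bootstrap: int,
    quantile: float = 0.95,
    seed: int = 0,
) -> float:
    """Estimate a threshold as a quantile of a bootstrapped statistic."""
    rng = random.Random(seed)
    values: List[float] = sorted(
        statistic_fn(rng.choices(data, k=len(data))) for _ in range(num_bootstrap)
    )
    index = min(int(quantile * num_bootstrap), num_bootstrap - 1)
    return values[index]


def threshold_spread(data: Sequence[float], num_bootstrap: int, repeats: int = 20) -> float:
    """Standard deviation of the estimate across independently seeded repeats."""
    estimates = [
        bootstrap_threshold(data, statistics.mean, num_bootstrap, seed=s)
        for s in range(repeats)
    ]
    return statistics.stdev(estimates)
```

Comparing `threshold_spread` at, say, 50, 100, and 500 rounds on a subsample may indicate where the estimate stops changing much.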

Use different join columns (parent_join_on, child_join_on) in relational model fit method.

Hi @avsolatorio!

I was wondering if it would be easy for the relational model's fit() to support different columns for joining the data, something like what pandas merge supports (left_on, right_on):

DataFrame.merge(right, ..., left_on=None, right_on=None, ...)

Maybe you could add support for parent_join_on, child_join_on?

Currently, it supports only join_on:

child_model.fit(
    df=child_df,
    in_df=parent_df,
    join_on=join_on)

What do you think?
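Until something like that exists, one possible workaround is to rename the child's key column to match the parent's before calling fit(). The sketch below uses only pandas `DataFrame.rename`; the helper name `align_join_column` and its parameter names are hypothetical.

```python
import pandas as pd


def align_join_column(
    child_df: pd.DataFrame, parent_join_on: str, child_join_on: str
) -> pd.DataFrame:
    """Return a copy of the child table whose key column is named like the parent's."""
    if parent_join_on in child_df.columns and parent_join_on != child_join_on:
        raise ValueError(f"child table already has a column named {parent_join_on!r}")
    return child_df.rename(columns={child_join_on: parent_join_on})
```

The aligned table can then be passed as `df` with `join_on` set to the parent's column name; the original child dataframe is left unchanged.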

Generated datetime value in the child table is invalid

It seems that with a smaller output_max_length parameter, the generated sample has invalid datetime values. Instead of datetime strings, they are numbers whose meaning is unclear, and sometimes the number is quite long. When running on a small training dataset with about 10k records and output_max_length=4096, I didn't encounter this issue. But with a large training dataset with 1M records and output_max_length=512 (or 1000), it occurs. Is it because the output_max_length truncation somehow corrupted the datetime tokens?

transactionDateTime
12944829
12863510
12103586
2293176
12628443
12244450294269574381
12538574
18949556
12353260
16880274
9405910
10618463
10250250
872232
2677221
12122979
120384286117778
16718277
11588905
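To measure how often this happens, the generated column can be checked against the expected format. This is a minimal sketch; the format string `%Y-%m-%d %H:%M:%S` is an assumption (the training format of transactionDateTime is not shown here), and the helper name `invalid_datetimes` is hypothetical.

```python
from datetime import datetime
from typing import Iterable, List


def invalid_datetimes(values: Iterable[str], fmt: str = "%Y-%m-%d %H:%M:%S") -> List[str]:
    """Return the values that do not parse as datetimes in the given format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad
```

Comparing the share of invalid values across different output_max_length settings would show whether truncation is the likely cause.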

Bug in REaLTabFormer.sample() when relational model generates no data

Hi @avsolatorio!

I have a super urgent bug ticket that I need to solve ASAP, and I have solved it here. I wonder if you could just push it and create a new PyPI release.

It is related to the _processes_sample() method in the library.

Bug:

I have trained a relational model, and sometimes there is not enough data to learn the relationship, so sampling from the relational model produces an empty synth_sample. As a consequence, the line if synth_sample[col].iloc[0].startswith(col): fails with an exception, because it assumes that a row at iloc 0 always exists: https://github.com/worldbank/REaLTabFormer/blob/main/src/realtabformer/rtf_sampler.py#L449

The _processes_sample() function is also used when sampling a tabular model, so this can probably happen there too. In either case (tabular/relational), 2-3 lines below the failing line (L449) the function already (and wisely) checks whether the sampled data synth_df is empty and, if so, raises the SampleEmptyError exception.

Resolution:

So I think we can add a similar check above the line https://github.com/worldbank/REaLTabFormer/blob/main/src/realtabformer/rtf_sampler.py#L443, checking the synth_sample dataframe instead, something like:

if synth_sample.empty:
    # Handle this exception in the sampling function.
    raise SampleEmptyError(in_size=len(sample_outputs))

Then in my application I can catch the specific exception:

from realtabformer.rtf_exceptions import SampleEmptyError

try:
    ...  # run some REaLTabFormer sampling code
except SampleEmptyError as e:
    ...  # handle the specific exception

What do you think?

Can you do the fix and build a new PyPI package?

Thanks!
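On the application side, catching the exception could look like a small retry wrapper. This is only a sketch: `sample_with_retries` is a hypothetical helper, the exception type is passed in (it would be `SampleEmptyError` in practice), and whether retrying helps depends on why the sample came back empty.

```python
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def sample_with_retries(
    sample_fn: Callable[[], T],
    empty_error: Type[Exception],
    max_attempts: int = 3,
) -> T:
    """Call `sample_fn` until it succeeds; re-raise after `max_attempts` empty samples."""
    for attempt in range(1, max_attempts + 1):
        try:
            return sample_fn()
        except empty_error:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")
```

Here `sample_fn` would be a zero-argument callable wrapping the model's sample() call.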

Can the model learn relationships of columns in long distance tables?

@avsolatorio Hi!

Assume the following 3 tables:

Table A
Column A.1
Column A.2
Column A.3

Table B
Column B.1
Column B.2
Column B.3

Table C
Column C.1
Column C.2
Column C.3

In my example I skip the primary and foreign keys for simplicity.

The relationships are:

Table A [1..N] Table B
Table B [1..N] Table C

The tabular (parent) model type can be used to model each table separately, and the relational model type can be used to model the two relationships above. The parent and child models will capture column relationships/correlations that exist inside each table or relationship. But what happens with correlations between columns of Table A and Table C, or between any pair of columns from any pair of tables in the database? How can someone learn such relationships? Is it possible with REaLTabFormer?

What if Table A's Column A.1 is correlated with Table C's Column C.2? Each child model is conditioned only on the parent row and thus can learn dependencies only between directly connected tables.

What do you think? Is it possible to overcome this somehow?
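One possible workaround, offered as an assumption rather than a confirmed REaLTabFormer feature, is to copy the relevant Table A columns onto each Table B row before modelling the B-to-C relationship, so the child model for Table C sees them as part of its parent row. Rows are plain dicts here; the key name `a_id` and the helper `carry_down` are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def carry_down(
    parents: List[Row], children: List[Row], key: str, columns: List[str]
) -> List[Row]:
    """Copy selected parent columns onto each child row that references the parent."""
    by_key = {parent[key]: parent for parent in parents}
    return [
        {**child, **{col: by_key[child[key]][col] for col in columns}}
        for child in children
    ]
```

The cost is that Table B grows wider and the copied columns are repeated for every child of the same Table A row.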

_validate_get_device() could be nice to be called also in model.sample() and model.predict()

Hi!

I have noticed that _validate_get_device() is called only in model.fit(). It would be nice if it were also called in model.sample() and model.predict(), so that there is no need to pass the device argument to use the CPU when CUDA is not available. model.fit() already calls it and needs no device argument; it automatically detects what to use.
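The requested behaviour amounts to a device fallback like the one below. This is not the actual implementation of _validate_get_device(); `resolve_device` is a hypothetical helper that relies only on `torch.cuda.is_available()` and also works when torch is not installed.

```python
from typing import Optional


def resolve_device(device: Optional[str] = None) -> str:
    """Return the requested device, or "cuda" when available, else "cpu"."""
    if device is not None:
        return device
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Until the library does this itself, the result can be passed explicitly as the device argument of sample() and predict().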

New PyPi release

Hi @avsolatorio,

I was wondering if you can create a new version in PyPi for the library.

Specifically, what I want is the following fix which is in main branch but is missing from the latest version (0.1.1):

bf1a38e

Is it possible to have this? Thanks!

Unable to complete training in colab

I am trying to run the model on California pricing tabular data, but I keep hitting this issue even after installing the libraries.

Am I missing anything?

Bug in model.sample() when column contains integer values while column type is string.

Hi @avsolatorio,

I had to recreate this issue because for some reason couldn't reopen the original one.

I have tested the fix from the main branch but it seems it is not working as expected. It continues to generate novel/new values when the column is string and contains numerical values.

I have added a zip with a notebook that demonstrates the case.

What do you think?

Originally posted by @echatzikyriakidis in #31 (comment)

GPU CUDA: Out Of Memory when training many models

Hi @avsolatorio,

I am training multiple tabular and relational models sequentially in a single Colab notebook with a GPU runtime (I have Google Colab Pro+), and after some time I get a CUDA out of memory error in one of my models. How can I dispose of/release a REaLTabFormer model after training to free the GPU memory it occupies?

Thanks!
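A common PyTorch pattern for this is to drop every reference to the model and trainer (for example with `del`), run the garbage collector, and then empty the CUDA cache. The sketch below covers the last two steps; `free_gpu_memory` is a hypothetical helper, and whether REaLTabFormer keeps other references alive internally is not confirmed here.

```python
import gc


def free_gpu_memory() -> bool:
    """Collect garbage and release cached CUDA memory; return True if CUDA was used."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False
```

The helper only has an effect after the caller has deleted its own references, since memory held by live tensors cannot be released.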

No progress in model training for large number of columns

I'm trying to train the model on transactional data that has around 78 columns, a mix of numerical, text, and categorical. I have tried changing the training size by varying the number of rows and columns, but training still does not progress at all.
