Comments (2)
Hello @liu305! In the implementation, we convert datetime values into their equivalent timestamp values. This allows us to model the date as numeric data and generate "new" dates. Unfortunately, a timestamp is a relatively long sequence of digits (~10 digits). So, depending on your data's dimensionality, it could be truncated when you have a shorter `output_max_length`.
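To make the "~10 digits" point concrete, here is a minimal sketch that counts the digits in one epoch-seconds timestamp. The sample date is a hypothetical value chosen for illustration, not one taken from the reported data.

```python
import pandas as pd

# Hypothetical sample value (an assumption for illustration only).
ts = pd.Timestamp("2023-05-17 08:30:00")

# Seconds since the Unix epoch, as an integer.
epoch_seconds = int(ts.timestamp())

# Number of digits needed to write out this single value.
n_digits = len(str(epoch_seconds))
print(epoch_seconds, n_digits)
```

Every additional digit takes up room in the generated sequence, which is why a tight `output_max_length` can cut a row off before the datetime value is complete.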
Could you share what your raw datetime values look like? You could also represent the dates differently. For example, split the year, month, and day into separate columns. Then, if the year is constant, you can drop that column. You will just need to reassemble the date in post-processing after sampling.
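The splitting idea could look roughly like the sketch below. The helper names `split_date` and `join_date` and the column name `event_date` are hypothetical; they are not part of REaLTabFormer, and the sample dates are made up.

```python
import pandas as pd

# Hypothetical data; the column name "event_date" is an assumption.
df = pd.DataFrame(
    {"event_date": pd.to_datetime(["2022-01-15", "2022-07-04", "2022-12-31"])}
)


def split_date(frame: pd.DataFrame, col: str) -> pd.DataFrame:
    """Replace a datetime column with separate year, month, and day columns."""
    out = frame.drop(columns=[col])
    out[f"{col}_year"] = frame[col].dt.year
    out[f"{col}_month"] = frame[col].dt.month
    out[f"{col}_day"] = frame[col].dt.day
    return out


def join_date(frame: pd.DataFrame, col: str, constant_year=None) -> pd.DataFrame:
    """Rebuild the datetime column; use constant_year if the year was dropped."""
    parts = pd.DataFrame(
        {
            "year": frame[f"{col}_year"] if constant_year is None else constant_year,
            "month": frame[f"{col}_month"],
            "day": frame[f"{col}_day"],
        }
    )
    part_cols = [f"{col}_year", f"{col}_month", f"{col}_day"]
    out = frame.drop(columns=part_cols, errors="ignore")
    # Invalid combinations (for example, February 30) become NaT.
    out[col] = pd.to_datetime(parts, errors="coerce")
    return out


split = split_date(df, "event_date")

# If the year never changes, remember it and drop the column before training.
constant_year = None
if split["event_date_year"].nunique() == 1:
    constant_year = int(split["event_date_year"].iloc[0])
    split = split.drop(columns=["event_date_year"])

# After sampling, apply the inverse step to the synthetic rows.
restored = join_date(split, "event_date", constant_year=constant_year)
```

Here `split` stands in for the frame you would train on, and `join_date` would be applied to the sampled output. Because the model may generate day and month combinations that do not exist, the sketch coerces invalid dates to `NaT` so they can be filtered out afterward.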
For reference, the following is the specific code for handling datetime values.
```python
def process_datetime_data(
    series, transform_data: Dict = None
) -> Tuple[pd.Series, Dict]:
    # Get the max_len from the current time.
    # This will be ignored later if the actual max_len
    # is shorter.
    max_len = len(str(int(time.time())))

    # Convert the datetimes to
    # their equivalent timestamp values.

    # Make sure that we don't convert the NaT
    # to some integer.
    series = series.copy()
    series.loc[series.notnull()] = (series[series.notnull()].view(int) / 1e9).astype(
        int
    )
    series = series.fillna(pd.NA)

    # Take the mean value to re-align the data.
    # This will help reduce the scale of the numeric
    # data that will need to be generated. Let's just
    # add this offset back later before casting.
    mean_date = None

    if transform_data is None:
        mean_date = int(series.mean())
        series -= mean_date
    else:
        # The mean_date should have been
        # stored during fitting.
        series -= transform_data["mean_date"]

    # Then apply the numeric data processing
    # pipeline.
    series, transform_data = process_numeric_data(
        series,
        max_len=max_len,
        numeric_precision=0,
        transform_data=transform_data,
    )

    # Store the `mean_date` here because `process_numeric_data`
    # expects a None transform_data during fitting.
    if mean_date is not None:
        transform_data["mean_date"] = mean_date

    return series, transform_data
```
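As a rough illustration of why the mean is subtracted, the standalone sketch below compares digit counts before and after centering. It is not the library's code: the sample dates and the `n_digits` helper are assumptions made for this example, and it ignores the rest of the numeric pipeline, including how the sign is handled.

```python
import pandas as pd

# Hypothetical dates that all fall within one year.
dates = pd.Series(pd.to_datetime(["2022-01-15", "2022-07-04", "2022-12-31"]))

# Whole seconds since the Unix epoch.
seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Center on the integer mean, as the function above does during fitting.
mean_seconds = int(seconds.mean())
centered = seconds - mean_seconds


def n_digits(values: pd.Series) -> int:
    """Digits needed for the largest magnitude in the series (sign ignored)."""
    return len(str(int(values.abs().max())))


raw_digits = n_digits(seconds)
centered_digits = n_digits(centered)
print(raw_digits, centered_digits)

# Adding the stored offset back recovers the original timestamps.
restored = pd.to_datetime(centered + mean_seconds, unit="s")
```

The narrower the date range in the data, the more digits centering saves. Data spanning several decades would still need close to the full length.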
from realtabformer.
Hi @avsolatorio. Thank you very much for your timely response! I have some follow-up questions below and would appreciate your input.
- In my previous experiments, I probably used some high-cardinality variables as they were, which made the vocabulary size huge. Do you think a smaller vocabulary size would help in this case (so that the `output_max_length` requirement can be relaxed)?
- In the same experiment, I also saw invalid values in some other columns of the generated data. For example, in the code column I even see name strings, which definitely belong to another column. Is this caused by the same thing as the invalid datetime values, namely `output_max_length` truncation?
- Do you think column ordering matters? For example, the datetime column is currently the last column. If I move it to the first position, will it be less affected by `output_max_length`?
Best Regards,