GithubHelp home page GithubHelp logo

Comments (2)

avsolatorio avatar avsolatorio commented on May 13, 2024

Hello @liu305! In the implementation, we convert datetime values into their equivalent timestamp values. This allows us to model the date as numeric data and generate "new" dates. Unfortunately, a timestamp is represented by a relatively long sequence of integers (~10 digits). So, depending on your data's dimensionality, it could be truncated when you have a shorter output_max_length.

I wonder what your raw datetime values look like. You may use other ways of representing your data. For example, split the year, month, and day into different columns. Then, if the year is constant, you can remove that column. You will just need to do post-processing afterward.

For reference, the following is the specific code for handling datetime values.

def process_datetime_data(
    series, transform_data: Dict = None
) -> Tuple[pd.Series, Dict]:
    # Get the max_len from the current time.
    # This will be ignored later if the actual max_len
    # is shorter.
    max_len = len(str(int(time.time())))


    # Convert the datetimes to
    # their equivalent timestamp values.


    # Make sure that we don't convert the NaT
    # to some integer.
    series = series.copy()
    series.loc[series.notnull()] = (series[series.notnull()].view(int) / 1e9).astype(
        int
    )
    series = series.fillna(pd.NA)


    # Take the mean value to re-align the data.
    # This will help reduce the scale of the numeric
    # data that will need to be generated. Let's just
    # add this offset back later before casting.
    mean_date = None


    if transform_data is None:
        mean_date = int(series.mean())
        series -= mean_date
    else:
        # The mean_date should have been
        # stored during fitting.
        series -= transform_data["mean_date"]


    # Then apply the numeric data processing
    # pipeline.
    series, transform_data = process_numeric_data(
        series,
        max_len=max_len,
        numeric_precision=0,
        transform_data=transform_data,
    )


    # Store the `mean_date` here because `process_numeric_data`
    # expects a None transform_data during fitting.
    if mean_date is not None:
        transform_data["mean_date"] = mean_date


    return series, transform_data```

from realtabformer.

liu305 avatar liu305 commented on May 13, 2024

Hi @avsolatorio. Thank you very much for your timely response! Correspondingly I have some further questions below which I would appreciate your input.

  1. Probably in my previous experiments I just used some high cardinality variables as they are, which made the vocabulary size so huge! Do you think a smaller vocabulary size would help in this case (so that output_max_length requirement can be relaxed)?
  2. In the same experiment, I also saw that in some other columns of generated data there are invalid values. For example, in the code column I even see name strings, which definitely should belong to another column. Is it also because of the same reason as the invalid datetime values, which is that output_max_length truncation makes things wrong.
  3. Do you think column ordering matters. For example, currently datetime column is the last column. If I move it to the first column, will it be less impacted by the output_max_length?

Best Regards,

from realtabformer.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.