Comments (3)

spencernelsonucla commented on May 12, 2024

@petulla and @mikiwz - I think I found the reason for this issue. In this line:
https://github.com/georgian-io/Multimodal-Toolkit/blob/master/multimodal_transformers/data/load_data.py#L228

The package concatenates the train, val, and test dfs. Then, if you're processing the categorical features via one-hot encoding, which is the default, it will one-hot encode ALL of those dfs together.

For example, say your train df has a categorical feature with values ["a", "b"]. This gets one-hot encoded as 2 separate columns (a and b). Now say your test data has values ["a", "c"]. The way this is currently packaged, the train and test data are concatenated first, so one-hot encoding produces 3 columns (a, b, and c). But if you load your test dataset separately, only "a" and "c" get one-hot encoded, giving 2 columns instead of 3. This is the issue: the model was trained on 3 columns, but you're giving it 2 columns to predict with.
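
A minimal sketch of that mismatch, using plain pandas rather than the toolkit's own encoder. The column name `cat` and the toy dataframes are made up for illustration; only the category values come from the example above.

```python
import pandas as pd

# Toy stand-ins for the train and test dataframes (hypothetical column name "cat").
train = pd.DataFrame({"cat": ["a", "b"]})
test = pd.DataFrame({"cat": ["a", "c"]})

# Encoding after concatenation sees every category: a, b, and c.
combined = pd.concat([train, test], ignore_index=True)
combined_cols = list(pd.get_dummies(combined, columns=["cat"]).columns)

# Encoding the test data on its own only sees a and c.
test_only_cols = list(pd.get_dummies(test, columns=["cat"]).columns)

print(combined_cols)   # ['cat_a', 'cat_b', 'cat_c']
print(test_only_cols)  # ['cat_a', 'cat_c']
```

The two encodings differ in width (3 vs. 2), which is the shape mismatch the model runs into at prediction time.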

The way around this is to either not use categorical data, or use label encoding instead:

test_dataset_2 = load_data(
    test_data,
    data_args.column_info['text_cols'],
    tokenizer,
    label_col=data_args.column_info['label_col'],
    label_list=data_args.column_info['label_list'],
    numerical_cols=data_args.column_info['num_cols'],
    sep_text_token_str=tokenizer.sep_token,
    categorical_encode_type="label"
)
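
To illustrate why label encoding sidesteps the width problem, here is a rough sketch in plain pandas. It is not the toolkit's implementation; the helper names `one_hot` and `label_encode` and the column name `cat` are hypothetical.

```python
import pandas as pd

train_and_test = ["a", "b", "a", "c"]  # categories seen when the dfs are concatenated
test_only = ["a", "c"]                 # categories seen when test is loaded alone

def one_hot(values):
    # One column per distinct category.
    return pd.get_dummies(pd.Series(values, name="cat"))

def label_encode(values):
    # One integer code per row, always a single column.
    return pd.Series(pd.Categorical(values).codes, name="cat").to_frame()

one_hot_widths = (one_hot(train_and_test).shape[1], one_hot(test_only).shape[1])
label_widths = (label_encode(train_and_test).shape[1], label_encode(test_only).shape[1])

print(one_hot_widths)  # (3, 2)
print(label_widths)    # (1, 1)
```

The label-encoded feature is one column wide no matter which categories are present, so the input shape stays the same. Note that in this sketch the integer codes themselves still depend on the categories seen ("c" is coded 2 in the combined data but 1 on its own), so this only shows that the shapes line up.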

from multimodal-toolkit.

petulla commented on May 12, 2024

I have this same issue when trying to use load_data() on separate test and train dataframes. Weirdly, one of the two `tabular_torch_dataset.TorchTextDataset` objects returned from load_data will train; the other will not.

I have to load from file and set things up exactly as in the colab to make it work.

@codeKgu Seems like you put a ton of work into this repo. Would be great to get this fixed.


akashsaravanan-georgian commented on May 12, 2024

Closing as this has been answered.

