
Comments (7)

HeimingX commented on May 17, 2024

Hi, thanks for the impressive work and for sharing the code.
I have also met a similar problem related to the back-translation pre-processing.

  1. For the DBpedia and IMDB datasets, many texts contain more than 1024 words, which, without any pre-processing, raises the error: "fairseq exception size of sample #0 is invalid (=(1162, 0)) since max_positions=(1024, 1024), skip this example with --skip-invalid-size-inputs-valid-test". Did you also meet this problem, and how did you fix it? I tried keeping the first 1024 words of each text before doing the back translation, but I am not sure whether that is suitable.

  2. It seems that generating the back-translated data is really time-consuming, and the generation also involves some randomness. I am not sure whether this affects the performance, so I wonder whether you would be kind enough to provide all the back-translation data you used.

Many thanks and happy new year!

from mixtext.

jiaaoc commented on May 17, 2024

Hi, it may be due to the version of fairseq; I did not encounter that error if I remember correctly. But in that case, for DBpedia you could keep the first 1024 words, and for IMDB you could keep the last 1024 words.
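A minimal sketch of this dataset-specific truncation (the helper name and the simple whitespace-based word split are assumptions for illustration, not MixText's actual preprocessing code):

```python
def truncate_for_backtranslation(text, max_words=1024, keep="head"):
    """Truncate a document to at most max_words whitespace-separated words.

    keep="head" keeps the first max_words (suggested above for DBpedia);
    keep="tail" keeps the last max_words (suggested above for IMDB, where
    reviewers often summarize their sentiment at the end of the review).
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    kept = words[:max_words] if keep == "head" else words[-max_words:]
    return " ".join(kept)

# Example: a 1500-word document trimmed before back-translation
doc = " ".join(f"w{i}" for i in range(1500))
head = truncate_for_backtranslation(doc, keep="head")  # first 1024 words
tail = truncate_for_backtranslation(doc, keep="tail")  # last 1024 words
```

Truncating before back-translation also avoids the fairseq max_positions error quoted above, since every input then fits within the model's 1024-position limit.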

Yes, back-translation is a time-consuming process. I will try to find the back-translated data for the other three datasets, but I cannot guarantee that I can recover them, since the servers storing them have expired. Also, I believe the performance would not vary much across different back-translations.


HeimingX commented on May 17, 2024

Hi, thank you for your timely reply.

For the IMDB dataset, is there a reason you recommend keeping the last 1024 words rather than the first 1024?


jiaaoc commented on May 17, 2024

We found that keeping the last 1024/512 words gives better performance, probably because people often summarize their rating/sentiment at the end of IMDB reviews.


HeimingX commented on May 17, 2024

Hi, thanks again for your prompt and helpful response. I will give it a try, and I am still looking forward to having all of your back-translation augmented data open-sourced as soon as possible. Many thanks! Cheers!


HeimingX commented on May 17, 2024

We found that keeping the last 1024/512 words gives better performance, probably because people often summarize their rating/sentiment at the end of IMDB reviews.

Hi, I have one more question about the IMDB dataset. Since you use the last 1024/512 words for the back-translation augmentation, should we also keep the last MAX_SEQ_LEN tokens of the original texts before feeding them into BERT? (According to the code, it seems tokens are always taken from the head.) Also, do all four datasets share the same MAX_SEQ_LEN (256)? Thanks a lot.


jiaaoc commented on May 17, 2024

Yes, you could keep the last MAX_SEQ_LEN tokens of the original texts before feeding them into BERT.
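This tail-keeping truncation can be sketched as follows (a minimal illustration on raw token-id lists; the helper name is an assumption, and MAX_SEQ_LEN = 256 is taken from the question above rather than confirmed for all datasets):

```python
MAX_SEQ_LEN = 256  # value mentioned in the question above; verify per dataset

def keep_tail_tokens(token_ids, max_len=MAX_SEQ_LEN):
    """Keep the last max_len tokens instead of the usual head truncation,
    mirroring the tail-keeping strategy suggested for IMDB, where the
    sentiment is often summarized at the end of the review."""
    if len(token_ids) <= max_len:
        return token_ids
    return token_ids[-max_len:]

ids = list(range(400))        # stand-in for a tokenizer's output ids
tail = keep_tail_tokens(ids)  # the last 256 token ids
```

In practice the same slicing would be applied to the tokenizer output (and any special tokens such as [CLS]/[SEP] re-attached afterwards) before building the BERT input.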

