
Comments (7)

HeimingX commented on May 17, 2024

Hi, thanks for the impressive work and for sharing the code.
I have also met a similar problem related to the back-translation pre-processing.

  1. For the DBpedia and IMDB datasets, many texts contain more than 1024 words, which, without any pre-processing, raises the error: "fairseq exception size of sample #0 is invalid (=(1162, 0)) since max_positions=(1024, 1024), skip this example with --skip-invalid-size-inputs-valid-test". Did you also meet this problem, and how did you fix it? I tried keeping the first 1024 words of each text before doing the back translation, but I am not sure whether that is suitable.

  2. It seems that generating the back-translated data is really time-consuming, and the generation also involves some randomness. I am not sure whether this affects the performance, so I wonder whether you would be kind enough to provide all the back-translation data you used.

Many thanks and happy new year!

from mixtext.

jiaaoc commented on May 17, 2024

Hi, it may be due to the version of fairseq; I did not encounter that error if I remember correctly. But in that case, for DBpedia you could keep the first 1024 words, and for IMDB you could keep the last 1024 words.
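A minimal sketch of this dataset-specific truncation (the helper name and the simple whitespace-based word split are assumptions for illustration, not MixText's actual preprocessing code):

```python
def truncate_for_backtranslation(text, max_words=1024, keep="head"):
    """Truncate a document to at most max_words whitespace-separated words.

    keep="head" keeps the first max_words (suggested above for DBpedia);
    keep="tail" keeps the last max_words (suggested above for IMDB, where
    reviewers often summarize their sentiment at the end of the review).
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    kept = words[:max_words] if keep == "head" else words[-max_words:]
    return " ".join(kept)

# Example: a 1500-word document trimmed before back-translation
doc = " ".join(f"w{i}" for i in range(1500))
head = truncate_for_backtranslation(doc, keep="head")  # first 1024 words
tail = truncate_for_backtranslation(doc, keep="tail")  # last 1024 words
```

Truncating before back-translation also avoids the fairseq max_positions error quoted above, since every input then fits within the model's 1024-position limit.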

Yes, back-translation is a time-consuming process. I will try to find the back-translated data for the other three datasets, but I cannot guarantee that I can recover them, since the servers storing them have expired. Also, I believe the performance would not vary much across different back-translations.


HeimingX commented on May 17, 2024

Hi, thank you for your timely reply.

For the IMDB dataset, is there a reason you recommend keeping the last 1024 words rather than the first 1024?


jiaaoc commented on May 17, 2024

We found that keeping the last 1024/512 words gives better performance, probably because people often summarize their rating/sentiment at the end of IMDB reviews.


HeimingX commented on May 17, 2024

Hi, thanks again for your prompt and helpful response. I will give it a try, and I am still looking forward to having all of your back-translation augmented data open-sourced as soon as possible. Many thanks! Cheers!


HeimingX commented on May 17, 2024

We found that keeping the last 1024/512 words gives better performance, probably because people often summarize their rating/sentiment at the end of IMDB reviews.

Hi, I have one more question about the IMDB dataset. Since you use the last 1024/512 words for the back-translation augmentation, should we also keep the last MAX_SEQ_LEN tokens of the original texts before feeding them into BERT? (According to the code, it seems tokens are always taken from the head.) Also, do all four datasets share the same MAX_SEQ_LEN (256)? Thanks a lot.


jiaaoc commented on May 17, 2024

Yes, you could keep the last MAX_SEQ_LEN tokens of the original texts before feeding them into BERT.
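This tail-keeping truncation can be sketched as follows (a minimal illustration on raw token-id lists; the helper name is an assumption, and MAX_SEQ_LEN = 256 is taken from the question above rather than confirmed for all datasets):

```python
MAX_SEQ_LEN = 256  # value mentioned in the question above; verify per dataset

def keep_tail_tokens(token_ids, max_len=MAX_SEQ_LEN):
    """Keep the last max_len tokens instead of the usual head truncation,
    mirroring the tail-keeping strategy suggested for IMDB, where the
    sentiment is often summarized at the end of the review."""
    if len(token_ids) <= max_len:
        return token_ids
    return token_ids[-max_len:]

ids = list(range(400))        # stand-in for a tokenizer's output ids
tail = keep_tail_tokens(ids)  # the last 256 token ids
```

In practice the same slicing would be applied to the tokenizer output (and any special tokens such as [CLS]/[SEP] re-attached afterwards) before building the BERT input.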

