
How to reproduce results from paper? (lmm, closed)

j0ma commented on July 28, 2024
How to reproduce results from paper?


Comments (7)

j0ma commented on July 28, 2024

Friendly ping to @d-ataman in a separate comment to make sure there is a notification. :)


d-ataman commented on July 28, 2024

Hi J0MA,

Thanks for your interest in the code! I updated the example scripts so you can use the same arguments for training/translation. No, you don't need any dependencies from OpenNMT-py or the CharNMT repo; that was the version that implements the character LSTM in the encoder. The lmm repository should be sufficient to run the code. Please check the requirements file to verify that you have all the libraries installed.
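
For reference, a minimal environment setup could look like the following sketch (the repository URL and the requirements file name are assumptions based on this thread):

```bash
# Sketch only: repository URL and file name are assumptions from the thread.
git clone https://github.com/d-ataman/lmm.git
cd lmm
pip install -r requirements.txt
```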

Note that the translation script (in translate/Translator.py) uses the hierarchical beam search algorithm, which does not support batch translation, so you should use the arguments given in the examples when translating. I have to warn you that this is quite slow.
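
As a rough sketch, a non-batched translation run could look like the following; the flag names follow OpenNMT-py conventions, from which this codebase derives, and are assumptions here, so take the authoritative arguments from the example scripts:

```bash
# Sketch only: flag names follow OpenNMT-py conventions and may differ in lmm.
# Hierarchical beam search decodes one sentence at a time, hence batch size 1.
python translate.py -model model.pt -src test.bpe.en -output pred.txt \
    -beam_size 5 -batch_size 1
```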

This is where you can find the data: https://wit3.fbk.eu/. Of course, you would need to convert it to txt. You should tokenize/lowercase/truecase both sides of the corpora, then apply BPE with 16,000 merge rules on the source (EN) side; you can leave the target-language files in their original word format. Then run preprocess.sh in the examples directory: the code will automatically load the data, make subword batches for the source, and separate target sentences into word- and character-level batches.
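
In outline, the pipeline described above runs in this order (file names are placeholders; the concrete commands for the individual steps are sketched later in the thread):

```bash
# High-level sketch of the preprocessing order; file names are placeholders.
# 1. Tokenize and lowercase/truecase BOTH sides of the corpus (Moses scripts).
# 2. Learn and apply BPE with 16k merges on the SOURCE (EN) side only.
# 3. Leave the target side as plain words; the code builds the word/character
#    batches itself.
# 4. Run the provided preprocessing script from the examples directory.
bash examples/preprocess.sh
```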

Let me know if you encounter any problems!


j0ma commented on July 28, 2024

Hi again, and thank you for your response!

A few follow-up questions:

  1. I managed to download the dataset from the WIT3 website, along with the processing tools written in Perl. I assume these tools would be useful for preprocessing and converting to txt?
  2. Regarding tokenization/lowercasing/truecasing the sentence txt files -- is there a specific tool you used for this?
  3. You mentioned BPE with 16k merge operations, but Section 4.3 of the paper mentions a BPE size of 32k (see below). Can you clarify?

[screenshot of Section 4.3 of the paper, mentioning the 32k BPE size]

  4. Finally, is there a specific tool you use for running BPE? I know there are many implementations available, e.g. subword-nmt, SentencePiece, etc. I did notice that onmt.io.Wordbatch contains a split_bpe() method, but that doesn't seem like the right function. OpenNMT-py currently has an implementation for learning BPE, but it seems it's not included in the lmm repository.

Thanks so much again for your help! Looking forward to getting this reproduction going. :)


j0ma commented on July 28, 2024

Hi again @d-ataman !

Regarding the IWSLT datasets, are they also accessible from the WIT3 website? I can only see data going back to 2011 when I visit https://wit3.fbk.eu, and not all of those releases even have dev/test sets available (based on the descriptions).

Thanks!


d-ataman commented on July 28, 2024

Hi @j0ma

Yes, you can download the data from WIT3.
For preprocessing you can use the Moses scripts:

Tokenization and Lowercasing: https://github.com/moses-smt/mosesdecoder/tree/8c5eaa1a122236bbf927bde4ec610906fea599e6/scripts/tokenizer

Truecasing: https://github.com/moses-smt/mosesdecoder/tree/8c5eaa1a122236bbf927bde4ec610906fea599e6/scripts/recaser
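
As a concrete sketch, the usual invocations of these Moses scripts look like the following; corpus names such as train.en and the target-language code xx are placeholders:

```bash
# Tokenize (-l sets the language code); replace xx with your target language.
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l xx < train.xx > train.tok.xx

# Either lowercase the tokenized text...
perl mosesdecoder/scripts/tokenizer/lowercase.perl < train.tok.en > train.lc.en

# ...or train a truecasing model and apply it.
perl mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase.en --corpus train.tok.en
perl mosesdecoder/scripts/recaser/truecase.perl --model truecase.en < train.tok.en > train.tc.en
```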

For subword segmentation I use the subword-nmt scripts: https://github.com/rsennrich/subword-nmt. For the English side you can use 16k merge rules; you don't need to segment the target side. The 32k setting was used for the larger training data.
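
For completeness, a typical subword-nmt run with 16k merge operations on the English side could look like this (file names are placeholders, continuing the sketch above):

```bash
# Learn a BPE model with 16k merge operations on the tokenized/truecased EN side.
subword-nmt learn-bpe -s 16000 < train.tc.en > bpe.codes.en

# Apply the learned codes to every EN split; the target side stays unsegmented.
subword-nmt apply-bpe -c bpe.codes.en < train.tc.en > train.bpe.en
subword-nmt apply-bpe -c bpe.codes.en < dev.tc.en > dev.bpe.en
subword-nmt apply-bpe -c bpe.codes.en < test.tc.en > test.bpe.en
```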

I also updated the example scripts for you.

Best wishes,
Duygu


j0ma commented on July 28, 2024

Thanks again for your help @d-ataman !

I've now got all the preprocessing done, and preprocess.sh runs successfully. However, I get several odd errors about tensor shapes / missing attributes (for details, see here).

Currently I'm using torchtext 0.2.1 (based on this) and have tried PyTorch 0.3.1, 0.4.0, and 1.4.0, without any luck.

Therefore, I was wondering: which PyTorch (and torchtext) versions is the codebase based on?


d-ataman commented on July 28, 2024

Hi @j0ma ,

I think the last time I ran the code it was with torch 0.4.1.
Hope it works!
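
For anyone reading later, pinning the versions mentioned in this thread would look something like the sketch below; note that wheels for such old releases may no longer be available for recent Python versions:

```bash
# torch 0.4.1 and torchtext 0.2.1, as mentioned in this thread; an older
# Python (e.g. 3.6) may be required for these wheels to resolve.
pip install torch==0.4.1 torchtext==0.2.1
```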

Duygu
