
Comments (13)

Eureka6174 commented on May 14, 2024

Sorry for the late reply.

There are output layers in both pre-training and fine-tuning, but they are different. For different fine-tuning tasks, the simplest approach is to use a different output layer per task. For now, we just use linear layers as output layers; other options are still open to try.

For the BPE problem, we only produce one label for the first subword "materi@@". We chose this option simply to follow BERT; we didn't try other options. If you have a better choice, could you also tell us?
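
As an aside, here is a minimal sketch of this kind of setup, assuming a HuggingFace-style XLM-R encoder, a hypothetical 17-tag tagset, and hypothetical variable names (this is not the repo's actual fine-tuning code):

import torch.nn as nn
from transformers import AutoTokenizer, XLMRobertaModel

# Hypothetical sketch: one task-specific linear output layer on top of the
# shared encoder, predicting a tag per word at that word's FIRST subword.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # fast tokenizer by default
encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
num_tags = 17  # placeholder tagset size, e.g. Universal POS tags
tagger_head = nn.Linear(encoder.config.hidden_size, num_tags)

words = ["The", "material", "hardens"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
hidden = encoder(**enc).last_hidden_state        # (1, seq_len, hidden_size)
logits = tagger_head(hidden)                     # (1, seq_len, num_tags)

# Read off predictions only at the first subword of every word.
first_positions, seen = [], set()
for pos, word_id in enumerate(enc.word_ids()):   # None for special tokens
    if word_id is not None and word_id not in seen:
        seen.add(word_id)
        first_positions.append(pos)
tags = logits[0, first_positions].argmax(-1)     # one predicted tag per word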

Please feel free to contact us if this doesn't fully answer your question.


thomas-happify commented on May 14, 2024

@Eureka6174
Hi there!

So for XLM-R text generation, what exactly is the decoder?
Is it also a simple linear layer, or do you initialize another XLM-R as the decoder?


Eureka6174 commented on May 14, 2024

The decoder consists of Transformer layers. Its masked self-attention layers are initialized from XLM-R, while the attention from the decoder to the encoder (cross-attention) is randomly initialized.
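
As a rough illustration of that initialization scheme, here is a simplified sketch; the layer layout and names are hypothetical and do not match the real XLM-R/fairseq modules:

import torch.nn as nn

class DecoderLayer(nn.Module):
    """Simplified Transformer decoder layer: masked self-attention,
    encoder-decoder cross-attention, and a feed-forward block."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)   # to be copied from XLM-R
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads)  # left randomly initialized
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

def init_decoder_from_xlmr(decoder_layers, xlmr_encoder_layers):
    # Copy each encoder layer's self-attention (and, as an assumption here, its
    # FFN) into the matching decoder layer; cross-attention keeps its random init.
    for dec, enc in zip(decoder_layers, xlmr_encoder_layers):
        dec.self_attn.load_state_dict(enc.self_attn.state_dict())
        dec.ffn.load_state_dict(enc.ffn.state_dict())
        # dec.cross_attn is intentionally untouched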


thomas-happify commented on May 14, 2024

@Eureka6174
Thanks! That makes sense.

BTW, is the pre-training code in this repo as well? I only see generation_from_pretrained_xlmr.py, but I don't think that's the one, right? I'm really interested in how you further pre-trained the model with the decoder.

Thanks!


ever4244 commented on May 14, 2024

Thanks for the answer. I just wonder whether it would be better to choose the last subword instead of the first subword as the position of the POS tag.

Some more questions:

Q1:
I notice that there is a model_type flag provided during testing to indicate which pre-trained model is used.

If I trained my own model using fairseq-train (for example, a classic NMT Transformer model), how do I use it in the POS tagging and NER evaluation tasks? It would have different dimensions and layers compared to the pre-trained XLM and BERT models you provided.

Can I just declare it as an XLM model and run the NER evaluation code (since they share the same encoder architecture)?

Q2: Is there any example or code for pre-training from scratch?
I am currently just trying to use the multilingual translation script in generation/example to pre-train a model, but there are many different pre-training tasks in your paper. I understand that you fine-tune the existing XLM-R models using language modeling. I just wonder whether there is a pre-training example that trains a model from scratch, so that I can change the size and dimensions with more flexibility.

Q3:
I see that in the "generation" folder the Unicoder X_dae model is fairseq-based.
In the "understanding" folder, the pre-trained model is HuggingFace Transformers-based.
So can I use them interchangeably? For example, if I train/fine-tune a model in the generation folder with fairseq, can I move it to the understanding folder and test the model there?
It seems to me that the fairseq-trained model is saved as xxx.pt, while the HuggingFace Transformers model is saved as pytorch_model.bin and config.json. So I am puzzled about how to use one encoder for both generation and understanding tasks.
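
For context, a hedged sketch of the difference between the two formats; the paths are placeholders and the checkpoint keys shown are only the typical fairseq layout:

import torch
from transformers import AutoModel

# A fairseq checkpoint is one torch-serialized dict; the weights typically sit
# under the "model" key, with fairseq-specific parameter names.
ckpt = torch.load("checkpoints/checkpoint_best.pt", map_location="cpu")  # placeholder path
print(sorted(ckpt.keys()))        # usually includes "model" (plus "args"/"cfg", etc.)
print(list(ckpt["model"])[:5])    # a few fairseq parameter names

# A HuggingFace model is a directory (config.json + pytorch_model.bin) loaded via
# from_pretrained(); the formats are not interchangeable without a conversion step
# that maps the parameter names onto the HuggingFace architecture.
hf_model = AutoModel.from_pretrained("path/to/converted_hf_model")  # placeholder path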

Thank you very much!


Eureka6174 commented on May 14, 2024

We tried the last token, but it got similar results to the first token.
Q1: You could use the code from HuggingFace to convert it to HuggingFace format and then run POS tagging and NER.
Q2: Our pre-training scripts are not ready for release for now.
Q3: You need to convert the model with the HuggingFace code.

Thanks!


ever4244 commented on May 14, 2024

We tried the last token, but it got similar results to the first token.
Q1: You could use the code from HuggingFace to convert it to HuggingFace format and then run POS tagging and NER.
Q2: Our pre-training scripts are not ready for release for now.
Q3: You need to convert the model with the HuggingFace code.

Thanks!

Thank you for the timely response!
Can you elaborate on Q1, or give me a link for "use the code from HuggingFace to convert it to HuggingFace format"?
Are you referring to this one?
https://github.com/stas00/porting/tree/master/transformers/fairseq-wmt19
I am not sure whether this works only for a standard model structure, or whether it can also convert a model with a different structure. My model may have a different size, different dimensions, or occasionally even different attention connections; can I use it to convert from xxx.pt to pytorch_model.bin?

BTW:
I trained a model using fairseq-train, but I found that in the generation folder your pre-trained model comes with a "sentencepiece.bpe.model". I don't get this file when compiling the data into BPE; I just get "check.pt" and "dict.txt". In which step do you obtain sentencepiece.bpe.model?

Thank you very much!


Eureka6174 commented on May 14, 2024

Here is the link: https://github.com/huggingface/transformers/blob/master/src/transformers/models/roberta/convert_roberta_original_pytorch_checkpoint_to_pytorch.py

I think you could read the HuggingFace Transformers documentation first.

Thanks,
Yaobo
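
As a rough illustration of how such a conversion is usually driven and then consumed (the flag names come from the linked script, so double-check them against your installed transformers version; the paths and label count are placeholders):

# The conversion is a one-off command along the lines of:
#
#   python convert_roberta_original_pytorch_checkpoint_to_pytorch.py \
#       --roberta_checkpoint_path /path/to/fairseq_checkpoint_dir \
#       --pytorch_dump_folder_path /path/to/hf_model_dir
#
# Afterwards, the dumped directory loads like any other HuggingFace model:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "/path/to/hf_model_dir",   # placeholder: wherever the converter wrote its output
    num_labels=17,             # placeholder tagset size for POS/NER fine-tuning
)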


ever4244 commented on May 14, 2024

Thanks for the link, but I have modified my question, so let me restate it a bit.

Q1:
I also found a link:
https://github.com/stas00/porting/tree/master/transformers/fairseq-wmt19
But both your link and mine seem to cover conversion of a standard model structure. What if I have a different model structure?
My model may have a different size, different dimensions, or occasionally even different attention connections; can I use it to convert from xxx.pt to pytorch_model.bin?
My current pre-trained model is a Transformer NMT model with 6 encoder layers and 1 decoder layer. Can I use the RoBERTa converter, given that BERT and NMT share a similar encoder but a different decoder? I suppose there is no universal converter between fairseq and HuggingFace for arbitrary model structures?

I am sorry, I have been using fairseq a lot and am new to HuggingFace. I will read more of its documentation.


ever4244 commented on May 14, 2024

I trained a model using fairseq-train without SPM, but I found that I need a "sentencepiece.bpe.model" for later tasks.

I used a script similar to the fairseq translation example, prepare-wmt14en2de.sh, which does not generate a SentencePiece model, and I prepared the data and trained the old model with it.

https://github.com/pytorch/fairseq/tree/master/examples/translation

The one that does produce a SentencePiece model is prepare-iwslt17-multilingual.sh:

python "$SPM_TRAIN" \
    --input=$TRAIN_FILES \
    --model_prefix=$DATA/sentencepiece.bpe \
    --vocab_size=$BPESIZE \
    --character_coverage=1.0 \
    --model_type=bpe

I currently want to re-learn the sentencepiece.bpe.model on the training data from prepare-wmt14en2de.sh.

Since I already trained the model without the sentencepiece.bpe.model, I just want to make sure I can get exactly the same training data when I reapply the SPM learning script to the old data, so that my previously trained model from prepare-wmt14en2de.sh can be coupled with the newly learnt sentencepiece.bpe.model.

However,
prepare-wmt14en2de.sh uses fastBPE's learn_bpe.py:
https://github.com/glample/fastBPE
prepare-iwslt17-multilingual.sh uses SentencePiece's spm_train.py:
https://github.com/google/sentencepiece

They use different code for BPE learning and encoding, and they even use different BPE markers (@@ as a continuation marker in fastBPE vs. ▁ as a word-boundary marker in SentencePiece). So how can I create a sentencepiece.bpe.model that can be used together with my old fastBPE-based model? (That is, a sentencepiece.bpe.model that will produce exactly the same training data as fastBPE.)
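
To make the mismatch concrete, a small illustrative sketch follows; the example splits are made up, and real segmentations depend on the learned merge tables:

# fastBPE marks NON-FINAL subwords with a trailing "@@":
#     "material" -> "materi@@ al"
# SentencePiece (BPE mode) marks WORD-INITIAL pieces with a leading "▁":
#     "material" -> "▁materi al"    (the actual split depends on the vocab)
fastbpe_tokens = ["materi@@", "al"]
spm_tokens = ["\u2581materi", "al"]   # i.e. "▁materi", "al"

def detok_fastbpe(tokens):
    # Undo fastBPE: join on spaces, then drop the "@@ " continuation marker.
    return " ".join(tokens).replace("@@ ", "")

def detok_spm(tokens):
    # Undo SentencePiece: concatenate, then turn "▁" back into spaces.
    return "".join(tokens).replace("\u2581", " ").strip()

# Both round-trip to the same surface text, but the token sequences (and hence
# any model trained on them) are not interchangeable.
assert detok_fastbpe(fastbpe_tokens) == detok_spm(spm_tokens) == "material"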

Thank you very much!


Eureka6174 commented on May 14, 2024

I think your questions are more about fairseq and HuggingFace, which are out of my knowledge. My model doesn't have a different structure or a different SentencePiece model. Maybe you should raise an issue in their GitHub repos.


ever4244 commented on May 14, 2024

I think your questions are more about fairseq and HuggingFace, which are out of my knowledge. My model doesn't have a different structure or a different SentencePiece model. Maybe you should raise an issue in their GitHub repos.

Thanks.
I have some questions on preprocessing as well.

I want my pre-training replication to be as close as possible to your model, so that there won't be a performance loss due to differences in text pre-processing between pre-training and testing. So I want to make sure that I follow your pre-processing procedure for the training data.

In prepare-wmt14en2de.sh, several Moses scripts are used to tokenize and clean the corpus, for example:
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl

https://github.com/pytorch/fairseq/tree/master/examples/translation

In prepare-iwslt17-multilingual.sh, this preprocessing is not used, since SentencePiece can be applied to raw text.

So what is your pre-processing procedure for pre-training? I want to use the same tokenization and normalization as your model, in both pre-training and fine-tuning.
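
For reference, the Moses perl tools listed above have a Python counterpart in the sacremoses package; the sketch below only mirrors what those scripts do and is not Unicoder's pre-processing:

# Illustrative only: a Python counterpart of the Moses perl tools above, using
# the sacremoses package (pip install sacremoses). This mirrors the
# prepare-wmt14en2de.sh style of pre-processing.
from sacremoses import MosesPunctNormalizer, MosesTokenizer

normalizer = MosesPunctNormalizer(lang="en")   # ~ normalize-punctuation.perl
tokenizer = MosesTokenizer(lang="en")          # ~ tokenizer.perl

line = "He said: \u00abhello   world\u00bb ..."
line = normalizer.normalize(line)                   # normalize punctuation
tokens = tokenizer.tokenize(line, return_str=True)  # Moses-style tokenization
print(tokens)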


Eureka6174 commented on May 14, 2024

I'm using just raw text, with no extra pre-processing. I didn't try the tokenizers you mentioned because they are different for different languages. If you would like to give them a try, I would appreciate it if you could share your results with us, whether they work or not.
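
In that raw-text setup, the only segmentation applied is the SentencePiece model itself; a quick sketch, using the public xlm-roberta-base tokenizer purely for illustration:

# Raw text goes straight into SentencePiece: no Moses tokenization or punctuation
# normalization beforehand. The repo's own sentencepiece.bpe.model would be loaded
# the same way through its tokenizer class.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
pieces = tok.tokenize("Unicoder can read raw, untokenized text.")
print(pieces)  # '▁'-prefixed subword pieces; the exact split depends on the vocab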

