Comments (11)

jaseleephd commented on July 20, 2024

Hi Lucas! We're actually using sequence-level distillation, so we take the output from a well-trained autoregressive model and use that as the ground truth target.

Have a look at https://github.com/nyu-dl/dl4mt-nonauto/blob/master/run.py#L327-L328: we save the AR output to a separate file and load from that file instead of the actual ground truth.
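In code, the data preparation boils down to swapping the target side of the parallel corpus for the teacher's decoded output. A minimal sketch of the idea (all names here are hypothetical, for illustration only; the real pipeline lives in run.py linked above):

```python
# Sequence-level knowledge distillation, data-prep sketch.
# Step 1: decode the training sources with a trained autoregressive teacher.
# Step 2: train the non-autoregressive student on (source, teacher_output)
#         pairs instead of (source, gold_reference) pairs.

def distill_corpus(teacher_decode, sources):
    """Replace gold references with the teacher's decoded outputs."""
    return [(src, teacher_decode(src)) for src in sources]

# Toy "teacher" that translates by reversing tokens, purely for illustration.
toy_teacher = lambda src: list(reversed(src))

pairs = distill_corpus(toy_teacher, [["a", "b", "c"], ["x", "y"]])
print(pairs)
```

In the actual setup the decoded targets are written to disk once (beam-search decoding of the full training set is expensive) and the training loader simply reads them in place of the references.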

from dl4mt-nonauto.

jaseleephd commented on July 20, 2024

Yeah that's right, we used output from the autoregressive transformer (with beam search).

yeah, we're not the only ones :)
https://arxiv.org/pdf/1711.10433.pdf
https://arxiv.org/pdf/1711.02281.pdf
https://arxiv.org/pdf/1902.03249.pdf

pclucas14 commented on July 20, 2024

Wow, big thanks for the share! I've been working for the past two months on my version of the insertion transformer (minus the distillation). Glad to be aware of this now rather than in two months :p

mansimov commented on July 20, 2024

I actually played with self-distillation over the summer on WMT'14, i.e. using the output of the non-autoregressive model at iteration 20 as the ground truth once the model had more or less converged (exactly what @jasonleeinf described).

I found that it improved the BLEU score significantly in the first 1-3 iterations of non-autoregressive decoding, but at later iterations (like iteration 20) the results only improved by a small margin of about 0.5 BLEU.
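The self-distillation variant described above can be sketched as follows (hypothetical names, not the repo's actual API): the non-autoregressive model refines its output over decoding iterations, and the near-converged output becomes the new training target.

```python
def iterative_refine(model_step, src, y, n_iters=20):
    """Apply the non-autoregressive refinement step n_iters times."""
    for _ in range(n_iters):
        y = model_step(src, y)
    return y

def self_distill_targets(model_step, sources, init_fn, n_iters=20):
    """Replace gold targets with the model's own near-converged output."""
    return [(src, iterative_refine(model_step, src, init_fn(src), n_iters))
            for src in sources]

# Toy refinement step for illustration: each iteration sorts the tokens, so
# the output converges to a fixed point (a stand-in for "iteration 20").
toy_step = lambda src, y: sorted(y)
targets = self_distill_targets(toy_step, [[3, 1, 2]], init_fn=lambda s: list(s))
print(targets)
```

The reported pattern (large gains at iterations 1-3, small gains at iteration 20) is consistent with the early refinement steps learning to imitate the converged output directly.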

pclucas14 commented on July 20, 2024

Thanks for the link! So the target is still a one-hot distribution (resulting from beam search on the teacher, instead of the actual target)?
The fact that you guys get a significant increase in performance from this is quite cool!
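For context, the difference between soft (word-level) and one-hot (sequence-level) distillation targets can be sketched in plain Python (helper names are hypothetical; no framework assumed):

```python
import math

def cross_entropy(target_dist, log_probs):
    """Cross-entropy between a target distribution and student log-probs."""
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))

# Student's predicted log-probabilities over a toy 3-word vocabulary.
student_log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]

# Word-level KD: the target is the teacher's full softmax (soft labels).
soft_target = [0.6, 0.3, 0.1]

# Sequence-level KD (as used here): the target is one-hot on the token the
# teacher actually emitted under beam search.
hard_target = [1.0, 0.0, 0.0]

print(cross_entropy(soft_target, student_log_probs))
print(cross_entropy(hard_target, student_log_probs))
```

With the one-hot target the loss reduces to ordinary maximum likelihood on the teacher's output, which is exactly the sequence-level distillation setup confirmed above.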

pclucas14 commented on July 20, 2024

What are your thoughts on self-distillation? Did you find that it helped the model converge to a better solution?

jaseleephd commented on July 20, 2024

Do you mean taking the output from the non-autoregressive model and feeding it back to itself as a training example? We haven't tried that, but I doubt it'll work better than using the output from an AR model.

pclucas14 commented on July 20, 2024

Cool, thanks! I was curious because it was in the code, but not in the paper :p

kyunghyuncho commented on July 20, 2024

Self-distillation exploits some weird flaw of our approximate decoding. If we used e.g. sampling from the model distribution, in expectation it really shouldn't do much: see https://twitter.com/DeepSpiker/status/1096539984738377734
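The expectation argument can be made precise. If self-distillation targets were sampled from the model's own distribution $p_\theta$, the expected gradient of the log-likelihood training objective would vanish, because the score function has zero mean (a standard identity, sketched here):

```latex
\mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[\nabla_\theta \log p_\theta(y \mid x)\right]
= \sum_y p_\theta(y \mid x)\,\frac{\nabla_\theta p_\theta(y \mid x)}{p_\theta(y \mid x)}
= \nabla_\theta \sum_y p_\theta(y \mid x)
= \nabla_\theta 1 = 0
```

So any gain from self-distillation must come from the targets *not* being samples from $p_\theta$: approximate decoding (beam search, or truncated iterative refinement) produces a systematically different distribution, which is the "flaw" being exploited.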

pclucas14 commented on July 20, 2024

Interesting discussion! But does this apply here as well? I would assume that the distributions after 1 vs. 20 iterations are different, so (even in expectation) self-distillation could provide useful gradients.

mansimov commented on July 20, 2024

Plus, the weights of the model at iteration 1 and at iterations > 1 are not shared.
