Comments (11)
Hi Lucas! We're actually using sequence-level distillation, so we take the output from a well-trained autoregressive model and use that as the ground truth target.
Have a look at https://github.com/nyu-dl/dl4mt-nonauto/blob/master/run.py#L327-L328: we save the AR output to a separate file and load from that file instead of the actual ground truth.
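For concreteness, here is a minimal sketch of what the data side of sequence-level distillation looks like. The file names and the helper function are hypothetical, purely for illustration; the repo's actual loading logic lives around the run.py lines linked above.

```python
# Hypothetical sketch: sequence-level knowledge distillation swaps the
# gold references for the autoregressive teacher's beam-search output.
# File names and this helper are illustrative, not from the repo.

def load_parallel(src_path, tgt_path):
    """Pair up source and target sentences, one per line in each file."""
    with open(src_path) as fs, open(tgt_path) as ft:
        return [(s.strip(), t.strip()) for s, t in zip(fs, ft)]

# Standard training: sources paired with human references.
# pairs = load_parallel("train.src", "train.ref")

# Sequence-level distillation: the same sources paired with the
# teacher's beam-search translations, saved to a separate file.
# pairs = load_parallel("train.src", "train.teacher_beam")
```

The student is then trained exactly as before; only the target file changes.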
from dl4mt-nonauto.
Yeah that's right, we used output from the autoregressive transformer (with beam search).
yeah, we're not the only ones :)
https://arxiv.org/pdf/1711.10433.pdf
https://arxiv.org/pdf/1711.02281.pdf
https://arxiv.org/pdf/1902.03249.pdf
Wow, big thanks for the share! I've been working for the past two months on my version of the insertion transformer (minus the distillation). Glad to be aware of this now rather than in two months :p
I actually played with self-distillation over the summer on WMT'14, that is, using the output of the non-autoregressive model at iteration 20 as ground truth once the model had more or less converged (exactly what @jasonleeinf described).
I found that it improved the BLEU score significantly for the first 1-3 iterations of non-autoregressive decoding, but at later iterations (like iteration 20) the results only improved by a very small margin, about 0.5 BLEU.
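The self-distillation loop described above could be sketched roughly as follows. The model interface (`initial_guess`, `refine`) is hypothetical, standing in for whatever iterative refinement decoder is actually used:

```python
# Hypothetical sketch of self-distillation with iterative refinement.
# `initial_guess` and `refine` are illustrative names, not the repo's API.

def make_self_distilled_targets(model, sources, n_iters=20):
    """Decode each source with n_iters refinement steps and treat the
    final output as the new 'ground truth' for further training."""
    pairs = []
    for src in sources:
        hyp = model.initial_guess(src)       # first non-autoregressive pass
        for _ in range(n_iters - 1):
            hyp = model.refine(src, hyp)     # one refinement iteration
        pairs.append((src, hyp))
    return pairs
```

Training on these pairs mainly teaches the early iterations to mimic the final one, which would explain gains at iterations 1-3 but little change at iteration 20.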
Thanks for the link! So the target is still a one-hot distribution (resulting from beam search with the teacher instead of the actual target)?
The fact that you guys get a significant increase in performance from this is quite cool!
What are your thoughts on self-distillation? Did you find that it helped the model converge to a better solution?
do you mean taking the output from the non-autoregressive model and feeding it back to itself as a training example? we haven't tried that, but i doubt it'll work better than using the output from an AR model
Cool thanks! I was curious because it was in the code, but not in paper :p
self-distillation exploits some weird flaw of our approximate decoding. if we used e.g. sampling from the model distribution, in expectation it really shouldn't do much: see https://twitter.com/DeepSpiker/status/1096539984738377734
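One way to make the expectation argument precise: if the self-distillation targets were exact samples from the model's own distribution, the expected gradient of the log-likelihood objective vanishes, so on average training on them changes nothing:

```latex
\mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[\nabla_\theta \log p_\theta(y \mid x)\right]
  = \sum_y p_\theta(y \mid x)\,\frac{\nabla_\theta p_\theta(y \mid x)}{p_\theta(y \mid x)}
  = \nabla_\theta \sum_y p_\theta(y \mid x)
  = \nabla_\theta 1
  = 0
```

Any benefit therefore has to come from the gap between approximate decoding (beam search, greedy, iterative refinement) and true sampling.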
Interesting discussion! But does it apply here as well? I would assume that the distribution after 1 vs. 20 iterations is different, so (even in expectation) self-distillation could provide useful gradients.
Plus, the weights of the model at iteration 1 and at iterations > 1 are not shared.