Comments (11)
Hi Lucas! We're actually using sequence-level distillation, so we take the output from a well-trained autoregressive model and use that as the ground truth target.
Have a look at https://github.com/nyu-dl/dl4mt-nonauto/blob/master/run.py#L327-L328: we save the AR output to a separate file and load from that file instead of the actual ground truth.
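For concreteness, here is a minimal sketch of what the data side of sequence-level distillation looks like. The file names and the helper function are hypothetical, purely for illustration; the repo's actual loading logic lives around the run.py lines linked above.

```python
# Hypothetical sketch: sequence-level knowledge distillation swaps the
# gold references for the autoregressive teacher's beam-search output.
# File names and this helper are illustrative, not from the repo.

def load_parallel(src_path, tgt_path):
    """Pair up source and target sentences, one per line in each file."""
    with open(src_path) as fs, open(tgt_path) as ft:
        return [(s.strip(), t.strip()) for s, t in zip(fs, ft)]

# Standard training: sources paired with human references.
# pairs = load_parallel("train.src", "train.ref")

# Sequence-level distillation: the same sources paired with the
# teacher's beam-search translations, saved to a separate file.
# pairs = load_parallel("train.src", "train.teacher_beam")
```

The student is then trained exactly as before; only the target file changes.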
from dl4mt-nonauto.
Yeah that's right, we used output from the autoregressive transformer (with beam search).
yeah, we're not the only ones :)
https://arxiv.org/pdf/1711.10433.pdf
https://arxiv.org/pdf/1711.02281.pdf
https://arxiv.org/pdf/1902.03249.pdf
Wow, big thanks for the share! I've been working for the past two months on my version of the insertion transformer (minus the distillation). Glad to be aware of this now rather than in two months :p
I actually played with self-distillation over the summer on WMT'14, that is, using the output of the non-autoregressive model at iteration 20 as ground truth once the model had more or less converged (exactly what @jasonleeinf described).
I found that it improved the BLEU score significantly for the first 1-3 iterations of non-autoregressive decoding, but at later iterations (like iteration 20) the results only improved by a very small margin, about 0.5 BLEU.
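The self-distillation loop described above could be sketched roughly as follows. The model interface (`initial_guess`, `refine`) is hypothetical, standing in for whatever iterative refinement decoder is actually used:

```python
# Hypothetical sketch of self-distillation with iterative refinement.
# `initial_guess` and `refine` are illustrative names, not the repo's API.

def make_self_distilled_targets(model, sources, n_iters=20):
    """Decode each source with n_iters refinement steps and treat the
    final output as the new 'ground truth' for further training."""
    pairs = []
    for src in sources:
        hyp = model.initial_guess(src)       # first non-autoregressive pass
        for _ in range(n_iters - 1):
            hyp = model.refine(src, hyp)     # one refinement iteration
        pairs.append((src, hyp))
    return pairs
```

Training on these pairs mainly teaches the early iterations to mimic the final one, which would explain gains at iterations 1-3 but little change at iteration 20.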
Thanks for the link! So the target is still a one-hot distribution (resulting from beam search with the teacher instead of the actual target)?
The fact that you guys get a significant increase in performance from this is quite cool!
What are your thoughts on self-distillation? Did you find that it helped the model converge to a better solution?
do you mean taking the output from the non-autoregressive model and feeding it back to itself as a training example? we haven't tried that, but i doubt it'll work better than using the output from an AR model
Cool thanks! I was curious because it was in the code, but not in paper :p
self-distillation exploits some weird flaw of our approximate decoding. if we used e.g. sampling from the model distribution, in expectation it really shouldn't do much: see https://twitter.com/DeepSpiker/status/1096539984738377734
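One way to make the expectation argument precise: if the self-distillation targets were exact samples from the model's own distribution, the expected gradient of the log-likelihood objective vanishes, so on average training on them changes nothing:

```latex
\mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[\nabla_\theta \log p_\theta(y \mid x)\right]
  = \sum_y p_\theta(y \mid x)\,\frac{\nabla_\theta p_\theta(y \mid x)}{p_\theta(y \mid x)}
  = \nabla_\theta \sum_y p_\theta(y \mid x)
  = \nabla_\theta 1
  = 0
```

Any benefit therefore has to come from the gap between approximate decoding (beam search, greedy, iterative refinement) and true sampling.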
Interesting discussion! But does it apply here as well? I would assume that the distribution after 1 vs. 20 iterations is different, so (even in expectation) self-distillation could provide useful gradients.
Plus, the weights of the model at iteration 1 and at iterations > 1 are not shared.