
vita-group / bert-tickets

137 stars · 19 forks · 3.37 MB

[NeurIPS 2020] "The Lottery Ticket Hypothesis for Pre-trained BERT Networks", Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin

License: MIT License

Languages: Python 98.59%, Shell 0.08%, Makefile 0.02%, Dockerfile 0.06%, CSS 0.09%, JavaScript 0.25%, Jupyter Notebook 0.91%
Topics: bert, lottery-ticket-hypothesis, lottery-tickets, pre-training, universal-embeddings

bert-tickets's People

Contributors

tianlong-chen


bert-tickets's Issues

Transformer Version

May I know which version of the transformers library was used for this work? I couldn't find it in the project README, and the latest version raises many errors. Thanks.

Duplicated code ¿?

Hello,

Thanks for this repo. I am trying to extract the GraSP code from:
https://github.com/VITA-Group/BERT-Tickets/blob/master/pretrain_grasp.py

But the function called at https://github.com/VITA-Group/BERT-Tickets/blob/master/pretrain_grasp.py#L431 has no import or definition anywhere in the .py file. I have not tried to run the file, but if the function is neither declared nor imported, I imagine it will not work.

Here the imports are commented out. Should I uncomment that line to import pruning_model_custom, see_weight_rate, and pruning_model?

This is the reason I don't like wildcard imports with *: you tend to lose track of where each name comes from. :(
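As an aside on the wildcard point, a small sketch of the trade-off, using only public transformers names; the repo's own helpers would of course have to be imported from whichever local file actually defines them:

# With a wildcard import the origin of each name is invisible at the call
# site, so a missing definition only surfaces as a NameError at runtime:
# from transformers import *

# An explicit import keeps the trace and lets linters flag missing names:
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig()                          # default config, random init
model = BertForSequenceClassification(config)  # just to show the names resolve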

Thanks in advance.
Best regards

Rewinding doesn't work

I'm afraid the rewinding doesn't work, because the code saves the weights with a plain assignment rather than torch.clone(). The saved reference therefore still points at the live tensors, so the original weights are lost and get optimised during training.
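To illustrate the concern, here is a minimal sketch of aliasing versus cloning when saving weights for rewinding; a toy nn.Linear stands in for BERT, and none of the names below come from the repo:

import torch
from torch import nn

model = nn.Linear(4, 2)  # toy stand-in for the pre-trained encoder

# Plain assignment only stores references to the live parameter tensors, so
# every optimizer step also mutates this "saved" state and rewinding to it
# later restores nothing.
aliased = {n: p for n, p in model.named_parameters()}

# Detached clones take a real snapshot of the pre-trained values.
snapshot = {n: p.detach().clone() for n, p in model.named_parameters()}

# ... train / prune ...

# Rewind: copy the snapshot back into the (pruned) model's parameters.
with torch.no_grad():
    for n, p in model.named_parameters():
        p.copy_(snapshot[n])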

RuntimeError: CUDA out of memory

Hello,

I have extracted the GraSP algorithm from here:
https://github.com/VITA-Group/BERT-Tickets/blob/master/transformers-master/examples/GraSP.py

I am getting CUDA out-of-memory errors regardless of my architecture's size. I have migrated your version of transformers to the current one.

Here is my trainer.py script:
https://gist.github.com/gaceladri/f301f33779785c401c8c1e549bcc1144

I have simply adapted your pretrain_grasp.py to the trainer.py script on Transformers 3.3.1.

The error occurs at the backward pass in the second iteration:

z.backward()

Could it be that we are doing the backward pass with the main model loaded and we are duplicating the weights?
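For context, a minimal sketch of the double-backward pattern that GraSP-style scoring relies on and why it is memory-hungry; the toy model, shapes, and variable names are illustrative, not taken from the repo:

import torch
from torch import nn

# Toy stand-in for the network being scored.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 2))
x, y = torch.randn(8, 128), torch.randint(0, 2, (8,))
weights = [p for p in model.parameters() if p.dim() > 1]

loss = nn.functional.cross_entropy(model(x), y)

# First backward with create_graph=True keeps the full forward *and*
# backward graph alive so we can differentiate through the gradients.
grads = torch.autograd.grad(loss, weights, create_graph=True)

# Differentiating the inner product <g, stop_grad(g)> gives a Hessian-vector
# product Hg, which lands in p.grad after z.backward().
z = sum((g * g.detach()).sum() for g in grads)
z.backward()

# GraSP-style saliency: -w * Hg elementwise (sign conventions vary by repo).
scores = [-(p * p.grad).detach() for p in weights]

Because the graph from the first backward must be retained until z.backward(), peak memory is noticeably higher than for a normal training step, so the scoring batch usually has to be much smaller than the fine-tuning batch. An OOM that only appears at the second iteration would also be consistent with something (e.g. the grads or the loss) being kept across iterations, which retains the previous iteration's graph as well.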

Thanks

Sparse models available for download?

Hello :-)

I found your paper very interesting and would like to play with the sparse models that it produced. Are you going to open-source them soon?

[Question] the learning rate scheduler may not reach zero, right?

The implementation doesn't seem to follow the description in the paper:

Learning rate decays linearly from initial value to zero

When running the LT_glue.py command shown in the README, the script sets --num_train_epochs to 30, which means training each pruned subnetwork takes 3 epochs (so the [0%, 10%, ..., 90%]-pruned subnetworks take 30 epochs in total).
However, the learning rate scheduler's horizon depends on --num_train_epochs (30) rather than 3:

t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
...
scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)

So, the last learning rate after 3 epochs is 1.8e-5 (= 2e-5 * (1 - 3/30)), rather than zero, if we start from 2e-5.
Is my understanding correct?
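One way to check this numerically, assuming (purely for illustration) about 1,000 optimizer steps per epoch and no warmup; only the ratio of steps taken to t_total matters:

import torch
from transformers import get_linear_schedule_with_warmup

steps_per_epoch = 1000           # hypothetical; depends on dataset and batch size
t_total = steps_per_epoch * 30   # scheduler horizon, as in LT_glue.py
steps_run = steps_per_epoch * 3  # steps actually taken per subnetwork

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=t_total
)

for _ in range(steps_run):
    optimizer.step()
    scheduler.step()

print(scheduler.get_last_lr())   # roughly [1.8e-05] = 2e-5 * (1 - 3/30), not 0

Setting num_training_steps to each round's 3-epoch budget instead would let the schedule actually reach zero, as the paper's description suggests.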

Of course, this minor difference doesn't detract from the great contributions of the work. I just want to confirm it and pin down the detail for reproduction and my own further research. Thanks!
