
vita-group / bert-tickets

137 stars · 19 forks · 3.37 MB

[NeurIPS 2020] "The Lottery Ticket Hypothesis for Pre-trained BERT Networks", Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin

License: MIT License

Languages: Python 98.59%, Shell 0.08%, Makefile 0.02%, Dockerfile 0.06%, CSS 0.09%, JavaScript 0.25%, Jupyter Notebook 0.91%
Topics: bert, lottery-ticket-hypothesis, lottery-tickets, pre-training, universal-embeddings

bert-tickets's People

Contributors

tianlong-chen


bert-tickets's Issues

Transformer Version

May I know which version of the transformers library was used for this work? I couldn't find it in the project README, and the latest version raises many errors. Thanks.

Duplicated code ¿?

Hello,

Thanks for this repo. I am trying to extract the GraSP code from:
https://github.com/VITA-Group/BERT-Tickets/blob/master/pretrain_grasp.py

But the function called at https://github.com/VITA-Group/BERT-Tickets/blob/master/pretrain_grasp.py#L431 has no import or definition anywhere in the .py file. I have not tried to run the file, but if the function is neither declared nor imported, I imagine it will not work.

Here the imports are commented out. Should I uncomment that line to import pruning_model_custom, see_weight_rate, and pruning_model?

This is the reason I don't like wildcard imports with *: you tend to lose track of where each name comes from. :(
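As an aside on the wildcard point, a small sketch of the trade-off, using only public transformers names; the repo's own helpers would of course have to be imported from whichever local file actually defines them:

# With a wildcard import the origin of each name is invisible at the call
# site, so a missing definition only surfaces as a NameError at runtime:
# from transformers import *

# An explicit import keeps the trace and lets linters flag missing names:
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig()                          # default config, random init
model = BertForSequenceClassification(config)  # just to show the names resolve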

Thanks in advance.
Best regards

Rewinding doesn't work

I'm afraid the rewinding doesn't work, because the code saves the weights with a plain assignment rather than torch.clone(). The saved reference therefore still points at the live tensors, so the original weights are lost and get optimised during training.
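To illustrate the concern, here is a minimal sketch of aliasing versus cloning when saving weights for rewinding; a toy nn.Linear stands in for BERT, and none of the names below come from the repo:

import torch
from torch import nn

model = nn.Linear(4, 2)  # toy stand-in for the pre-trained encoder

# Plain assignment only stores references to the live parameter tensors, so
# every optimizer step also mutates this "saved" state and rewinding to it
# later restores nothing.
aliased = {n: p for n, p in model.named_parameters()}

# Detached clones take a real snapshot of the pre-trained values.
snapshot = {n: p.detach().clone() for n, p in model.named_parameters()}

# ... train / prune ...

# Rewind: copy the snapshot back into the (pruned) model's parameters.
with torch.no_grad():
    for n, p in model.named_parameters():
        p.copy_(snapshot[n])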

RuntimeError: CUDA out of memory

Hello,

I have extracted the GraSP algorithm from here:
https://github.com/VITA-Group/BERT-Tickets/blob/master/transformers-master/examples/GraSP.py

I am getting CUDA out-of-memory errors regardless of my architecture's size. I have migrated your version of transformers to the current one.

Here is my trainer.py script:
https://gist.github.com/gaceladri/f301f33779785c401c8c1e549bcc1144

I have simply adapted your pretrain_grasp.py to the trainer.py script on Transformers 3.3.1.

The error occurs at the backward pass in the second iteration:

z.backward()

Could it be that we are doing the backward pass with the main model loaded and we are duplicating the weights?
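For context, a minimal sketch of the double-backward pattern that GraSP-style scoring relies on and why it is memory-hungry; the toy model, shapes, and variable names are illustrative, not taken from the repo:

import torch
from torch import nn

# Toy stand-in for the network being scored.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 2))
x, y = torch.randn(8, 128), torch.randint(0, 2, (8,))
weights = [p for p in model.parameters() if p.dim() > 1]

loss = nn.functional.cross_entropy(model(x), y)

# First backward with create_graph=True keeps the full forward *and*
# backward graph alive so we can differentiate through the gradients.
grads = torch.autograd.grad(loss, weights, create_graph=True)

# Differentiating the inner product <g, stop_grad(g)> gives a Hessian-vector
# product Hg, which lands in p.grad after z.backward().
z = sum((g * g.detach()).sum() for g in grads)
z.backward()

# GraSP-style saliency: -w * Hg elementwise (sign conventions vary by repo).
scores = [-(p * p.grad).detach() for p in weights]

Because the graph from the first backward must be retained until z.backward(), peak memory is noticeably higher than for a normal training step, so the scoring batch usually has to be much smaller than the fine-tuning batch. An OOM that only appears at the second iteration would also be consistent with something (e.g. the grads or the loss) being kept across iterations, which retains the previous iteration's graph as well.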

Thanks

Sparse models available for download?

Hello :-)

I found your paper very interesting and would like to play with the sparse models that it produced. Are you going to open-source them soon?

[Question] the learning rate scheduler may not reach zero, right?

The implementation doesn't seem to follow the description in the paper:

Learning rate decays linearly from initial value to zero

When running the LT_glue.py command shown in the README, the script sets --num_train_epochs to 30, which means training each pruned subnetwork takes 3 epochs (so the [0%, 10%, ..., 90%]-pruned subnetworks take 30 epochs in total).
However, the learning rate scheduler's horizon depends on --num_train_epochs (30) rather than 3:

t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
...
scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)

So, the last learning rate after 3 epochs is 1.8e-5 (= 2e-5 * (1 - 3/30)), rather than zero, if we start from 2e-5.
Is my understanding correct?
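One way to check this numerically, assuming (purely for illustration) about 1,000 optimizer steps per epoch and no warmup; only the ratio of steps taken to t_total matters:

import torch
from transformers import get_linear_schedule_with_warmup

steps_per_epoch = 1000           # hypothetical; depends on dataset and batch size
t_total = steps_per_epoch * 30   # scheduler horizon, as in LT_glue.py
steps_run = steps_per_epoch * 3  # steps actually taken per subnetwork

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=t_total
)

for _ in range(steps_run):
    optimizer.step()
    scheduler.step()

print(scheduler.get_last_lr())   # roughly [1.8e-05] = 2e-5 * (1 - 3/30), not 0

Setting num_training_steps to each round's 3-epoch budget instead would let the schedule actually reach zero, as the paper's description suggests.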

Of course, this minor difference doesn't detract from the great contributions of the work. I just want to confirm it and pin down the detail for reproduction and my own further research. Thanks!
