
LuckMatters

Code for deep learning theory experiments on multilayer ReLU networks.

Relevant paper: "Luck Matters: Understanding Training Dynamics of Deep ReLU Networks" (arXiv:1905.13405).

Usage

Run the main experiment using an existing DL library (PyTorch). The lowest layer reaches a correlation of 0.999 within a few iterations; the second-lowest layer can reach ~0.95.

python recon_multilayer.py --data_std 10.0 --node_multi 10 --lr 0.05 --dataset gaussian --d_output 100 --seed 124
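
For context, a minimal sketch of how such a node-wise correlation can be measured, assuming (as in the paper) a teacher-student setup where each teacher node is matched to its best-correlated student node; the function and variable names are illustrative, not the script's internals:

import torch

def best_node_correlation(W_teacher, W_student):
    # Normalize each weight row (one row per node) to unit norm, then take
    # the cosine similarity between every teacher/student node pair.
    t = W_teacher / W_teacher.norm(dim=1, keepdim=True)
    s = W_student / W_student.norm(dim=1, keepdim=True)
    corr = t @ s.t()               # [n_teacher, n_student] similarities
    return corr.max(dim=1).values  # best student match per teacher node

# Example with random weights: 5 teacher nodes, 50 student nodes, 20-dim input.
print(best_node_correlation(torch.randn(5, 20), torch.randn(50, 20)))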

Matrix version, used to check the over-parameterization theorem (you should see that the second-layer weight rows of redundant student nodes converge to zero).

python test_multilayer.py --perturb --node_multi 2 --lr 0.05 --init_std 0.1 --batchsize 64 --seed 232 --verbose

Checking W_row_norm in the output, we find:

[1]: W_row_norm: tensor([1.2050e+00, 1.2196e+00, 1.1427e+00, 1.3761e+00, 1.1161e+00, 1.4610e+00,
        1.1305e+00, 1.0719e+00, 1.1388e+00, 1.2870e+00, 1.2480e+00, 1.1709e+00,
        1.2928e+00, 1.2677e+00, 1.2754e+00, 1.1399e+00, 1.1465e+00, 1.1292e+00,
        1.4311e+00, 1.1534e+00, 1.1562e-04, 1.0990e-04, 9.2137e-05, 8.3408e-05,
        1.2864e-04, 2.3824e-04, 1.0199e-04, 1.1282e-04, 1.1691e-04, 1.4917e-03,
        1.5522e-04, 6.1745e-05, 1.1086e-04, 1.8588e-04, 1.1351e-04, 2.4844e-04,
        1.3347e-04, 6.5837e-05, 1.5340e-03, 9.1208e-05, 4.2515e-05])
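
Note the two clusters: 20 of the 41 rows have norm around 1.1-1.5, while the remaining 21 sit near 1e-4, i.e., those student nodes have effectively died out. A sketch of how such row norms can be computed, assuming the weights live in a plain tensor with one row per node (shapes are illustrative):

import torch

W = torch.randn(41, 64)      # hypothetical weight matrix, one row per node
W_row_norm = W.norm(dim=1)   # L2 norm of each row
print(W_row_norm)
# Rows with near-zero norm correspond to nodes that the over-parameterized
# network has effectively pruned away.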

Other usage:

Matrix-version backpropagation:

python test_multilayer.py --init_std 0.1 --lr 0.2 --seed 433
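
"Matrix version" here refers to writing the forward and backward passes as explicit matrix operations rather than relying on autograd. A minimal sketch for a two-layer ReLU network with L2 loss (all shapes and names are assumptions, not the script's):

import torch

# Hypothetical shapes: batch of 64 inputs, 20 -> 40 -> 10 network.
X = torch.randn(64, 20)
W1, W2 = torch.randn(40, 20), torch.randn(10, 40)
Y = torch.randn(64, 10)

# Forward pass in matrix form.
H_pre = X @ W1.t()              # pre-activations, [64, 40]
H = H_pre.clamp(min=0)          # ReLU
Y_hat = H @ W2.t()              # outputs, [64, 10]

# Backward pass in matrix form for L = ||Y_hat - Y||^2 / (2N).
dY = (Y_hat - Y) / X.shape[0]          # [64, 10]
gW2 = dY.t() @ H                       # gradient w.r.t. W2, [10, 40]
dH = (dY @ W2) * (H_pre > 0).float()   # gate by the ReLU derivative
gW1 = dH.t() @ X                       # gradient w.r.t. W1, [40, 20]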

Precise gradient (single-sample gradient accumulation, very slow):

python test_multilayer.py --init_std 0.1 --lr 0.2 --seed 433 --use_accurate_grad
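
The precise-gradient path accumulates the gradient one sample at a time instead of over a whole batch, which is why it is slow. A sketch of that pattern with standard PyTorch autograd (the model and data here are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 40), nn.ReLU(), nn.Linear(40, 10))
X, Y = torch.randn(64, 20), torch.randn(64, 10)
loss_fn = nn.MSELoss()

model.zero_grad()
for x, y in zip(X, Y):
    # One sample at a time; .grad fields accumulate across backward() calls.
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)) / X.shape[0]
    loss.backward()
# model[0].weight.grad now holds the gradient averaged over all samples.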

Note that:

  1. data_std needs to be 10 so that the generated dataset covers the corners (with data_std = 1 the data cannot reach all corners and the correlation stays low); see the sketch after this list.

  2. It looks like node_multi = 10 is good enough; a larger node_multi makes convergence slower (in terms of steps).

  3. More supervision definitely helps: the larger d_output, the better. d_output = 10 also works (the lowest layer still reaches 0.999 everywhere) but not as well as d_output = 100.

  4. A high lr seems to make training unstable.

  5. Adding --normalize makes results a bit worse. More secrets hide in BatchNorm!
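
To make the corner intuition in note 1 concrete, one rough proxy (an assumption on my part, not the repo's definition) is to treat a "corner" as an orthant region whose coordinates all exceed some threshold, and to check what fraction of the corners receives samples at each data_std:

import torch

def corner_coverage(data_std, thresh=2.0, n_samples=10000, dim=5):
    # Fraction of the 2^dim orthant "corners" that receive at least one
    # sample with every coordinate beyond +-thresh (thresh and dim are
    # arbitrary illustrative choices).
    X = torch.randn(n_samples, dim) * data_std
    deep = X.abs().min(dim=1).values > thresh   # deep inside some corner
    signs = (X[deep] > 0).long()
    codes = (signs * (2 ** torch.arange(dim))).sum(dim=1)
    return codes.unique().numel() / 2 ** dim

print(corner_coverage(1.0), corner_coverage(10.0))

With data_std = 1, essentially no sample lands deep inside any corner at this threshold, while data_std = 10 covers all of them.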

Visualization code

Will be released soon.

License

See LICENSE file.
