
keras-adabound's People

Contributors

titu1994

keras-adabound's Issues

Using SGDM with lr=0.1 leads to no learning

Thanks for sharing your Keras version of AdaBound. I found that when I change the optimizer from AdaBound to SGDM (lr=0.1), the ResNet doesn't learn at all, as in the figure below.
[image: training curves showing the ResNet failing to learn]

I remember that the original paper uses SGDM (lr=0.1) for its comparisons, and I'm wondering how this could be.
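For reference, a minimal sketch of the paper's SGDM baseline in legacy Keras; the momentum value of 0.9 is an assumption, since only lr=0.1 is stated above:

from keras.optimizers import SGD

# SGD with momentum (SGDM), the baseline used in the paper's comparisons.
# momentum=0.9 is assumed here; the issue only specifies lr=0.1.
sgdm = SGD(lr=0.1, momentum=0.9)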

Can't set attribute

File "/adabound.py", line 40, in init
self.lr = K.variable(lr, name='lr')
AttributeError: can't set attribute
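One plausible cause, assuming this traceback comes from running the class against tf.keras rather than standalone Keras: the tf.keras OptimizerV2 base manages lr as a hyperparameter, so a plain attribute assignment can fail. A sketch of the usual OptimizerV2 pattern (hypothetical stub class, not the repo's implementation):

import tensorflow as tf

# Hypothetical stub showing the OptimizerV2 idiom; not the repo's code.
class AdaBoundV2Stub(tf.keras.optimizers.Optimizer):
    def __init__(self, lr=1e-3, name='AdaBoundV2Stub', **kwargs):
        super(AdaBoundV2Stub, self).__init__(name, **kwargs)
        # `self.lr = K.variable(...)` raises AttributeError under
        # OptimizerV2; register the value as a hyperparameter instead.
        self._set_hyper('learning_rate', lr)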

Unclear how to import and use tf.keras version

I have downloaded the files and placed them in a folder inside the site-packages directory of my virtual environment, but I can't get this to work. I have added the folder path to sys.path and verified that it is listed. I'm running TensorFlow 2.1.0. What am I doing wrong?
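A minimal sketch of vendoring the single-file module; the path is illustrative, and the module name adabound is assumed from the traceback in the issue above:

import sys
sys.path.insert(0, '/path/to/keras-adabound')  # folder containing adabound.py

from adabound import AdaBound  # module name assumed

opt = AdaBound(lr=1e-3, final_lr=0.1)

Note that under TensorFlow 2.1.0 the legacy-Keras-style class may also hit the "can't set attribute" error from the previous issue.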

suggestion: allow training 2x or 3x bigger networks on the same VRAM with the TF backend

Same as my PR keras-team/keras-contrib#478.
Works only with the TF backend.

class AdaBound(Optimizer):
    """AdaBound optimizer.
    Default parameters follow those provided in the original paper.
    # Arguments
        lr: float >= 0. Learning rate.
        final_lr: float >= 0. Final learning rate.
        beta_1: float, 0 < beta < 1. Generally close to 1.
        beta_2: float, 0 < beta < 1. Generally close to 1.
        gamma: float >= 0. Convergence speed of the bound function.
        epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
        decay: float >= 0. Learning rate decay over each update.
        weight_decay: float >= 0. Weight decay coefficient.
        amsbound: boolean. Whether to apply the AMSBound variant of this
            algorithm.
        tf_cpu_mode: int. TensorFlow backend only.
              0 - default, no changes.
              1 - allows training a ~2x bigger network on the same VRAM,
                  consuming extra host RAM.
              2 - allows training a ~3x bigger network on the same VRAM,
                  consuming ~2x host RAM and extra CPU time.
    # References
        - [Adaptive Gradient Methods with Dynamic Bound of Learning Rate]
          (https://openreview.net/forum?id=Bkg3g2R9FX)
        - [Adam - A Method for Stochastic Optimization]
          (https://arxiv.org/abs/1412.6980v8)
        - [On the Convergence of Adam and Beyond]
          (https://openreview.net/forum?id=ryQu7f-RZ)
    """

    def __init__(self, lr=0.001, final_lr=0.1, beta_1=0.9, beta_2=0.999, gamma=1e-3,
                 epsilon=None, decay=0., amsbound=False, weight_decay=0.0, tf_cpu_mode=0, **kwargs):
        super(AdaBound, self).__init__(**kwargs)

        if not 0. <= gamma <= 1.:
            raise ValueError("Invalid `gamma` parameter. Must lie in [0, 1] range.")

        with K.name_scope(self.__class__.__name__):
            self.iterations = K.variable(0, dtype='int64', name='iterations')
            self.lr = K.variable(lr, name='lr')
            self.beta_1 = K.variable(beta_1, name='beta_1')
            self.beta_2 = K.variable(beta_2, name='beta_2')
            self.decay = K.variable(decay, name='decay')

        self.final_lr = final_lr
        self.gamma = gamma

        if epsilon is None:
            epsilon = K.epsilon()
        self.epsilon = epsilon
        self.initial_decay = decay
        self.amsbound = amsbound

        self.weight_decay = float(weight_decay)
        self.base_lr = float(lr)
        self.tf_cpu_mode = tf_cpu_mode

    def get_updates(self, loss, params):
        grads = self.get_gradients(loss, params)
        self.updates = [K.update_add(self.iterations, 1)]

        lr = self.lr
        if self.initial_decay > 0:
            lr = lr * (1. / (1. + self.decay * K.cast(self.iterations,
                                                      K.dtype(self.decay))))

        t = K.cast(self.iterations, K.floatx()) + 1

        # Applies bounds on actual learning rate
        step_size = lr * (K.sqrt(1. - K.pow(self.beta_2, t)) /
                          (1. - K.pow(self.beta_1, t)))

        final_lr = self.final_lr * lr / self.base_lr
        lower_bound = final_lr * (1. - 1. / (self.gamma * t + 1.))
        upper_bound = final_lr * (1. + 1. / (self.gamma * t))

        # tf_cpu_mode > 0: create the moment accumulators on the CPU so
        # they consume host RAM instead of VRAM (TF backend only).
        e = K.tf.device("/cpu:0") if self.tf_cpu_mode > 0 else None
        if e: e.__enter__()
        ms = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        vs = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        if self.amsbound:
            vhats = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params]
        else:
            vhats = [K.zeros(1) for _ in params]
        if e: e.__exit__(None, None, None)

        self.weights = [self.iterations] + ms + vs + vhats

        for p, g, m, v, vhat in zip(params, grads, ms, vs, vhats):
            # apply weight decay
            if self.weight_decay != 0.:
                g += self.weight_decay * K.stop_gradient(p)

            # tf_cpu_mode == 2: also compute the moment updates on the
            # CPU, trading CPU time for further VRAM savings.
            e = K.tf.device("/cpu:0") if self.tf_cpu_mode == 2 else None
            if e: e.__enter__()
            m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
            v_t = (self.beta_2 * v) + (1. - self.beta_2) * K.square(g)
            if self.amsbound:
                vhat_t = K.maximum(vhat, v_t)
                self.updates.append(K.update(vhat, vhat_t))
            if e: e.__exit__(None, None, None)

            if self.amsbound:
                denom = (K.sqrt(vhat_t) + self.epsilon)
            else:
                denom = (K.sqrt(v_t) + self.epsilon)

            # Compute the bounds
            step_size_p = step_size * K.ones_like(denom)
            step_size_p_bound = step_size_p / denom
            bounded_lr_t = m_t * K.minimum(K.maximum(step_size_p_bound,
                                                     lower_bound), upper_bound)

            p_t = p - bounded_lr_t

            self.updates.append(K.update(m, m_t))
            self.updates.append(K.update(v, v_t))
            new_p = p_t

            # Apply constraints.
            if getattr(p, 'constraint', None) is not None:
                new_p = p.constraint(new_p)

            self.updates.append(K.update(p, new_p))
        return self.updates

    def get_config(self):
        config = {'lr': float(K.get_value(self.lr)),
                  'final_lr': float(self.final_lr),
                  'beta_1': float(K.get_value(self.beta_1)),
                  'beta_2': float(K.get_value(self.beta_2)),
                  'gamma': float(self.gamma),
                  'decay': float(K.get_value(self.decay)),
                  'epsilon': self.epsilon,
                  'weight_decay': self.weight_decay,
                  'amsbound': self.amsbound}
        base_config = super(AdaBound, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
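A minimal usage sketch for the tf_cpu_mode option, assuming the class above is in scope; the model itself is illustrative:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(10, activation='softmax', input_shape=(784,))])
# tf_cpu_mode=1 creates the moment variables on the CPU, trading host
# RAM for VRAM; TF backend only.
model.compile(optimizer=AdaBound(lr=1e-3, final_lr=0.1, tf_cpu_mode=1),
              loss='categorical_crossentropy')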

about lr

Thanks for a good optimizer.
According to the usage:
optm = AdaBound(lr=1e-03,
                final_lr=0.1,
                gamma=1e-03,
                weight_decay=0.,
                amsbound=False)
Does the learning rate gradually increase with the number of steps?


final_lr is described as "Final learning rate.",
but is it actually a learning rate relative to the base lr and the current learning rate?

final_lr = self.final_lr * lr / self.base_lr
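Both behaviors can be read off the code above: the line just quoted rescales final_lr so that the ratio final_lr / lr stays fixed when decay shrinks lr, and the bound schedule squeezes the effective step size toward final_lr as t grows. A quick numeric sketch of that schedule:

# With final_lr=0.1 and gamma=1e-3, the lower bound rises from 0 and the
# upper bound falls toward final_lr, so early steps behave like Adam and
# late steps like SGD with lr = final_lr.
final_lr, gamma = 0.1, 1e-3
for t in (1, 100, 10000, 1000000):
    lower = final_lr * (1. - 1. / (gamma * t + 1.))
    upper = final_lr * (1. + 1. / (gamma * t))
    print(t, lower, upper)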

AdaBound.iterations

This parameter is not saved.

I looked at the official PyTorch implementation from the original paper:
https://github.com/Luolc/AdaBound/blob/master/adabound/adabound.py

It has:

# State initialization
if len(state) == 0:
    state['step'] = 0

The state is saved with the optimizer.

It also has:

# Exponential moving average of gradient values
state['exp_avg'] = torch.zeros_like(p.data)
# Exponential moving average of squared gradient values
state['exp_avg_sq'] = torch.zeros_like(p.data)

These values should also be saved.

So your Keras implementation is wrong.
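For comparison, the Keras class above does list its state in self.weights ([self.iterations] + ms + vs + vhats), which is what Keras's save/load machinery serializes. A quick way to inspect this, as a sketch assuming a compiled model m that has run at least one training step:

# The optimizer exposes its state through self.weights: iterations plus
# ms, vs and vhats for each trainable parameter.
state = m.optimizer.get_weights()  # list of numpy arrays
print(len(state))  # 1 + 3 * len(m.trainable_weights)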

Unexpected keyword argument passed to optimizer: amsbound

I installed with:
pip install keras-adabound
imported with:
from keras_adabound import AdaBound
and declared the optimizer as:
opt = AdaBound(lr=1e-03, final_lr=0.1, gamma=1e-03, weight_decay=0., amsbound=False)
Then I get the error:
TypeError: Unexpected keyword argument passed to optimizer: amsbound

Changing the pip install to adabound (instead of keras-adabound) and the import to from adabound import AdaBound, the keyword amsbound is recognized, but then I get the error:
TypeError: __init__() missing 1 required positional argument: 'params'

Am I mixing something up here, or missing something?
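A quick way to check which AdaBound was actually imported and what arguments it accepts, using only the standard library:

import inspect
from keras_adabound import AdaBound

print(inspect.getfile(AdaBound))             # which installation is in use
print(inspect.signature(AdaBound.__init__))  # does it accept amsbound?

The missing 'params' argument from the adabound package suggests it is the PyTorch implementation, whose constructor takes the model parameters as its first argument.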
