
mpyrozhok / adamwr

143 stars · 4 watchers · 24 forks · 19 KB

Implements the AdamW optimizer (https://arxiv.org/abs/1711.05101), a cosine annealing learning rate scheduler with restarts, and "Cyclical Learning Rates for Training Neural Networks" (https://arxiv.org/abs/1506.01186) for the PyTorch framework

License: MIT License

Python 100.00%
clr scheduler pytorch adamw adamw-optimizer restarts triangular optimizer cyclical-learning-rate cosine-annealing

adamwr's People

Contributors: mpyrozhok

adamwr's Issues

scheduler.batch_step() AttributeError: 'CosineLRWithRestarts' object has no attribute 'batch_increment'

Z:\sp2\nhdeblur_pytorch>python "train.py" 1>"train_log.txt"
Traceback (most recent call last):
  File "train.py", line 140, in <module>
    train(train_gen=trainloader, model=model, criterion=criterion, optimizer=optimizer, epoch=epoch)
  File "train.py", line 115, in train
    scheduler.batch_step()
  File "Z:\sp2\nhdeblur_pytorch\cosine_scheduler.py", line 110, in batch_step
    t_cur = self.t_epoch + next(self.batch_increment)
AttributeError: 'CosineLRWithRestarts' object has no attribute 'batch_increment'

optimizer = adamw.AdamW(model.parameters(), lr=opt.lr, weight_decay=0)
scheduler = cosine_scheduler.CosineLRWithRestarts(optimizer, batch_size=opt.batch_size, epoch_size=len(src_set), restart_period=5, t_mult=1.2)

def train(train_gen, model, criterion, optimizer, epoch):
    epoch_loss = 0
    for iteration, batch in enumerate(train_gen, 1):
        nr = batch[0].to(device)
        hr = batch[1].to(device)
        
        optimizer.zero_grad()
        loss = criterion(model(nr), hr)
        epoch_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.batch_step()
    
        if iteration % 1000 == 0:
            print('===> Epoch[{e}]({it}/{dl}): Loss: {l:.4f};'.format(e=epoch, it=iteration, dl=len(train_gen), l=loss.item()))
            
    Current_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
    epoch_loss_average = epoch_loss / len(train_gen)
    print('===> {ct} Epoch {e} Complete: Avg Loss: {avg_loss:.4f}, Sum Loss: {sum_loss:.4f}'
          .format(e=epoch, avg_loss=epoch_loss_average, sum_loss=epoch_loss, ct=Current_time))
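A likely cause (my reading of the scheduler, not something confirmed in this thread): CosineLRWithRestarts appears to create its batch_increment iterator inside the epoch-level step() method, so step() has to be called at the start of every epoch, before any batch_step() calls. A minimal sketch of the adjusted outer loop, reusing the names from the snippet above (num_epochs is a placeholder):

    for epoch in range(num_epochs):
        scheduler.step()   # per-epoch update; (re)creates the batch_increment iterator
        train(train_gen=trainloader, model=model, criterion=criterion,
              optimizer=optimizer, epoch=epoch)   # train() then calls scheduler.batch_step() per batch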

LR Scheduler help

Can you please help me write my own learning rate scheduler? I couldn't find much documentation on how to write one in PyTorch. I went through this MXNet guide and came to the conclusion that I should do the following:

lrs = [scheduler(i + 1) for i in range(epochs * len(train))]  # one lr per iteration; assumes train has a length
iters = 0
for i in range(epochs):
    for data, label in train:
        ...  # forward pass, loss, backward
        for group in optimizer.param_groups:
            group['lr'] = lrs[iters]
        optimizer.step()
        iters += 1

What would be a more elegant way of doing this?
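One option (a sketch, not code from this repository): PyTorch's built-in schedulers subclass torch.optim.lr_scheduler._LRScheduler and override get_lr(); each call to scheduler.step() then writes the new learning rate into optimizer.param_groups for you. For many custom schedules LambdaLR is already enough. Below, the schedule (linear warmup over 500 iterations) is made up for illustration, and epochs/train are assumed to be the same objects as in the snippet above:

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    model = torch.nn.Linear(10, 1)                       # stand-in model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    warmup = 500
    scheduler = LambdaLR(optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / warmup))

    for i in range(epochs):
        for data, label in train:
            ...                                          # forward, loss, loss.backward()
            optimizer.step()
            scheduler.step()                             # scales the base lr by lr_lambda(iteration)

Calling step() once per batch turns LambdaLR's internal counter into an iteration counter, which matches the per-iteration indexing in the loop above.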

Persisting CosineAnnealingLRWithRestarts

Hi there,

Up to now all my schedulers inherited from _LRScheduler, so I didn't need to care too much about how they would be persisted.

For my checkpoints I define my state like this

    state = {
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "scheduler_state": scheduler.state_dict(),
    }

However, CosineAnnealingLRWithRestarts doesn't have this state_dict() method.

I checked the implementation of state_dict() in the documentation:
https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#LambdaLR

and tried to extend your code myself, but I probably missed something.
Could you take a look?

Diffs are:

I inherit the class from _LRScheduler:

import types

from torch.optim.lr_scheduler import _LRScheduler

class CosineAnnealingLRWithRestarts(_LRScheduler):

And rewrote state_dict() and load_state_dict():



    def state_dict(self):
        """Returns the state of the scheduler as a :class:`dict`.

        It contains an entry for every variable in self.__dict__ which
        is not the optimizer.
        The learning rate lambda functions will only be saved if they are callable objects
        and not if they are functions or lambdas.
        """
        state_dict = {key: value for key, value in self.__dict__.items() if key not in ('optimizer', 'base_lrs', 'base_weight_decays')}
        state_dict['base_lrs'] = [None] * len(self.base_lrs)
        state_dict['base_weight_decays'] = [None] * len(self.base_weight_decays)

        for idx, fn in enumerate(self.base_weight_decays):
            if not isinstance(fn, types.FunctionType):
                # state_dict['base_weight_decays'][idx] = fn.__dict__.copy()
                state_dict['base_weight_decays'][idx] = fn

        for idx, fn in enumerate(self.base_lrs):
            if not isinstance(fn, types.FunctionType):
                # state_dict['base_lrs'][idx] = fn.__dict__.copy()
                state_dict['base_lrs'][idx] = fn


        return state_dict

    def load_state_dict(self, state_dict):
        """Loads the schedulers state.

        Arguments:
            state_dict (dict): scheduler state. Should be an object returned
                from a call to :meth:`state_dict`.
        """
        base_lrs = state_dict.pop('base_lrs')
        base_weight_decays = state_dict.pop('base_weight_decays')

        self.__dict__.update(state_dict)

        for idx, fn in enumerate(base_lrs):
            if fn is not None:
                self.base_lrs[idx] = fn        

        for idx, fn in enumerate(base_weight_decays):
            if fn is not None:
                self.base_weight_decays[idx] = fn

However, I still get: AttributeError: Can't pickle local object 'Tensor.__iter__.<locals>.<lambda>'
It would be terrific to be able to persist the state of this scheduler :-)
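A possible workaround (a sketch built on the assumption that the failure comes from non-picklable members such as the per-batch iterator or captured lambdas; I haven't verified this against the repository): filter the scheduler's __dict__ down to entries that pickle cleanly and let the dropped members be rebuilt by the next call to step().

    import pickle

    def _is_picklable(value):
        try:
            pickle.dumps(value)
            return True
        except Exception:
            return False

    def state_dict(self):
        # Keep only entries that survive pickling; skip the optimizer reference
        # and transient objects such as the per-batch iterator.
        return {key: value for key, value in self.__dict__.items()
                if key != 'optimizer' and _is_picklable(value)}

    def load_state_dict(self, state_dict):
        # Dropped members (e.g. the batch iterator) are recreated on the next step().
        self.__dict__.update(state_dict)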

Getting StopIteration when running training


StopIteration                             Traceback (most recent call last)
in <module>()
      1 training(model=model, epoch=20, eval_every=500,
      2          loss_func=loss_function, optimizer=optimizer, train_iter=train_iter,
----> 3          val_iter=val_iter, scheduler=scheduler, warmup_epoch=3, early_stop=2)

in training(epoch, model, eval_every, loss_func, optimizer, train_iter, val_iter, scheduler, early_stop, warmup_epoch)
     37     loss.backward()
     38     optimizer.step()
---> 39     scheduler.batch_step()
     40     if step % eval_every == 0:
     41         model.eval()

in batch_step(self)
    274
    275     def batch_step(self):
--> 276         t_cur = self.t_epoch + next(self.batch_increment)
    277         for param_group, (lr, weight_decay) in zip(self.optimizer.param_groups,
    278                                                    self.get_lr(t_cur)):

StopIteration:

StopIteration

Hi, thank you for sharing your work. Following your description, I tried to use your code in my project, but I get this error in scheduler.batch_step(); it happens on the line 't_cur = self.t_epoch + next(self.batch_increment)'.
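One common explanation (an assumption, not confirmed in this thread): the scheduler precomputes one increment per expected batch from the epoch_size and batch_size it was constructed with, so if the loop actually runs more batches per epoch than ceil(epoch_size / batch_size), the iterator is exhausted and next() raises StopIteration. A sketch that derives both values from the DataLoader driving the loop (loader and num_epochs are placeholder names):

    scheduler = CosineLRWithRestarts(optimizer,
                                     batch_size=loader.batch_size,
                                     epoch_size=len(loader.dataset),
                                     restart_period=5, t_mult=1.2)

    for epoch in range(num_epochs):
        scheduler.step()              # re-initializes the per-batch iterator each epoch
        for batch in loader:
            ...                       # forward / backward
            optimizer.step()
            scheduler.batch_step()    # exactly as many calls as the scheduler expects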

Lower/Upper Bound for LR and Upper Bound decay

Hey there,

Nice update of the scheduler! It's really useful!

It would also be nice to be able to set the following parameters: base_lr, max_lr and scale_fn

The scale_fn would be a function that decreases the max_lr:

  • by half after each period, while keeping the base lr constant,
  • by a factor of gamma**(iterations),
  • or by whatever lambda function is given.

Here is an example implementation in Keras: https://github.com/bckenstler/CLR
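For reference, in that Keras implementation these policies boil down to small scaling functions applied to the lr range, roughly like the following (a sketch of the idea, not code from this repository; gamma is a user-chosen decay factor):

    # "triangular2": halve the lr range every cycle, keeping base_lr fixed
    scale_fn_triangular2 = lambda cycle: 1.0 / (2.0 ** (cycle - 1))

    # "exp_range": shrink the range by gamma per iteration
    gamma = 0.99994
    scale_fn_exp_range = lambda iterations: gamma ** iterations

    # or any user-supplied lambda, e.g. keep the range constant
    scale_fn_constant = lambda x: 1.0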

I tried to hack this in myself but I'm stuck. I'm not entirely sure which eta you use (is it the one from weight decay?). And even if I'm right, I can't persist my hack because of the lambda function -.-

Also, I'm not sure why, but in my case (super-resolution), when using cosine/arccosine my model diverges each time after a restart. (AdamW, wd=1e-6)
It happens with triangular too, but not right at the start of the second cycle.
Do you maybe have an idea where this could come from?

Thanks for your time!

Hypergradient Descent

Thank you for sharing this. Would it be possible for you to also integrate the Hypergradient Descent technique into your AdamW implementation? It reduces the need to tune the initial learning rate. https://github.com/gbaydin/hypergradient-descent

                if state['step'] > 1:
                    prev_bias_correction1 = 1 - beta1 ** (state['step'] - 1)
                    prev_bias_correction2 = 1 - beta2 ** (state['step'] - 1)
                    # Hypergradient for Adam:
                    h = torch.dot(grad.view(-1), torch.div(exp_avg, exp_avg_sq.sqrt().add_(group['eps'])).view(-1)) * math.sqrt(prev_bias_correction2) / prev_bias_correction1
                    # Hypergradient descent of the learning rate:
                    group['lr'] += group['hypergrad_lr'] * h
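For context, here is a rough sketch of how the adjusted learning rate could then feed a decoupled-weight-decay (AdamW-style) parameter update. This is an assumption about how the two pieces would combine, not code from either repository; p, group, state, exp_avg, exp_avg_sq, beta1, beta2 and the math import are taken from the surrounding optimizer loop, as in the excerpt above:

                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
                denom = exp_avg_sq.sqrt().add_(group['eps'])
                # Adam step, using the (possibly hypergradient-adjusted) learning rate
                p.data.addcdiv_(exp_avg, denom, value=-step_size)
                # Decoupled weight decay applied directly to the weights (the AdamW part)
                if group['weight_decay'] != 0:
                    p.data.mul_(1 - group['lr'] * group['weight_decay'])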

I have also read a lot of criticism of AMSGrad and haven't yet been able to get any improvement with that variant. Could you share your thoughts on that? FYI, two other techniques I am currently experimenting with are Padam and QHAdam.

Add License

Could you add a license to this project so that people can copy, modify, and redistribute? Thanks!
