
lomo's Introduction

English | 中文

This is the implementation for Full Parameter Fine-Tuning for Large Language Models with Limited Resources and AdaLomo: Low-memory Optimization with Adaptive Learning Rate.

News

  • LOMO and AdaLomo have been integrated into transformers and accelerate.
  • The PyPI package lomo-optim has been released.
  • LOMO and AdaLomo have been integrated into the CoLLiE library, which supports Collaborative Training of Large Language Models in an Efficient Way.

Usage

You can install lomo-optim from PyPI using pip.

pip install lomo-optim

Then, import Lomo or AdaLomo.

from lomo_optim import Lomo
from lomo_optim import AdaLomo

The usage of Lomo and AdaLomo is similar to, but not the same as, PyTorch's optimizers (example); a minimal sketch is shown below. We recommend using AdaLomo without gradnorm to get better performance and higher throughput.
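The following is a rough sketch of what a training loop with Lomo might look like. The constructor arguments, `model`, and `train_dataloader` are illustrative assumptions; the linked example remains the authoritative reference. The key point is that fused_backward(loss, lr) replaces the usual loss.backward() plus optimizer.step() pair.

    from lomo_optim import Lomo

    # Sketch only: `model` is any torch.nn.Module and `train_dataloader` any
    # iterable of batches; the constructor signature is an assumption here.
    optimizer = Lomo(model, lr=1e-3)

    for batch in train_dataloader:
        loss = model(**batch).loss
        # Gradient computation and parameter update are fused into one step,
        # so there is no separate optimizer.step() call.
        optimizer.fused_backward(loss, 1e-3)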

LOMO: LOw-Memory Optimization

In this work, we propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage. Our approach enables the full parameter fine-tuning of a 7B model on a single RTX 3090, or a 65B model on a single machine with 8×RTX 3090, each with 24GB memory.

[Figure: LOMO overview]

Implementation

Hook function Our implementation relies on injecting hook functions into PyTorch's backward pass. As depicted in the figure, we register a customized hook function for each parameter. When the gradient of a parameter is computed (prior to writing it to the .grad attribute), its corresponding hook function is invoked. For more information about hook functions and the backward pass of the autograd graph, please refer to PyTorch's documentation. In summary, during the backward pass, we go through a tensor and its grad_fn, write the gradient into the .grad attribute, and then pass to the next tensor.

Our customized hook function scans all the parameters, updating a parameter if its .grad attribute is not empty, and then clears and frees the .grad attribute. Since the hook function for a parameter is called before its .grad attribute is set, the .grad attribute of the last parameter in the autograd graph is not ready when the last hook function is invoked. Therefore, we perform an additional scan to update the last parameter.
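A minimal sketch of this mechanism (not the repository's exact code; the hook body and the plain SGD update rule are illustrative assumptions) could look like this:

    import torch

    def attach_fused_update_hooks(model: torch.nn.Module, lr: float):
        """Sketch of a LOMO-style fused update via per-parameter hooks."""
        params = [p for p in model.parameters() if p.requires_grad]

        def hook(grad):
            # Scan all parameters: update any whose .grad is already set,
            # then clear and free that gradient immediately.
            for p in params:
                if p.grad is not None:
                    p.data.add_(p.grad, alpha=-lr)
                    p.grad = None
            return grad

        for p in params:
            # Fires when p's gradient is computed, before it is written to p.grad.
            p.register_hook(hook)

    def final_scan(model: torch.nn.Module, lr: float):
        # The .grad of the last parameter in the autograd graph is not yet set
        # when its hook runs, so one extra scan after backward() updates it.
        for p in model.parameters():
            if p.grad is not None:
                p.data.add_(p.grad, alpha=-lr)
                p.grad = None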

The code for LOMO is in the lomo folder.

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

In this work, we examine the distinctions between the LOMO and Adam optimization techniques and introduce AdaLomo, which provides an adaptive learning rate for each parameter and utilizes grouped update normalization while maintaining memory efficiency. AdaLomo achieves results comparable to AdamW in both instruction-tuning and further pre-training, with a smaller memory footprint.

[Figure: AdaLomo overview]

The code for AdaLomo is in the adalomo folder.

Citation

@article{lv2023full,
  title={Full Parameter Fine-tuning for Large Language Models with Limited Resources},
  author={Lv, Kai and Yang, Yuqing and Liu, Tengxiao and Gao, Qinghui and Guo, Qipeng and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2306.09782},
  year={2023}
}
@article{lv2023adalomo,
  title={AdaLomo: Low-memory Optimization with Adaptive Learning Rate},
  author={Lv, Kai and Yan, Hang and Guo, Qipeng and Lv, Haijun and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2310.10195},
  year={2023}
}

lomo's People

Contributors

ayyyq, borda, kailv69, kyln24, qipengguo, younesbelkada


lomo's Issues

Functions to measure the memory usage

Hello, and thank you for your work. I have two questions.
1. Could you provide the functions used to measure memory usage as in Figure 2 and Table 2, or give some hints on how to implement them? nvidia-smi does not seem to reflect the real GPU memory usage correctly.
2. Does the memory in Table 2 refer to the average per-GPU usage, or the total across GPUs?
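One plausible way to measure this from inside the training script (an assumption about methodology, not necessarily what the paper used) is PyTorch's allocator statistics, which exclude the CUDA context overhead that makes nvidia-smi over-report:

    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... run one or more training steps ...
    print(f"currently allocated: {torch.cuda.memory_allocated() / 2**20:.2f} MiB")
    print(f"peak allocated:      {torch.cuda.max_memory_allocated() / 2**20:.2f} MiB")
    print(f"reserved (cached):   {torch.cuda.memory_reserved() / 2**20:.2f} MiB")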

Does this support quantized models?

Hi, does this support quantized models, such as GPTQ?

If so, scaling proportionally, would 8 GPUs with 24GB each plus pipeline parallelism be enough to LoRA-tune a quantized 175B model?

Thanks!

Memory Usage continues to grow

Hello, I added LOMO's core training code to my own script. Training works, but GPU memory keeps growing with each training step. I added some memory-usage logging to my lomo_trainer (using a print_memory_usage function; the readings are in the comments) and picked one step's readings, shown below. Could you suggest some directions for debugging, and what might cause this behavior?

    with tqdm.tqdm(self.train_dataloader, disable=not self.allow_print) as tqb:
        for step, batch in enumerate(tqb, start=1):  # transformers.trainer:1899
            rank0_print('-' * 100)
            rank0_print(f"step {step} : {batch['input_ids'].shape}")
            self.model.train()
            print_memory_usage()  # 5844.15 MiB
            rank0_print("compute outs ...")
            outs = self.model(
                input_ids=batch['input_ids'].cuda(),
                attention_mask=batch['attention_mask'].cuda(),
            )
            print_memory_usage()  # 6310.91 MB
            rank0_print("Line 100 get_loss ...")
            loss = get_loss(outs.logits, batch['labels'], self.training_args.clip_loss_value)
            print_memory_usage()  # 6342.11 MB
            # update the learning rate
            self.global_step = self.num_steps_per_epoch * epoch + step
            self.lr = self.lr_scheduler.step(self.global_step)
            if self.training_args.clip_grad_norm is not None and self.training_args.clip_grad_norm > 0:
                rank0_print('clip_grad_norm.....')
                self.optimizer.grad_norm(loss)
                print_memory_usage()  # 6142.11 MB

            if self.optimizer.loss_scaler and self.optimizer.loss_scaler.has_overflow_serial:
                rank0_print(f"Gradient overflow, skipping step {self.global_step}")
                self.model.optimizer.get_param_coordinator(training=True).reset_step()
                print_memory_usage()
                tqb.set_postfix({'loss': loss.item()})
                if self.allow_print:
                    rank0_print({
                        'train/loss': loss.item(),
                        'train/learning_rate': self.lr,
                        'train/global_step': self.global_step,
                    })
                continue
            else:
                rank0_print('Line 129 reset_step.....')
                self.model.optimizer.get_param_coordinator(training=True).reset_step()
                print_memory_usage()  # 6142.11 MB

            # second forward pass
            rank0_print('second forward.....')
            outs = self.model(
                input_ids=batch['input_ids'].cuda(),
                attention_mask=batch['attention_mask'].cuda(),
            )
            loss = get_loss(outs.logits, batch['labels'], self.training_args.clip_loss_value)
            print_memory_usage()  # 6577.68 MB

            rank0_print('fused_backward.....')
            self.optimizer.fused_backward(loss, self.lr)
            print_memory_usage()  # 6115.92 MB
            rank0_print('loss detach.....')
            loss = loss.detach()
            print_memory_usage()  # 6115.92 MB
            rank0_print('optimizer reset step.....')
            self.model.optimizer.get_param_coordinator(training=True).reset_step()
            print_memory_usage()  # 6115.92 MB

            tqb.set_postfix({'loss': loss.item()})
            if self.allow_print:
                rank0_print({
                    'train/loss': loss.item(),
                    'train/learning_rate': self.lr,
                    'train/global_step': self.global_step,
                })

            if self.training_args.save_strategy == 'steps' and self.global_step % self.training_args.save_steps == 0:
                self.save_model(self.global_step)

            # todo: check why memory increases
            rank0_print('del vars, empty cache.....')
            del loss, outs, batch
            torch.cuda.empty_cache()
            print_memory_usage()  # 6084.66 MB

Error after simple modifications for LOMO+QLoRA

Hi!
Everything runs fine with LOMO+LoRA.
But!

I wanted to combine LOMO with QLoRA, so:
I added quantization to the model code in your lomo_lora sources:
train_lomo_lora.py:
[screenshot]
and added after the LoRA config:
[screenshot]

Because deepspeed does not support quantization for now, I commented out and modified all the deepspeed-related configuration:
train_lomo_lora.py:
[screenshot]
[screenshot]

lomo_lora_trainer.py:
[screenshot]
[screenshot]

Then I modified run.sh:
python src/train_lomo_lora.py config/args_lomo_lora.yaml

After a few runs (and installing the required libraries), it fails with the following error:
Traceback (most recent call last):
  File "/root/model/utils/CustomLOMO/src/train_lomo_lora.py", line 181, in <module>
    train()
  File "/root/model/utils/CustomLOMO/src/train_lomo_lora.py", line 177, in train
    trainer.train()
  File "/root/model/utils/CustomLOMO/src/lomo_lora_trainer.py", line 197, in train
    outs = self.model(
  File "/root/miniconda3/envs/train/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/root/miniconda3/envs/train/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/root/miniconda3/envs/train/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 570, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/root/miniconda3/envs/train/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/root/miniconda3/envs/train/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/miniconda3/envs/train/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 566, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/root/miniconda3/envs/train/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/miniconda3/envs/train/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 194, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self. ...
  File "/root/miniconda3/envs/train/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self. ...
RuntimeError: mat1 and mat2 shapes cannot be multiplied (327x4096 and 1x8388608)

Going back to the source, the problem seems to come from this part of lomo_lora_trainer:
[screenshot]

If I want to completely remove deepspeed and get LOMO+QLoRA running successfully, what changes should I make?

Memory consumption first grows, then falls

Dear authors, it is nice to see this amazing work. When running this code, I noticed an interesting phenomenon: loading the model occupies more GPU memory than training does. Once training starts, GPU memory consumption stabilizes at a value slightly lower than at load time.

For example, when I run openlm-research/open_llama_7b with deepspeed --master_port "$port" --include localhost:"$CUDA_VISIBLE_DEVICES" src/train_lomo.py config/args_lomo.yaml on a single V100 GPU, with batch_size set to 1 and everything else left at the default values, GPU memory consumption is 18588MB before training starts, and during training it stabilizes at 15933MB. Can you provide more information about this phenomenon? Many thanks!

Customized loss value

For full-parameter updates, I have recently been trying a new loss function: adding a regularization term to the original next-token-prediction loss so that the weights of certain layers stay as small as possible. But I keep running into strange bugs. Below is the code I added and the error I get.

For example, in lomo_trainer.py:

lamda, regularization = 1, torch.tensor(0, requires_grad=True, dtype=torch.float32)
self.model.train()
for name, param in self.model.named_parameters():
    if "self_attn.q_proj" in name:
        with GatheredParameters(param):
            regularization = regularization + torch.mean(param)
...
loss = get_loss(outs.logits, batch['labels'], self.training_args.clip_loss_value) + lamda * regularization

However, this causes loss.backward(retain_graph=True) inside grad_norm() in lomo.py to raise RuntimeError: The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1. My guess is that the backward pass cannot find the weights of the layers I added. How can I fix this bug, or is there a better implementation?

Many thanks!

IndexError: tuple index out of range with bloom-1b7 as the model and wic as the dataset

Could this be an fp16 vs. fp32 issue?

File "LOMO/src/lomo_trainer.py", line 210, in train
    self.eval(self.global_step, epoch, self.eval_dataset, self.eval_data...)
File "LOMO/src/lomo_trainer.py", line 237, in eval
    pred = self.eval_step(batch)
File "LOMO/src/lomo_trainer.py", line 263, in eval_step
    outs = self.model(batch['input_ids'].cuda(), batch['attention_mask'].cuda())
    # Shift so that tokens < n predict n
    shift_logits = outs.logits[..., :-1, :].contiguous()
    shift_labels = batch['labels'][..., 1:].cuda().contiguous()

Looking at the LOMO implementation, will training be very slow?

Looking at how the hooks are registered: during backpropagation, every time one parameter's gradient is produced, don't we loop over all parameters once? For a model with billions of parameters, wouldn't this computation pattern make training very slow?

Question about Equation 4

Should the second term L in Equation 4 actually be the derivative of L with respect to f?

Question about fine-tuning LLaMA-65B

Very nice work! I have a few questions, though.

In train_lomo.py the model is loaded with transformers' built-in AutoModelForCausalLM, and only afterwards does the trainer hand the model to deepspeed. That is, the model is first loaded onto every GPU and then optimized with ZeRO-3, so a 3090 should OOM while loading the 65B model.

When you say that eight 3090s can fully fine-tune a 65B model, is that using tensor parallelism in CoLLiE plus LOMO, or can it be done with just the code in this repo?

Thanks!

The implementation of LOMO is not released?

Hi Authors,

Thank you for sharing your work. I was wondering about the implementation of LOMO: I can't see a .train() implementation in the trainer. Could you please share the implementation of LOMO, if possible?

Best regards,
Abdelrahman.

about torch.stack(self.grad_norms)

[screenshot]

Running this raises RuntimeError: stack expects a non-empty TensorList. Looking at the code, self.grad_norms is indeed empty here. How can I fix this?

Testing with P100 on Kaggle

I have a question: is it possible to train a GPT-2 model with LOMO?
And it seems that DeepSpeed is required to use LOMO, right? (What I wonder is: even if I only have a single 4090 GPU, will DeepSpeed still be effective?)

Is LOMO a concurrent work of the official implementation?

Dear authors, I notice that in the official implementation of PyTorch, there is an internal implementation of an optimizer_hook registry merged into the main branch in March 2023. It would be very kind of you to shed light on the following two points:

  1. The current implementation of LOMO does not support nontrivial learning rate schedulers, while the official implementation seems to; so is there any compatibility/memory-efficiency trade-off between the opt.step() style of the official one and the implementation of LOMO?
  2. According to my understanding, LOMO is compatible with model parallelism by and only by tensor parallelism. Is such an understanding reasonable, and does LOMO have an edge over the official implementation in torch.distributed.optim in terms of model parallelism?

Thanks!

ModuleNotFoundError: No module named 'rich' after 'python -m pip install rich'

(gh_LOMO) ub2004@ub2004-B85M-A0:~/llm_dev/LOMO$ deepspeed --master_port "1234" --include localhost:"0" src/train_lomo.py config/args_lomo.yaml
[2023-06-28 02:01:44,116] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-28 02:01:44,142] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=1234 --enable_each_rank_log=None src/train_lomo.py config/args_lomo.yaml
[2023-06-28 02:01:46,243] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-06-28 02:01:46,243] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-06-28 02:01:46,243] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-06-28 02:01:46,243] [INFO] [launch.py:247:main] dist_world_size=1
[2023-06-28 02:01:46,243] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
PYTHON_PATH /home/ub2004/llm_dev/LOMO
Traceback (most recent call last):
  File "/home/ub2004/llm_dev/LOMO/src/train_lomo.py", line 20, in <module>
    from log import print
  File "/home/ub2004/llm_dev/LOMO/log/__init__.py", line 6, in <module>
    from .logger import logger
  File "/home/ub2004/llm_dev/LOMO/log/logger.py", line 32, in <module>
    from rich.logging import RichHandler
ModuleNotFoundError: No module named 'rich'

the model weights seem not to be updated

When I evaluate on the dev dataset during training, the performance (PPL) does not change after training for one epoch. When I look into the trainer code, there seems to be no code that updates the model weights, only code that calls the optimizer. Is this expected (only update the optimizer but not the weights, i.e., no loss.backward)? Thanks!

wandb permission

Hello, and thank you for open-sourcing this.
When I run sh run.sh, I get the following error. Do I need to add something to be granted access?

wandb: ERROR Error while calling W&B API: project not found (<Response [404]>)
Problem at: src/train_lomo.py 78 train
wandb: ERROR It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup.(Error 404: Not Found)
Traceback (most recent call last):
  File "src/train_lomo.py", line 125, in <module>
    train()
  File "src/train_lomo.py", line 78, in train
    config=wandb_config
  File "/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1171, in init
    raise e
  File "/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 1152, in init
    run = wi.init()
  File "/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 768, in init
    raise error
wandb.errors.CommError: It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup. (Error 404: Not Found)

LORA+LOMO distributed learning

Thank you for your work. My question is: why doesn't the provided LOMO+LoRA code support distributed training?
[screenshot]
I commented out the two lines shown above, and the program still runs.

Any chance of training on a 4070 Ti?

OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 11.73 GiB total capacity; 9.55 GiB already allocated; 43.31 MiB free; 10.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Some confusion about the method of the paper

Hello, in traditional backpropagation the chain rule reuses the gradient results from the previous step, but the method in the paper does not store gradients after the update. Does that mean subsequent gradient computations contain redundant recomputation, i.e. a trade of time for space? Is this understanding correct?

Loss does not decrease with a small learning rate

I am training with a small learning rate of 5e-5 on nearly 10,000 specially formatted training examples, and the loss does not decrease. Is something misconfigured?

Questions about understanding the code and memory usage

Hello, thank you very much for your work. While reading your code and paper, a few questions came up; could you help clear up my confusion?

  1. In LOMO/src/lomo.py, the loss.backward() call at line 172 of the fused_backward function does not seem to be specially optimized. Does it run a full backward pass and compute the gradients of all nodes in the network at once? If so, that seems to contradict the paper's idea of only iteratively computing the gradient of the preceding layer. How should I understand this?
  2. In LOMO/src/lomo_lora_trainer.py, line 90 of the initializer of class LOMOLoRATrainer calls the AdamW optimizer. Does this mean LOMO is a higher-level method layered on top of existing optimizers, compatible with optimizers including SGD and AdamW, just working better when combined with SGD while still working with AdamW? Is that a fair understanding?
  3. In https://github.com/OpenLMLab/LOMO/issues/38 you mentioned that the memory in Table 2 is each GPU's memory usage averaged over time. Is there also a clear improvement in peak per-GPU memory compared to before, and is there concrete experimental data for it?

Looking forward to your reply, thanks!

LLaMA-7B + LoRA OOMs on a 16GB V100

Dear authors, following the configuration in this repo, I set both per_device_train_batch_size and per_device_eval_batch_size to 1, and running lomo_lora_trainer.py to train LLaMA-7B on a single 16GB V100 runs out of memory.

The exact configuration is as follows:

# model
model_name_or_path: 'openlm-research/open_llama_7b'
# data
dataset_name: 'wic'
refresh: false
data_tag: 'base'
train_on_inputs: false
data_max_length: 1024
# training
# trainer
peft_type: 'lora'
lora_only: false
hf_learning_rate: 0.0005
hf_weight_decay: 0
hf_lr_scheduler_type: 'linear'
hf_warmup: 0.05
tag: 'lora-qv-r2-lomo'
output_dir: 'outputs'
overwrite_output_dir: true
deepspeed: 'config/ds_config_lora.json'
do_train: true
do_eval: true
evaluation_strategy: 'epoch'
per_device_train_batch_size: 1
per_device_eval_batch_size: 1
learning_rate: 0.005
weight_decay: 0
num_train_epochs: 10
lr_scheduler_type: 'linear'
warmup: 0.05
clip_grad_norm: 1.0
#clip_grad_value: 1.0
#clip_loss_value: 5.0
log_level: 'info'
logging_steps: 1
# please set `resume_from_checkpoint` to load checkpoints. check `merge_llama_with_lora.py` first.
#resume_from_checkpoint: 'outputs/wic_7B_lora-qv-r2-lomo/output_lr0.005_bs16_warmup0.05_clipnorm1.0/checkpoint-0/merge_weights'
# please set `save_strategy` (`no`, `epoch`, `steps`) and `save_total_limit` (the max amount of checkpoints) to save checkpoints.
save_strategy: 'no'
save_total_limit: 0
seed: 42
#bf16: true
remove_unused_columns: false
load_best_model_at_end: false
metric_for_best_model: 'acc'
optim: 'sgd'
group_by_length: false
#report_to: 'wandb'
dataloader_pin_memory: false
gradient_checkpointing: true
predict_with_generate: false
lora_r: 2

Incidentally, with the same configuration but without LoRA, training LLaMA-7B with LOMO on the 16GB V100 uses 15933MB of GPU memory, which does not seem to match the results in the paper. Did I misconfigure something?

Impact of gradient clipping and gradient overflow

Hello, and thank you for your excellent work.
I have two questions: are these two extra mechanisms used in LomoTrainer really important? How much do they affect performance?

Why hasn't LOMO caught on?

Personally, I feel that full-parameter fine-tuning still works better than adapter methods like LoRA, so why hasn't LOMO caught on? I have already fine-tuned a 7B BLOOM with LOMO on two 24GB GPUs, and the whole pipeline felt quite smooth, yet I can hardly find anyone using LOMO on any platform, which is strange.

Does LOMO support training bfloat16 models?

Thank you very much for the code. For a float16 LLaMA-7B model, the code runs fine. Noticing that the code supports bfloat16, we converted the LLaMA model to bfloat16. The code loads the bfloat16 model, runs the forward pass, and computes the loss normally, but backpropagation fails:
at self.optimizer.grad_norm(loss), it raises: AttributeError: 'NoneType' object has no attribute '_has_inf_or_nan'

[screenshot]

What could be the cause?

can you provide the running config of 65b models?

Hi, I'd like to run a 65B LLaMA with LOMO. What config should I use to run training on an 8×RTX 3090 machine?
It would be very nice if you added a config/args_lomo.yaml and config/ds_config.json for 65B models.
Thanks.

Question about Memory usage (GB) when training LLaMA-7B under different settings.

Thanks for your amazing work.

Since I am not very familiar with the memory usage computation, I want to know if you can put more details about the Table 1 into the Appendix.
[screenshot of Table 1]

I can obtain the memory usage of Params = 12.x GB on my own.
But I am quite confused about the memory usage of Activations = 45.x GB.

I read the paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In Section 3.2, they say:

Activations can take up a significant amount of memory [7] during training. 
As a concrete example, the 1.5B parameter GPT-2 model trained with sequence length of 1K and batch size of 32 
requires about 60GB of memory. 

The activation memory of a transformer-based model is proportional to 
the number of transformer layers × hidden dimensions × sequence length × batch size. 

For a GPT-2 like architecture the total activations is about 
12 × hidden dim × batch × seq length × transformer layers.

The following is my computation:

GPT-2 XL:
memory in fp16 : (12 * 1600 * 32 * 1000 * 48) / 1024 / 1024 / 1024 * 2 = 54.9x GB

LLaMA-7b
memory in fp16 : (12 * 4096 * 8 * 512 * 32) / 1024 / 1024 / 1024 * 2 = 12.0 GB

Can you explain why you give a 45.x GB not 12 GB memory usage for activations?
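For reference, here is the same rule-of-thumb arithmetic as a small script (this only restates the numbers above; it does not explain the 45.x GB figure):

    def activation_memory_gb(hidden_dim, batch, seq_len, layers, bytes_per_elem=2):
        # 12 * hidden * batch * seq * layers elements, fp16 = 2 bytes per element
        return 12 * hidden_dim * batch * seq_len * layers * bytes_per_elem / 1024**3

    print(f"GPT-2 XL : {activation_memory_gb(1600, 32, 1000, 48):.1f} GB")  # ~54.9
    print(f"LLaMA-7B : {activation_memory_gb(4096, 8, 512, 32):.1f} GB")    # ~12.0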

a bug found in save_model of LOMOTrainer

I used LOMO (and ZeRO-3) to fine-tune chatglm2-6b on 8 NVIDIA 3090 GPUs and saved it using LOMOTrainer's save_model method. After reloading the model checkpoint, I found that the validation loss measured by the model differed from the validation loss measured at the end of training. I referred to the official DeepSpeed model-saving code and rewrote save_model (rewritten code below), which resolved the problem. This indicates that the original version of save_model has a bug, but I have not yet figured out the specific cause of the error.

    def save_model(self, index):
        if self.training_args.local_rank in [-1, 0]:
            checkpoint_dir = sorted(Path(self.training_args.output_dir).glob("checkpoint-*"))
            if len(checkpoint_dir) >= self.training_args.save_total_limit:
                shutil.rmtree(checkpoint_dir[0], ignore_errors=True)
        torch.distributed.barrier()

        if self.training_args.resume_step:
            output_dir = os.path.join(self.training_args.output_dir, f"checkpoint-{index+self.training_args.resume_step}")
        else:
            output_dir = os.path.join(self.training_args.output_dir, f"checkpoint-{index}")
        if not os.path.exists(output_dir):
            os.makedirs(output_dir, exist_ok=True)

        state_dict = OrderedDict() if torch.distributed.get_rank() == 0 else None
        shared_params = {}

        # Prepare for checkpoint save by ensuring all parameters are partitioned
        self.model.optimizer.partition_all_parameters()

        with deepspeed.zero.GatheredParameters(list(self.model.module.parameters()), modifier_rank=0):
            if torch.distributed.get_rank() == 0:
                for name, param in self.model.module.named_parameters():
                    if param is None:
                        continue
                    # can't rely on param.data_ptr() as it will be reused as weights gets
                    # gathered and reduced, but param.ds_id is unique across all zero weights
                    # (and shared params will have the same param.ds_id)
                    if param.ds_id in shared_params:
                        # shared weights
                        #print(f"`{key}` is shared with `{shared_params[param.ds_id]}`")
                        state_dict[name] = state_dict[shared_params[param.ds_id]]
                    else:
                        state_dict[name] = param.detach().cpu()
                        shared_params[param.ds_id] = name
                    #print(f"param {param.ds_id} {param.shape} {key} ")

                # now buffers - not sure if need to take care of potentially shared weights here
                for name, buf in self.model.module.named_buffers():
                    if (buf is not None and name not in self.model.module._non_persistent_buffers_set):
                        state_dict[name] = buf.detach().cpu()

        if len(self.model.optimizer.persistent_parameters) > 0:
            self.model.optimizer.persistent_parameters[0].all_gather(self.model.optimizer.persistent_parameters)

        if torch.distributed.get_rank() == 0:
            torch.save(state_dict, os.path.join(output_dir, 'pytorch_model.bin'))

        torch.distributed.barrier()

Model performance after full fine-tuning with LOMOTrainer

I have fully fine-tuned my model with LOMO. In more detail: I'm using bloomz-7b1-mt as the backbone and fine-tuning it on the Alpaca instruction dataset, with my own data-processing pipeline, simply swapping the Trainer for your LOMOTrainer. However, the results are quite bad: at inference the model generates many garbled characters, which full fine-tuning or LoRA done the normal way does not, and I am sure the garbled characters are not in my training dataset. I think the problem is in the optimizer and the training config. Could you take a quick look at my training script?

CUDA_VISIBLE_DEVICES=0,1 WANDB_DISABLED=True deepspeed --master_port=19121 train.py \
    --deepspeed config/ds_config.json \
###my config
    --model_name_or_path bigscience/bloomz-7b1-mt \
    --data_path /home/jovyan/vol-1/dat/data/alpaca/alpaca_vi_expert_conversation.json \
    --output_dir ~/vol-1/dat/checkpoints/bloom-v5.2-lomo-full-ft \
    --model_max_length 1024 \
###your config
    --do_train True \
    --do_eval False \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_steps 10000 \
    --save_total_limit 10 \
    --learning_rate 3e-2 \
    --weight_decay 0 \
    --warmup 0.01 \
    --lr_scheduler_type "linear" \
    --clip_grad_norm  1.0 \
    --logging_steps 100 \

I'm not using your train.py file as-is; I just modified the data-processing pipeline and replaced your 'DataArgument' and 'ModelArgument', but kept 'MyTrainingArgument' for LOMOTrainer.
And my deepspeed config file:

{
    "bf16": {
        "enabled": false
    },
    "fp16": {
        "enabled": true
    },
    "zero_allow_untested_optimizer": true,
    "zero_force_ds_cpu_optimizer": false,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
             "device": "cpu",
             "pin_memory": true
         },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e8,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": 1,
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 2,
    "wall_clock_breakdown": false
}

The reason I didn't add an evaluation dataset to track training directly is that instruction-following fine-tuning does not have a clear benchmark.
For all the important settings (learning rate, weight_decay, lr_scheduler_type, ...) I took the values from your sample config files, but I see that args_lomo.yaml does not set the optimizer, while args_lomo_lora.yaml sets it to 'sgd'. So my understanding is: when training without LoRA, you use the default optimizer of Seq2SeqArguments (Adam), and when switching to LOMO + LoRA it becomes SGD for the pretrained weights and AdamW for the LoRA weights. Am I understanding this correctly?

The final question: could you help me figure out why full fine-tuning generates so many garbled characters, and help optimize my training config? I'm using 2 A100 40GB GPUs.

Train with other data collators/loaders

Hello friend!

First of all, thanks for your amazing work!!

If I use another data collator/dataset loader, would I still be able to train using the LOMO trainer class?

type object 'torch._C._distributed_c10d.ReduceOp' has no attribute 'AVG'

Traceback (most recent call last):
  File "src/train_lomo.py", line 136, in <module>
    train()
  File "src/train_lomo.py", line 129, in train
    trainer.train()
  File "/workspace/LOMO/src/lomo_trainer.py", line 116, in train
    self.optimizer.grad_norm(loss)
  File "/workspace/LOMO/src/lomo.py", line 186, in grad_norm
    loss.backward(retain_graph=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
  File "/workspace/LOMO/src/lomo.py", line 117, in func
    torch.distributed.all_reduce(p.grad, op=torch.distributed.ReduceOp.AVG, async_op=False)
AttributeError: type object 'torch._C._distributed_c10d.ReduceOp' has no attribute 'AVG'

https://github.com/OpenLMLab/LOMO/blob/ee7d431344569bc69ff7283b70141b5c6d66c901/src/lomo.py#L117C23-L117C23

Is this an issue with my torch version (1.10.0)? How should I handle it?
Thanks for your reply.
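ReduceOp.AVG is not available in torch 1.10, which matches the error. One workaround sketch (an assumption, not an official patch): replace the AVG all-reduce with a SUM all-reduce followed by division by the world size, which is mathematically equivalent:

    import torch
    import torch.distributed as dist

    def all_reduce_mean(grad: torch.Tensor) -> None:
        # Equivalent to all_reduce(grad, op=ReduceOp.AVG) on newer torch versions.
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=False)
        grad.div_(dist.get_world_size())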


Previously, it raised:

ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

So I changed

tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        use_fast=False,
        padding_side='left'
    )

to

tokenizer = LlamaTokenizer.from_pretrained(
        model_args.model_name_or_path,
        use_fast=False,
        padding_side='left'
    )

Is this related?

Key Error: LOCAL_RANK

I'm trying to adapt the optimizer for another application, but I get the following error at src/lomo.py

---> 30 self.local_rank = int(os.environ["LOCAL_RANK"])
     31 self.world_size = dist.get_world_size()
     32 self.clip_grad_norm = clip_grad_norm

File ~\anaconda3\envs\pytorch-fast\lib\os.py:680, in _Environ.__getitem__(self, key)
    677     value = self._data[self.encodekey(key)]
    678 except KeyError:
    679     # raise KeyError with the original key value
--> 680     raise KeyError(key) from None
    681 return self.decodevalue(value)

KeyError: 'LOCAL_RANK'
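The optimizer reads LOCAL_RANK (and calls dist.get_world_size()), both of which only exist under a distributed launcher such as deepspeed or torchrun. A workaround sketch for single-process use (an assumption, not an official fix) is to fake a one-process "distributed" environment before constructing the optimizer:

    import os
    import torch.distributed as dist

    # Pretend to be rank 0 of a world of size 1.
    os.environ.setdefault("LOCAL_RANK", "0")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    if not dist.is_initialized():
        dist.init_process_group(backend="gloo", rank=0, world_size=1)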

Gradient accumulation

Is gradient accumulation still mathematically possible in this scheme?
If so, we could train a 65B model on a single 3090 in a day and a half.

Dataset question

Hi, I was really excited to see your paper; it gives those of us on a budget a glimmer of hope.
One question: how do I add a custom dataset? I have an instruction dataset in llama-alpaca format.
