microsoft / deepspeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Home Page: https://www.deepspeed.ai/

License: Apache License 2.0

Python 68.67% Shell 0.36% C++ 20.42% Cuda 10.05% Dockerfile 0.11% C 0.37% Batchfile 0.01%

deep-learning pytorch gpu machine-learning billion-parameters data-parallelism model-parallelism inference pipeline-parallelism compression

deepspeed's Issues

Pip install support

Hi. I looked at the install.sh file, and it looks like pip installation support is doable. It would be great before somebody nabs the pypi deepspeed module name.

Just one question: is it possible to train my model on a single GPU using this library and still obtain the reported benefits in memory consumption and training efficiency, or is this only achievable when using multiple GPUs?

Propagate PYTHONPATH like NCCL env variables

DeepSpeed run should propagate PYTHONPATH, just as we already do for the NCCL environment variables.
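
A minimal sketch of what such propagation could look like, assuming the launcher builds a list of shell `export` statements to run on each remote node. The helper name `build_exports` and its parameters are hypothetical and are not taken from DeepSpeed's launcher code.

```python
import os
import shlex

# Hypothetical helper for illustration; not part of DeepSpeed's launcher.
def build_exports(environ, prefixes=("NCCL",), extra_names=("PYTHONPATH",)):
    """Return shell `export` statements for variables that should reach remote nodes."""
    exports = []
    for name in sorted(environ):
        # Forward every NCCL* variable plus any explicitly listed names.
        if name.startswith(prefixes) or name in extra_names:
            exports.append("export {}={}".format(name, shlex.quote(environ[name])))
    return exports

if __name__ == "__main__":
    for line in build_exports(os.environ):
        print(line)
```

Quoting each value with `shlex.quote` keeps paths containing spaces intact when the statements are passed through a remote shell.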

Install script assumes shared filesystem for multi-node installs

Install script assumes shared filesystem for multi-node installs
Install script should allow hostfile path to be passed in

Port core API documentation

Installation documentation needed

Load model checkpoint without loading the optimizer states.

Extend the load_checkpoint API to allow loading a checkpoint without loading the optimizer states. This is useful during evaluation and fine-tuning. We need to make sure the FP32 model parameters are loaded alongside the FP16 parameters, to avoid immediate model divergence when the model is loaded without the optimizer states.

Update default deepspeed config

Simplify the DeepSpeed config JSON in the README to show reasonable defaults for getting started with DeepSpeed (e.g., disable_allgather is listed there but probably doesn't need to be).

NoneType has no attribute to

I get the following error when I run my training. When I comment out this line, the training works but the loss doesn't decrease:

File "/storage/home/ec2-user/ner/trainer/trainer.py", line 131, in _train_epoch
  self.model.step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/pt/deepspeed_light.py", line 628, in step
  self.optimizer.step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/pt/fp16_unfused_optimizer.py", line 165, in step
  fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)
AttributeError: 'NoneType' object has no attribute 'to'
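
The error means `fp16_param.grad` was `None` when the optimizer step ran. In PyTorch a parameter's `.grad` stays `None` until a backward pass has produced a gradient for it, so one possible cause is a parameter that is registered on the model but never used in the forward pass. The sketch below is a small diagnostic under that assumption. The helper `params_without_grad` is hypothetical, and the stand-in objects only mimic the `.grad` and `.requires_grad` attributes so the example runs without PyTorch.

```python
from types import SimpleNamespace

# Hypothetical diagnostic helper; works on any (name, param) pairs whose
# params expose `.grad` and `.requires_grad`, as PyTorch parameters do.
def params_without_grad(named_params):
    """Return names of trainable parameters whose .grad is still None."""
    return [name for name, p in named_params
            if getattr(p, "requires_grad", True) and p.grad is None]

# Stand-in objects so the sketch runs without PyTorch installed.
demo = [
    ("encoder.weight", SimpleNamespace(requires_grad=True, grad=[0.1, 0.2])),
    ("unused_head.weight", SimpleNamespace(requires_grad=True, grad=None)),
    ("frozen.bias", SimpleNamespace(requires_grad=False, grad=None)),
]
print(params_without_grad(demo))  # ['unused_head.weight']
```

With a real model, one would call it after the backward pass and before the step, passing `model.named_parameters()` as the argument.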

Turing NLG

Are there any plans to release the Turing NLG pre-trained model?

Following CIFAR Tutorial but Code Forcing RANK variable

I am trying to get DeepSpeed working and have been following the CIFAR tutorial example. In the example, local_rank=-1 and dist_init_required=None, since it runs on a single system (not distributed). However, it seems to force me to set RANK, LOCAL_RANK, and other distributed environment variables. Should dist_init_required be False?
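
As a possible workaround (not verified against the DeepSpeed version in question), the variables can be given single-process defaults before initialization. RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are the names read by torch.distributed's environment-variable initialization; LOCAL_RANK is the one named in the issue. The helper name and the port value are illustrative assumptions.

```python
import os

# Defaults describing a single process on a single machine.
# The port is an arbitrary free-port assumption.
SINGLE_PROCESS_DEFAULTS = {
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
}

# Hypothetical helper; not part of DeepSpeed.
def apply_single_process_defaults(environ=os.environ):
    """Fill in distributed variables only where the caller has not set them."""
    for name, value in SINGLE_PROCESS_DEFAULTS.items():
        environ.setdefault(name, value)
    return environ
```

Using `setdefault` leaves any values already exported by a launcher untouched, so the same script still works in a genuinely distributed run.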

DDLRUN + DeepSpeed on SUMMIT

Hi,

I am trying to use deepspeed on SUMMIT using ddlrun, but it doesn't work properly.
I am testing it with cifar like:
ddlrun deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json

Could you please give us an example of using DeepSpeed with Horovod, MPI, and ddlrun?

Install details

Add more details to the install section of the README to cover local vs. multi-node installs. Also update the resource configuration section to discuss single-node support.

max_grad_norm is ignored in FP16 training

I am currently fighting with a dynamic loss scale that is constantly decreasing due to gradient overflows. Setting max_grad_norm in the JSON config has no effect, since it is overridden in deepspeed_light.py:

DeepSpeed/deepspeed/pt/deepspeed_light.py

Lines 407 to 408 in 13fd3dc

if self.fp16_enabled() and 'max_grad_norm' in optimizer_parameters.keys():
    optimizer_parameters['max_grad_norm'] = 0.0

I think this modification should be removed.
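
The effect of the two quoted lines can be reproduced in isolation. This is a sketch that mirrors only that condition on a plain dictionary; the function name `configure_optimizer_params` is hypothetical, and the surrounding DeepSpeed code is not reproduced.

```python
# Hypothetical stand-in that mirrors only the two quoted lines.
def configure_optimizer_params(optimizer_parameters, fp16_enabled):
    """Return a copy of the parameters with the fp16 override applied."""
    params = dict(optimizer_parameters)
    if fp16_enabled and 'max_grad_norm' in params.keys():
        # Under fp16 the user's configured value is replaced by 0.0.
        params['max_grad_norm'] = 0.0
    return params

user_params = {"lr": 0.00015, "max_grad_norm": 1.0}
print(configure_optimizer_params(user_params, fp16_enabled=True))
print(configure_optimizer_params(user_params, fp16_enabled=False))
```

With fp16 enabled the returned dictionary carries `max_grad_norm` of 0.0 regardless of the configured value, which matches the behavior described above.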

Kind regards

Local Install Issue - Apex

I noticed that in the install.sh file when doing a local install, sh install.sh -l, the installer seems to uninstall and reinstall apex twice. Is this normal behavior?

[Question]: Does DeepSpeed only work on GPU clusters?

Does DeepSpeed also work for a cluster of VMs without GPUs?

training the 20 and 8 billion model failed on SUMMIT

Hello,

I am trying to train the 8 billion and the 20 billion parameter models on SUMMIT, and both failed.
SUMMIT has 6 Nvidia V100 16GB GPUs per node.
Both the 8 billion and the 20 billion parameter models run out of memory (OOM).

The training commands are:

export MP_SIZE=6

jsrun -n${NODES} -a6 -c42 -g6 -r1 --smpiargs $SMPIARGS python pretrain_bert.py --sharedfile=$SHAREDFILE \
       --deepspeed_mpi --deepspeed --deepspeed_config ${DS_CONFIG} \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 100 \
       --hidden-size 3720 \
       --num-attention-heads 30 \
       --batch-size 1 \
       --seq-length 512 \
       --max-preds-per-seq 76 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --save ${SAVEPATH} \
       --use-tfrecords \
       --train-data ${TRAINDATAPATH} \
       --tokenizer-type BertWordPieceTokenizer \
       --tokenizer-model-type ${VOCABPATH} \
       --presplit-sentences \
       --cache-dir ${CACHEPATH} \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.0001 \
       --lr-decay-style linear \
       --lr-decay-iters 990000 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --fp32-layernorm \
       --fp32-embedding \
       --vocab-size 30 \
       --make-vocab-size-divisible-by 5 \
       --checkpoint-activations \
       --checkpoint-num-layers 1

jsrun -n${NODES} -a6 -c42 -g6 -r1 --smpiargs $SMPIARGS python pretrain_bert_nccl.py --sharedfile=$SHAREDFILE \
       --deepspeed_mpi --deepspeed --deepspeed_config ${DS_CONFIG} \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 72 \
       --hidden-size 3072 \
       --num-attention-heads 24 \
       --batch-size 1 \
       --seq-length 512 \
       --max-preds-per-seq 76 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --save ${SAVEPATH} \
       --use-tfrecords \
       --train-data ${TRAINDATAPATH} \
       --tokenizer-type BertWordPieceTokenizer \
       --tokenizer-model-type ${VOCABPATH} \
       --presplit-sentences \
       --cache-dir ${CACHEPATH} \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.0001 \
       --lr-decay-style linear \
       --lr-decay-iters 990000 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --fp32-layernorm \
       --fp32-embedding \
       --vocab-size 30 \
       --make-vocab-size-divisible-by 5 \
       --checkpoint-activations \
       --checkpoint-num-layers 1

The config file is:

{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "zero_optimization": true,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "max_grad_norm": 1.0
    }
  },

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  } 
}

I am testing it on 1 node and even after I reduced the train batch size to 1, it didn't work:


The logs are:
  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. False
  loose_json ................... False
  presplit_sentences ........... True
  num_workers .................. 2
  tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... BertWordPieceTokenizer
  cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
  use_tfrecords ................ True
  seq_length ................... 512
  max_preds_per_seq ............ 76
  deepspeed .................... True
  deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ True
  sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
  cuda ......................... True
  rank ......................... 0
  world_size ................... 6
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 04:40:19.647170: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 04:40:22.566024 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0229 04:40:22.567073 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0229 04:40:22.567220 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-02-29 04:40:22.567455: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 04:40:22.570236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 04:40:22.572765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 04:40:22.575278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 04:40:22.577850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 04:40:22.580415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 04:40:22.582986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 04:40:22.583008: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 04:40:22.583068: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 04:40:22.583108: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 04:40:22.583146: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 04:40:22.585072: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 04:40:22.585118: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 04:40:22.585156: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 04:40:22.615387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 04:40:22.623295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 04:40:22.623314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
W0229 04:40:22.646660 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 04:40:25.123421 35184372395936 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0229 04:40:25.123578 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 04:40:25.123658 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 04:40:25.149839: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 04:40:25.153336 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 04:40:25.153439 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 04:40:25.154995 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

W0229 04:40:25.166115 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir  /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:125722:125722 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125722:125722 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125722 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:125724:125724 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125726:125726 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125727:125727 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125723:125723 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125971 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125725:125725 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125725:125725 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125725:125725 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125725:125992 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:125723:125993 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:125724:125994 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:125726:125995 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:125727:125996 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:125722:125971 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:125722:125971 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:125726:125995 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:125725:125992 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:125724:125994 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:125722:125971 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:125722:125971 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:125722:125722 [0] NCCL INFO Launch mode Parallel
building BERT model ...
h36n18:125726:125995 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:125723:125993 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
 > number of parameters on model parallel rank 0: 2799983247
h36n18:125722:126579 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125722:126579 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:125722:126579 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
 > number of parameters on model parallel rank 5: 2799983247
 > number of parameters on model parallel rank 3: 2799983247
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 579, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
  File "pretrain_bert_nccl.py", line 170, in setup_model_and_optimizer
    optimizer = get_optimizer(model, args)
  File "pretrain_bert_nccl.py", line 141, in get_optimizer
    'delayed_shift': args.hysteresis})
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 198, in __init__
    master_param = param.detach().clone().float()
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.75 GiB total capacity; 14.50 GiB already allocated; 16.94 MiB free; 373.95 MiB cached; 0 bytes inactive)
 > number of parameters on model parallel rank 2: 2799983247
 > number of parameters on model parallel rank 1: 2799983247
 > number of parameters on model parallel rank 4: 2799983247
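
A rough back-of-the-envelope estimate helps explain the failure at `master_param = param.detach().clone().float()`. The sketch below takes the per-rank parameter count printed in the log and assumes 2 bytes per fp16 weight, 4 bytes per fp32 master weight, and two fp32 Adam state tensors per parameter. It ignores gradients, activations, and the CUDA context, so it is a lower bound rather than an exact accounting.

```python
GIB = 2 ** 30

def mixed_precision_footprint(num_params):
    """Estimate bytes needed per GPU for weights and optimizer state."""
    return {
        "fp16_weights": 2 * num_params,     # half-precision model copy
        "fp32_master": 4 * num_params,      # full-precision master copy
        "adam_states": 2 * 4 * num_params,  # assumed fp32 momentum and variance
    }

params_per_rank = 2799983247  # from "number of parameters on model parallel rank 0"
footprint = mixed_precision_footprint(params_per_rank)
for name, num_bytes in footprint.items():
    print("{:>13}: {:6.2f} GiB".format(name, num_bytes / GIB))
print("{:>13}: {:6.2f} GiB".format("total", sum(footprint.values()) / GIB))
```

Under these assumptions the fp16 weights and the fp32 master copy alone come to roughly 15.6 GiB, against the 15.75 GiB total capacity reported in the error, and the Adam states would add about 20.9 GiB more. The log also shows a data-parallel communicator with `nranks 1`, so in this single-node run there are no other data-parallel ranks to partition optimizer state across.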


  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. False
  loose_json ................... False
  presplit_sentences ........... True
  num_workers .................. 2
  tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... BertWordPieceTokenizer
  cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
  use_tfrecords ................ True
  seq_length ................... 512
  max_preds_per_seq ............ 76
  deepspeed .................... True
  deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ True
  sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
  cuda ......................... True
  rank ......................... 0
  world_size ................... 6
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 05:07:35.425203: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 05:07:38.074505 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0229 05:07:38.074888 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0229 05:07:38.075031 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-02-29 05:07:38.075261: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 05:07:38.078041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 05:07:38.080565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 05:07:38.083095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 05:07:38.085669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 05:07:38.088239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 05:07:38.090805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 05:07:38.090827: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 05:07:38.090887: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 05:07:38.090926: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 05:07:38.090965: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 05:07:38.092861: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 05:07:38.092907: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 05:07:38.092946: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 05:07:38.123406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 05:07:38.130912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 05:07:38.130926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
W0229 05:07:38.154345 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 05:07:39.526942 35184372395936 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0229 05:07:39.527102 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 05:07:39.527187 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 05:07:39.553327: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 05:07:39.556849 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 05:07:39.556953 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 05:07:39.559207 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

W0229 05:07:39.570396 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir  /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:127714:127714 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127714:127714 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127714 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:127718 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127719:127719 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127717:127717 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127963 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127716:127716 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127715:127715 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127984 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127715:127985 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127719:127986 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127717:127987 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127718:127988 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127714:127963 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:127963 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:127715:127985 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127717:127987 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127716:127984 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:127714:127963 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:127963 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127715:127985 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:127988 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
building BERT model ...
 > number of parameters on model parallel rank 0: 1381032967
 > number of parameters on model parallel rank 1: 1381032967
 > number of parameters on model parallel rank 5: 1381032967
 > number of parameters on model parallel rank 3: 1381032967
 > number of parameters on model parallel rank 2: 1381032967
h36n18:127714:128267 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127714:128267 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127714:128267 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
 > number of parameters on model parallel rank 4: 1381032967
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127715:128279 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127715:128279 [1] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127715:128279 [1] NCCL INFO comm 0x2001c8006620 rank 0 nranks 1 cudaDev 1 nvmlDev 1 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127719:128281 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128281 [5] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127719:128281 [5] NCCL INFO comm 0x2001ec006620 rank 0 nranks 1 cudaDev 5 nvmlDev 5 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127716:128283 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127716:128283 [2] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127716:128283 [2] NCCL INFO comm 0x200340006620 rank 0 nranks 1 cudaDev 2 nvmlDev 2 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127717:128286 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127717:128286 [3] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127717:128286 [3] NCCL INFO comm 0x200320006620 rank 0 nranks 1 cudaDev 3 nvmlDev 3 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:128288 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127718:128288 [4] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127718:128288 [4] NCCL INFO comm 0x2001f4006620 rank 0 nranks 1 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127718:128337 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128338 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127715:128339 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127717:128341 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127716:128340 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127714:128336 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:128336 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:127719:128338 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO comm 0x200408006620 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127715:128339 [1] NCCL INFO comm 0x200424006620 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:128337 [4] NCCL INFO comm 0x200410006620 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127717:128341 [3] NCCL INFO comm 0x20033c006620 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:128336 [0] NCCL INFO comm 0x200718006620 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127716:128340 [2] NCCL INFO comm 0x20035c006620 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
learning rate decaying linear
Partition Activations False and Correctness Check False
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.75 GiB total capacity; 14.04 GiB already allocated; 580.94 MiB free; 200.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 2; 15.75 GiB total capacity; 14.13 GiB already allocated; 586.94 MiB free; 188.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 1; 15.75 GiB total capacity; 14.13 GiB already allocated; 582.88 MiB free; 192.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 5; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 4; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 3; 15.75 GiB total capacity; 14.16 GiB already allocated; 558.94 MiB free; 192.72 MiB cached; 0 bytes inactive)

My understanding from Table 8 of the paper is that you were able to train both the 8 billion and 20 billion parameter models on 4 x 16GB GPUs using 4-way model parallelism.
In my case I am using 6-way model parallelism with batch size 1 and it doesn't work.

Did I misunderstand something?
Do you have any idea how to make it work?
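
For reference, a rough back-of-the-envelope estimate of the per-GPU model state is sketched below. It takes the parameter count per model-parallel rank from the log above and assumes fp16 weights and gradients, fp32 master weights and master gradients (the allocation that fails in the traceback), and an Adam-style optimizer with two fp32 buffers per parameter. The byte counts are assumptions about this setup, not measurements, and activations, buffers and any partitioning across data-parallel ranks are ignored.

```python
GIB = 1024 ** 3

def mixed_precision_state_gib(params_per_gpu, optimizer_bytes_per_param=8):
    """Estimate per-GPU model state (in GiB) for fp16 training with fp32 master copies.

    The bytes-per-parameter figures are assumptions for illustration only.
    """
    bytes_per_param = {
        "fp16 weights": 2,
        "fp16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 master gradients": 4,
        "optimizer state": optimizer_bytes_per_param,  # assumed Adam: momentum + variance
    }
    return {name: params_per_gpu * b / GIB for name, b in bytes_per_param.items()}

# Parameter count reported per model parallel rank in the log above.
params = 1_381_032_967
breakdown = mixed_precision_state_gib(params)
for name, gib in breakdown.items():
    print(f"{name:24s} {gib:6.2f} GiB")
print(f"{'total (no activations)':24s} {sum(breakdown.values()):6.2f} GiB")
```

Under these assumptions the weights, gradients and master copies alone come to roughly 15.4 GiB, which is already at the edge of a 15.75 GiB device before any optimizer state or activations are counted.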

Azure tutorial + documentation

Unable to detect local GPU Resources

When I run DeepSpeed locally from the deepspeed/deepspeed:latest Docker container, it is unable to detect my local NVIDIA GTX 1080.

Edit: I am on Windows 10, which is complicating this issue.

When is T-NLG open source?

Fix cifar tutorial links

There are two dead links at the top of this page: https://github.com/microsoft/DeepSpeed/blob/master/docs/tutorials/CIFAR-10.md

README : Missing list of supported Schedulers

In the README section on JSON configuration, the list of supported schedulers is missing:

https://github.com/microsoft/DeepSpeed/blob/master/docs/config_json.md#scheduler-parameters

Currently, only 3 schedulers are supported, right?

From: https://microsoft.github.io/DeepSpeed/docs/htmlfiles/api/full/pt/deepspeed_lr_schedules.m.html

LRRangeTest
OneCycle
WarmupLR

Make optimizer field optional in JSON config

A missing optimizer field causes a crash, even though the optimizer field is supposed to be optional.
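
A minimal sketch of how an optional optimizer section could be read without crashing is shown below. The helper name `get_optimizer_config` is hypothetical and is not DeepSpeed's actual config parser; it only illustrates treating a missing `"optimizer"` key as "the caller supplies the optimizer".

```python
import json

def get_optimizer_config(ds_config):
    """Return (optimizer_type, params), or (None, {}) when no optimizer is configured.

    Hypothetical helper for illustration; not part of DeepSpeed.
    """
    section = ds_config.get("optimizer")
    if section is None:
        # Optional field: the caller is expected to provide its own optimizer.
        return None, {}
    if "type" not in section:
        raise ValueError("'optimizer' section is present but has no 'type'")
    return section["type"], section.get("params", {})

# A config without an optimizer section no longer raises.
config = json.loads('{"train_batch_size": 4}')
print(get_optimizer_config(config))  # prints (None, {})
```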

Initialization of two nn.Modules (e.g. generator and discriminator)

Dear DeepSpeed-Team,

First of all, thank you for your effort. I was very excited to hear about this approach.

I am currently trying to build a GAN, which requires me to initialize two networks. I tried the following without success:

generator_engine, _, _, __ = deepspeed.initialize(args=args,
                                                  model=self.generator,
                                                  model_parameters=filter(
                                                      lambda p: p.requires_grad,
                                                      self.generator.parameters()),
                                                  training_data=data)
discriminator_engine, _, data_loader, __ = deepspeed.initialize(args=args,
                                                                model=self.discriminator,
                                                                model_parameters=filter(
                                                                    lambda p: p.requires_grad,
                                                                    self.discriminator.parameters()),
                                                                training_data=data)

Executing this, I get the following:

DeepSpeed info: version=0.1.0, git-hash=6d60206, git-branch=master
File "/home/deepspeed/Code/identification/generative/CPGAN/orchestrator_msggan_deepspeed.py", line 253, in train
training_data=data)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/__init__.py", line 95, in initialize
Traceback (most recent call last):
File "identification/generative/CPGAN/train_deepspeed.py", line 186, in <module>
collate_fn=collate_fn)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/pt/deepspeed_light.py", line 123, in __init__
dist.init_process_group(backend="nccl")
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 372, in init_process_group
main()
File "identification/generative/CPGAN/train_deepspeed.py", line 179, in main
raise RuntimeError("trying to initialize the default process group "
RuntimeError: trying to initialize the default process group twice!
save_every_n_steps=args.save_every_n_steps)
File "/home/deepspeed/Code/identification/generative/CPGAN/orchestrator_msggan_deepspeed.py", line 253, in train
training_data=data)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/__init__.py", line 95, in initialize
collate_fn=collate_fn)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/pt/deepspeed_light.py", line 123, in __init__
dist.init_process_group(backend="nccl")
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 372, in init_process_group
raise RuntimeError("trying to initialize the default process group "
RuntimeError: trying to initialize the default process group twice!

It would be nice to get a pointer on how to tackle such a situation, especially since it is a very common use case.

Kind regards

ZeRO with non-zero loss scale crashes

Typically users want to use dynamic loss scaling. While developing a new feature for ZeRO, I discovered that ZeRO crashes when given a non-zero loss scale value in DeepSpeed's config JSON. I've created one unit test that passes with ZeRO disabled and another with ZeRO enabled that triggers this bug, so we can retest once it is fixed.

https://github.com/microsoft/DeepSpeed/blob/jeffra/zero_loss_scale_bug/tests/unit/test_fp16.py#L211-L285

Failed test with stack trace: https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_build/results?buildId=198&view=logs&j=75347757-894e-5c54-3c11-df095f4d729a&t=50de4b86-57af-55e8-ca98-b1a0d42235e2

Here's the stack trace:

Allow DeepSpeed init to accept dictionary instead of args

We should remove the need for args to be passed to deepspeed init. Ideally we should accept a dictionary that represents everything, so we can write smaller tests.
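
One possible shape for this is sketched below: accept a dictionary directly and fall back to the JSON file named by `args.deepspeed_config` only when no dictionary is given. The function name `resolve_config` and its `config` parameter are hypothetical; only the `deepspeed_config` argument name comes from the existing command line.

```python
import json
import os
import tempfile
from types import SimpleNamespace

def resolve_config(args=None, config=None):
    """Return the config as a dict, from an explicit dict or from args.deepspeed_config.

    Hypothetical helper for illustration; not DeepSpeed's API.
    """
    if config is not None:
        if not isinstance(config, dict):
            raise TypeError("config must be a dict")
        return dict(config)  # copy so callers cannot mutate our view
    path = getattr(args, "deepspeed_config", None)
    if path is None:
        raise ValueError("pass either a config dict or args.deepspeed_config")
    with open(path) as fh:
        return json.load(fh)

# Tests can pass a dictionary and skip argparse and temp files entirely.
from_dict = resolve_config(config={"train_batch_size": 4})

# The file-based path still works for existing scripts.
fd, tmp_path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as fh:
    json.dump({"train_batch_size": 8}, fh)
from_file = resolve_config(args=SimpleNamespace(deepspeed_config=tmp_path))
os.remove(tmp_path)
```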

Support configuration for general devices and backends.

DeepSpeed currently assumes NVIDIA GPUs using the NCCL backend. It would be nice to support more general configurations.

Non-exhaustive list of things to consider:

Configuration mechanisms (e.g., JSON config file, deepspeed.initialize())
Data movement
Resource querying and specification: we currently query the number of local GPUs and would need to add additional capabilities for CPUs, etc.
Documentation: often assumes GPUs and would need to be revised to be more general (e.g., https://github.com/microsoft/DeepSpeed#resource-configuration-multi-node)

Catch spawned process failures and terminate

The DeepSpeed launcher should detect failed processes and then ensure that the remaining children are joined with a timeout. The distributed_test decorator does this. We should more rigorously evaluate that and see if it's appropriate for deepspeed_run.
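
A minimal sketch of the supervision loop described above, using only the standard library, might look like the following. The function name `run_and_supervise` and its timeout defaults are assumptions for illustration; this is not the launcher's or the distributed_test decorator's actual code.

```python
import subprocess
import sys
import time

def run_and_supervise(commands, poll_interval=0.1, join_timeout=5.0):
    """Start one child per command; if any child fails, stop the rest.

    Returns 0 when every child exits cleanly, otherwise the first
    non-zero exit code that was observed.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    while True:
        codes = [p.poll() for p in procs]
        failed = [c for c in codes if c not in (None, 0)]
        if failed:
            failed_code = failed[0]
            break
        if all(c == 0 for c in codes):
            return 0
        time.sleep(poll_interval)

    # A child failed: ask the survivors to stop, then join with a shared deadline.
    for p in procs:
        if p.poll() is None:
            p.terminate()
    deadline = time.monotonic() + join_timeout
    for p in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            p.kill()  # did not exit in time, force it
            p.wait()
    return failed_code

# One long-running child and one that fails immediately.
sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
failer = [sys.executable, "-c", "import sys; sys.exit(3)"]
exit_code = run_and_supervise([sleeper, failer])
```

Here the sleeping child is terminated as soon as the failing one is noticed, so the call returns well before the 30 seconds elapse.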

Support old and new apex optimizer fusion

apex now provides a new way of optimizer fusion which is incompatible with the old way. One key API difference is that the old step() required additional arguments, while the new step() takes no arguments.

This work item will provide a smoother transition for users from old to new fusion by making the desired fusion style a configuration option.

_init_distributed does not use dist_init_required

In the deepspeed_light.py script, the function _init_distributed is passed dist_init_required, but this variable is not used. If dist_init_required is False, this causes an AssertionError even though the functionality should be disabled.

add allreduce test

batch config issue?

There are a few things in the train batch size configuration that do not seem correct to me, and there are a few things that we do not currently support.

The following assertion

train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size

should always hold, but currently it does not in some cases.
For example, when train_micro_batch_size_per_gpu and gradient_accumulation_steps are None in the ds_config, they are initialized to train_batch_size and 1 respectively, which leads to

train_batch_size == train_batch_size * 1 * world_size

If train_micro_batch_size_per_gpu > per_device_batch_size, we should throw a config error. Currently, it is assigned to be equal to per_device_batch_size.
We do not currently support the user providing only train_micro_batch_size_per_gpu, or only train_micro_batch_size_per_gpu and gradient_accumulation_steps.
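
One way the resolution could work so that the invariant always holds is sketched below. The function `resolve_batch_config` is hypothetical, not the current DeepSpeed implementation: it derives whichever values are missing, including the cases where only the micro batch size (with or without accumulation steps) is given, and raises on anything inconsistent instead of silently adjusting it.

```python
def resolve_batch_config(world_size, train_batch_size=None,
                         micro_batch_per_gpu=None, grad_accum_steps=None):
    """Fill in missing values so that train == micro * accum * world_size.

    Hypothetical helper for illustration; returns (train, micro, accum).
    """
    given = {
        "world_size": world_size,
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
    }
    for name, value in given.items():
        if value is not None and value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")

    t, m, g = train_batch_size, micro_batch_per_gpu, grad_accum_steps
    if t is None and m is None:
        raise ValueError("need train_batch_size or train_micro_batch_size_per_gpu")
    if t is None:
        # Only micro batch (and maybe accumulation steps) given: derive the total.
        g = 1 if g is None else g
        t = m * g * world_size
    elif m is None:
        # Total (and maybe accumulation steps) given: derive the micro batch.
        g = 1 if g is None else g
        m = t // (g * world_size)
    elif g is None:
        # Total and micro batch given: derive the accumulation steps.
        g = t // (m * world_size)

    if min(t, m, g) < 1 or t != m * g * world_size:
        raise ValueError(
            f"inconsistent batch config: {t} != {m} * {g} * {world_size}")
    return t, m, g

# With only train_batch_size given on 4 GPUs, the micro batch becomes 8 / 4 = 2.
print(resolve_batch_config(4, train_batch_size=8))  # prints (8, 2, 1)
```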

Table of contents in README.md

Detect if init distributed is needed

Ideally we could remove the dist_init_required flag from deepspeed.initialize if we can detect whether the process group has already been initialized. This can prevent some types of bugs, like the one in #65.
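
A minimal sketch of the detection, assuming the check is done through `torch.distributed.is_initialized()` before calling `init_process_group`: the helper name `ensure_process_group` is hypothetical, and the `_FakeDist` class is only a stand-in so the sketch runs without PyTorch or a GPU. In real code the `dist` argument would be the `torch.distributed` module.

```python
def ensure_process_group(dist, backend="nccl"):
    """Initialize the default process group only if nobody has done so yet.

    `dist` is expected to behave like torch.distributed.
    Returns True when this call performed the initialization.
    """
    if dist.is_initialized():
        return False
    dist.init_process_group(backend=backend)
    return True


class _FakeDist:
    """Stand-in for torch.distributed, used only to exercise the sketch."""

    def __init__(self):
        self._ready = False
        self.init_calls = 0

    def is_initialized(self):
        return self._ready

    def init_process_group(self, backend):
        if self._ready:
            raise RuntimeError(
                "trying to initialize the default process group twice!")
        self._ready = True
        self.init_calls += 1


fake = _FakeDist()
first = ensure_process_group(fake)   # performs the initialization
second = ensure_process_group(fake)  # detects it and does nothing
```

With a guard like this, initializing a second engine in the same process would skip the process group setup instead of raising.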

FP32 Mode for ZeRO

I found that ZeRO only works in FP16 mode. Are there any plans to also introduce an FP32 mode?

Kind regards

ZeRO optimizer LAMB compatibility

My use case for this library is mostly for BERT models, as opposed to Megatron+ sized LMs. ZeRO in that context is mainly useful for fitting larger batch sizes and increasing throughput. For that reason, I'm wondering if/when you are planning on adding a ZeRO compatible LAMB optimizer.

Write CIFAR10 tutorial

How does DeepSpeed implement multi-machine model parallelism?

Hi,
How does DeepSpeed implement multi-machine model parallelism, given that PyTorch only supports single-machine model parallelism?
Are there any other docs about DeepSpeed's model parallelism?

Distributed unit tests.

DeepSpeed needs an easy way to write distributed, multi-GPU unit tests.

Port DeepSpeed overview documentation

Undefined name: f --> bare except

https://github.com/microsoft/DeepSpeed/blob/master/tests/model/Megatron_GPT2/run_checkpoint_test.py#L113-L115

The variable f is an undefined name in this context. This will raise a NameError, which will be caught by the bare except, so "No old checkpoint" will be printed. How should f be defined?
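
The effect can be reproduced in isolation. The two functions below are hypothetical and do not reproduce the linked test script; they only show how a bare except hides a NameError, and how catching the specific exception lets the real bug surface.

```python
import os
import tempfile

def cleanup_buggy(path):
    """A bare except swallows the NameError raised by the undefined name."""
    try:
        f.close()  # `f` is never defined in this scope -> NameError
        os.remove(path)
    except:  # noqa: E722 (deliberately bare to show the problem)
        return "No old checkpoint"
    return "removed"

def cleanup_fixed(path):
    """Catch only the error that really means 'nothing to remove'."""
    try:
        os.remove(path)
    except FileNotFoundError:
        return "No old checkpoint"
    return "removed"

# Create a real file so that "No old checkpoint" is clearly the wrong answer.
fd, checkpoint_path = tempfile.mkstemp()
os.close(fd)

buggy_result = cleanup_buggy(checkpoint_path)  # the file exists, yet it is reported missing
still_there = os.path.exists(checkpoint_path)
fixed_result = cleanup_fixed(checkpoint_path)
```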

Error while initializing multiple models

Hi.

I'm trying to use deepspeed in my code with multiple models, but I get the error below. Do you have any idea how to solve this issue? Thanks in advance.

  File "train_ds.py", line 98, in <module>
    solver = Solver(opt)
  File "/data2/1konny/svg/solver_ds.py", line 40, in __init__
    self.init_models_and_optimizers()
  File "/data2/1konny/svg/solver_ds.py", line 117, in init_models_and_optimizers
    self.decoder, self.decoder_optimizer, _, _ = ds.initialize(opt, model=decoder, model_parameters=decoder_params)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/__init__.py", line 87, in initialize
    collate_fn=collate_fn)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/pt/deepspeed_light.py", line 123, in __init__
    dist.init_process_group(backend="nccl")
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 372, in init_process_group
    raise RuntimeError("trying to initialize the default process group "
RuntimeError: trying to initialize the default process group twice!

ds_config.json

{
  "train_batch_size": 4,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0001,
      "max_grad_norm": 1.0,
      "betas": [
         0.9,
         0.999
       ]
    }
  }
}

command-line

deepspeed train_ds.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json ...

code

training_data = load_dataset()
encoder_params = filter(lambda p: p.requires_grad, encoder.parameters())
decoder_params = filter(lambda p: p.requires_grad, decoder.parameters())
self.encoder, self.encoder_optim, train_loader, _ = deepspeed.initialize(opt, model=encoder, model_parameters=encoder_params, training_data=training_data)
self.decoder, self.decoder_optim, _, _ = deepspeed.initialize(opt, model=decoder, model_parameters=decoder_params)

Conda Environment Install Issue

I'm trying to get DeepSpeed installed for local use in a Conda environment, but DeepSpeed does not seem to install into the environment itself. After building the wheel, DeepSpeed is not installed into the proper Conda environment location, while Apex is. It is unclear why DeepSpeed is not working but Apex is.

Config and core arguments API docstrings

The documentation for add_core_arguments() and add_config_arguments() could be expanded... Can we document what arguments are added in the docstrings? (these things could be unit tested too)

Megatron tutorial

We need to port the Megatron tutorial into docs/ and then update links to it in README.md, etc.

TypeError: FP16_DeepSpeedZeroOptimizer is not an Optimizer

I'm trying to use the 1-Cycle scheduler, but I get the following error:

TypeError: FP16_DeepSpeedZeroOptimizer is not an Optimizer

Here is my configuration file:

{
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 3e-05,
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "gradient_clipping": 0.1,
    "scheduler": {
        "type": "OneCycle",
        "params": {
            "cycle_first_step_size": 16000,
            "cycle_first_stair_count": 8000,
            "decay_step_size": 16000,
            "cycle_min_lr": 1e-06,
            "cycle_max_lr": 3e-05,
            "decay_lr_rate": 1e-07,
            "cycle_min_mom": 0.85,
            "cycle_max_mom": 0.99,
            "decay_mom_rate": 0.0
        }
    },
    "zero_optimization": true,
    "disable_allgather": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "min_loss_scale": 1
    }
}

With another scheduler (and FP16 enabled), I have no problem.
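One plausible mechanism for this message, not confirmed in this thread, is a scheduler that validates its optimizer with isinstance while the FP16/ZeRO wrapper holds an optimizer without subclassing the optimizer base class. The sketch below reproduces that pattern with stand-in classes; none of the names are DeepSpeed's or PyTorch's real classes.

```python
class Optimizer:
    """Stand-in for an optimizer base class such as torch.optim.Optimizer."""


class Adam(Optimizer):
    """Stand-in for a regular optimizer that inherits from the base class."""


class WrappedOptimizer:
    """Stand-in for a wrapper that holds an optimizer without subclassing it."""

    def __init__(self, inner):
        self.optimizer = inner


def check_optimizer(optimizer):
    """Raise TypeError unless `optimizer` inherits from the base class."""
    if not isinstance(optimizer, Optimizer):
        raise TypeError(
            "{} is not an Optimizer".format(type(optimizer).__name__))
    return optimizer
```

Under this assumption, check_optimizer(Adam()) passes, while check_optimizer(WrappedOptimizer(Adam())) raises a TypeError of the same shape as the one reported above.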

Detect split brain issues w.r.t. user code and deepspeed batch sizes

Often user code will have a user-defined batch size and the DeepSpeed config JSON will have its own batch size. When using gradient accumulation this can cause bugs where DeepSpeed thinks the number of gradient accumulation steps should be different from what the user code is doing.

If the user is using the default collate_fn then DeepSpeed should be able to detect and throw an exception in these cases. We can check to see what batch size is being passed in the forward pass by inspecting the first dimension.

Lastly, we probably want to add an error suppression flag in the DeepSpeed config to allow users to turn off this error if they know what they are doing and their batch alignment is non-standard.
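A minimal sketch of the proposed check follows. The names check_micro_batch_size, BatchSizeMismatchError, and the `suppress` flag are hypothetical. The sketch assumes the batch exposes its size either through `.shape[0]` (as tensors do) or through `len()`.

```python
class BatchSizeMismatchError(ValueError):
    """Raised when the observed batch size differs from the configured one."""


def check_micro_batch_size(batch, expected, suppress=False):
    """Compare the first dimension of `batch` with the configured size.

    Returns the observed batch size. Raises BatchSizeMismatchError on a
    mismatch unless `suppress` is True.
    """
    # Tensors expose .shape; plain sequences fall back to len().
    actual = batch.shape[0] if hasattr(batch, "shape") else len(batch)
    if actual != expected and not suppress:
        raise BatchSizeMismatchError(
            "forward pass received batch size {}, but the config expects {}"
            .format(actual, expected))
    return actual
```

A real implementation would probably also need to tolerate a smaller final batch at the end of an epoch, which is one reason a suppression flag is useful.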

train_batch_size + dataset + actual batch size

Hello,

I have 4 questions for clarification:

1. Why should we pass training_data to deepspeed.initialize to generate a new trainloader, rather than using a normal torch trainloader?
2. Can we use a custom PyTorch trainloader if we have a custom dataset that returns, for example, inputs, outputs, and a mask?
3. What happens if the actual batch size passed to the model differs from train_batch_size in the JSON file?
4. Can we define only gradient_accumulation_steps and train_micro_batch_size_per_gpu and let DeepSpeed calculate train_batch_size automatically?
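For reference, the DeepSpeed configuration documentation relates the three settings as train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs. The sketch below only illustrates that arithmetic. The function name resolve_train_batch_size is hypothetical, and the sketch does not show whether a given DeepSpeed version fills in the missing value for you.

```python
def resolve_train_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size):
    """Effective global batch size implied by the three settings."""
    for name, value in (("micro_batch_per_gpu", micro_batch_per_gpu),
                        ("grad_accum_steps", grad_accum_steps),
                        ("world_size", world_size)):
        if value < 1:
            raise ValueError("{} must be a positive integer".format(name))
    return micro_batch_per_gpu * grad_accum_steps * world_size
```

For example, a micro-batch of 1 per GPU with 16 accumulation steps on 4 GPUs gives a train_batch_size of 64.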

pytorch gradient checkpointing is much better than deepspeed !

Hello,

I have a script that trains a 12-layer transformer model (about 85 million parameters) using gradient checkpointing. It works with a local batch size of 32 per Nvidia Titan GPU.
I tried DeepSpeed instead, and I always get OOM, even with a batch size of 8.

minimal code:
Initialization:

model = models.TransformerModel(ntokens, args.emsize, args.nhead, args.nhid, args.nlayers, args.dropout)

parameters = filter(lambda p: p.requires_grad, model.parameters())

model_engine, optimizer, _, _ = deepspeed.initialize(args=args, model=model, model_parameters=parameters)

Training:

with tqdm(total=int(args.log_interval),
          desc='Train Step     #{}-{}'.format(step + 1, step + args.log_interval),
          disable=False) as t:
    for batch_idx, batch in enumerate(datasetGenerator):
        # Move the batch tensors to this process's GPU.
        data = batch['input'].to(model_engine.local_rank)
        target = batch['target'].to(model_engine.local_rank)
        src_padding = batch['padding_mask'].to(model_engine.local_rank)

        output = model_engine(data, has_mask=False, src_key_padding_mask=src_padding.t())

        train_accuracy.update(accuracy(target, output))
        loss = criterion(output.view(-1, ntokens), target.view(-1))

        # DeepSpeed handles the backward pass and the optimizer step.
        model_engine.backward(loss)
        model_engine.step()

        t.set_postfix({'loss': train_loss.avg.item(),
                       'accuracy': 100. * train_accuracy.avg.item()})
        t.update(1)

Original Transformer code with gradient checkpointing:

def forward(self, src, mask=None, src_key_padding_mask=None):
        r"""Pass the input through the encoder layers in turn.
        Args:
            src: the sequence to the encoder (required).
            mask: the mask for the src sequence (optional).
            src_key_padding_mask: the mask for the src keys per batch (optional).
        Shape:
            see the docs in Transformer class.
        """
        output = src

        for i in range(self.num_layers):
            #output = self.layers[i](output, src_mask=mask,
            #                        src_key_padding_mask=src_key_padding_mask)
            output = checkpoint(self.layers[i], output, mask, src_key_padding_mask)


        if self.norm:
            output = self.norm(output)

        return output

The working batch size for the dataloader is only 4.
Any idea how I can achieve the same batch size with DeepSpeed as with gradient checkpointing?

DeepSpeed using DistributedSampler with model parallelism

DeepSpeed's data loader will use DistributedSampler by default unless another is provided:

DeepSpeed/deepspeed/pt/deepspeed_dataloader.py

Line 43 in 001abe2

data_sampler = DistributedSampler(dataset)

If DeepSpeed is configured with model parallelism, or called from a library with a sub-group of the world processes, the default behavior of DistributedSampler is incorrect because it queries the global world size and rank information. We should specify num_replicas and rank when creating the sampler.

If mpu is provided to deepspeed.initialize(), we should query mpu.get_data_parallel_world_size() and mpu.get_data_parallel_rank() and forward that information to the sampler.
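The proposal above could be sketched as follows. The helper sampler_kwargs and the StubMPU class are hypothetical. The only names taken from the text are mpu.get_data_parallel_world_size() and mpu.get_data_parallel_rank(), plus the num_replicas and rank parameters of DistributedSampler.

```python
class StubMPU:
    """Hypothetical model-parallel unit exposing the two methods named above."""

    def __init__(self, dp_world_size, dp_rank):
        self._dp_world_size = dp_world_size
        self._dp_rank = dp_rank

    def get_data_parallel_world_size(self):
        return self._dp_world_size

    def get_data_parallel_rank(self):
        return self._dp_rank


def sampler_kwargs(mpu=None):
    """Build keyword arguments for DistributedSampler.

    Without an mpu, return no overrides so the sampler falls back to the
    global world size and rank. With an mpu, use its data-parallel view.
    """
    if mpu is None:
        return {}
    return {
        "num_replicas": mpu.get_data_parallel_world_size(),
        "rank": mpu.get_data_parallel_rank(),
    }
```

The data loader would then create the sampler as DistributedSampler(dataset, **sampler_kwargs(mpu)), so that each data-parallel replica reads a distinct shard while model-parallel peers read the same one.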


	if self.fp16_enabled() and 'max_grad_norm' in optimizer_parameters.keys():
		optimizer_parameters['max_grad_norm'] = 0.0