
Comments (4)

ericharper commented on July 4, 2024

@maanug-nv, could you look at this one?

PurvangL commented on July 4, 2024

@ericharper, @maanug-nv:
I also tried running on a Slurm cluster; please find the logs below.

[NeMo I 2024-05-23 11:06:21 megatron_gpt_sft:176]                                                                                                                                                                     
    name: megatron_gpt_sft                                                                                                                                                                                            
    trainer:                                                                                                                                                                                                          
      devices: 8
      accelerator: gpu
      num_nodes: 2
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 50
      log_every_n_steps: 10
      val_check_interval: 1.0
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: /workspace/result
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_loss
        save_top_k: 2
        mode: max
        save_nemo_on_train_end: true
        filename: megatron_gpt_sft--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}-{consumed_samples}
        model_parallel_size: ${model.tensor_model_parallel_size}
        save_best_model: false
    model:
      seed: 1234
      tensor_model_parallel_size: 4
      pipeline_model_parallel_size: 4
      global_batch_size: 128
      micro_batch_size: 1
      restore_from_path: /workspace/llama27b.nemo
      resume_from_checkpoint: null
      save_nemo_on_validation_end: true
      sync_batch_comm: false
      megatron_amp_O2: true
      sequence_parallel: true
      activations_checkpoint_granularity: selective
      activations_checkpoint_method: uniform  
      activations_checkpoint_num_layers: null
      activations_checkpoint_layers_per_pipeline: null
      answer_only_loss: true
      gradient_as_bucket_view: false
      seq_len_interpolation_factor: null
      use_flash_attention: null
      hidden_dropout: 0.0
      attention_dropout: 0.0
      ffn_dropout: 0.0
      data:
        chat: false
        chat_prompt_tokens:
          system_turn_start: <extra_id_0>
          turn_start: <extra_id_1>
          label_start: <extra_id_2>
          end_of_turn: '
     
            '
          end_of_name: '
     
            '
        train_ds:
          file_names:
          - /workspace/self_instruct_data/training.jsonl
          global_batch_size: 128
          micro_batch_size: 1
          shuffle: true
          num_workers: 0
          memmap_workers: null
          pin_memory: true
          max_seq_length: 512
          min_seq_length: 1
          drop_last: true
          concat_sampling_probabilities:
          - 1
          label_key: output
          add_eos: true
          add_sep: false
          add_bos: false
          truncation_field: input
          index_mapping_dir: null
          prompt_template: '{input} {output}'
          hf_dataset: false
          truncation_method: right
        validation_ds:
          file_names:
          - /workspace/self_instruct_data/validation.jsonl
          names: null
          global_batch_size: 128
          micro_batch_size: 1
          shuffle: false
          num_workers: 0
          memmap_workers: ${model.data.train_ds.memmap_workers}
          pin_memory: true
          max_seq_length: 512
          min_seq_length: 1
          drop_last: false
          label_key: ${model.data.train_ds.label_key} 
          add_eos: ${model.data.train_ds.add_eos}
          add_sep: ${model.data.train_ds.add_sep}
          add_bos: ${model.data.train_ds.add_bos}
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: ${model.data.train_ds.truncation_field}
          index_mapping_dir: null
          prompt_template: ${model.data.train_ds.prompt_template}
          tokens_to_generate: 32
          hf_dataset: false
          truncation_method: right
          metric:
            name: loss
            average: null
            num_classes: null
        test_ds:
          file_names:
          - /workspace/self_instruct_data/test.jsonl
          names: null
          global_batch_size: 256
          micro_batch_size: 1
          shuffle: false
          num_workers: 0
          memmap_workers: ${model.data.train_ds.memmap_workers}
          pin_memory: true
          max_seq_length: ${model.data.train_ds.max_seq_length}
          min_seq_length: 1
          drop_last: false
          label_key: ${model.data.train_ds.label_key} 
          add_eos: ${model.data.train_ds.add_eos}
          add_sep: ${model.data.train_ds.add_sep}
          add_bos: ${model.data.train_ds.add_bos}
          write_predictions_to_file: false
          output_file_path_prefix: null
          truncation_field: ${model.data.train_ds.truncation_field}
          index_mapping_dir: null
          prompt_template: ${model.data.train_ds.prompt_template}
          tokens_to_generate: 32
          hf_dataset: false
          truncation_method: right
          metric:
            name: loss
            average: null
            num_classes: null
      optim:
        name: distributed_fused_adam
        lr: 5.0e-06
        weight_decay: 0.01
        betas:
        - 0.9
        - 0.98
    inference:
      greedy: true
      top_k: 0
      top_p: 0.9
      temperature: 1.0
      all_probs: false
      repetition_penalty: 1.2
      min_tokens_to_generate: 0
      compute_logprob: false
      compute_attention_mask: true
    cluster_type: BCP
 
[NeMo W 2024-05-23 11:06:21 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:554: UserWarning: bf16 is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
      rank_zero_warn(
     
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
[NeMo E 2024-05-23 11:06:21 exp_manager:556] You are running multi-node training without SLURM handling the processes. Please note that this is not tested in NeMo and could result in errors.
[NeMo W 2024-05-23 11:06:21 exp_manager:708] Exp_manager is logging to /workspace/result, but it already exists.
[NeMo W 2024-05-23 11:06:21 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/result/checkpoints. Training from scratch.
[NeMo I 2024-05-23 11:06:21 exp_manager:396] Experiments will be logged at /workspace/result
[NeMo I 2024-05-23 11:06:21 exp_manager:856] TensorboardLogger has been set up
[NeMo W 2024-05-23 11:06:21 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo I 2024-05-23 11:06:21 megatron_gpt_sft:213] Resuming training from checkpoint: None
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[NeMo I 2024-05-23 11:06:28 megatron_init:253] Rank 0 has data parallel group : [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:259] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:264] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:267] Ranks 0 has data parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:284] Rank 0 has context parallel group: [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:287] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:288] Ranks 0 has context parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:299] Rank 0 has model parallel group: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[NeMo I 2024-05-23 11:06:28 megatron_init:300] All model parallel group ranks: [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:310] Rank 0 has tensor model parallel group: [0, 1, 2, 3]
[NeMo I 2024-05-23 11:06:28 megatron_init:314] All tensor model parallel group ranks: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:315] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:344] Rank 0 has pipeline model parallel group: [0, 4, 8, 12]
[NeMo I 2024-05-23 11:06:28 megatron_init:356] Rank 0 has embedding group: [0, 12]
[NeMo I 2024-05-23 11:06:28 megatron_init:362] All pipeline model parallel group ranks: [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:363] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-05-23 11:06:28 megatron_init:364] All embedding group ranks: [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:365] Rank 0 has embedding rank: 0
24-05-23 11:06:28 - PID:154683 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 128
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-05-23 11:06:28 tokenizer_utils:185] Getting SentencePiece with model: /tmp/tmpyuitwp3o/a290efe8ded54b8da6a27eb8ecea4895_tokenizer.model
[NeMo I 2024-05-23 11:06:28 megatron_base_model:574] Padded vocab_size: 32256, original vocab_size: 32000, dummy tokens: 256.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:489] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: add_qkv_bias in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: rotary_interleaved in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: window_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: memory_efficient_layer_norm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
 
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:12312.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:12312.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 12312.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/16
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/16
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/16
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
Matplotlib created a temporary cache directory at /tmp/matplotlib-518pojrm because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-qze86_xq because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-rrzai_1n because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-449boyjo because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-9w_wgl4h because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-d5pwia0k because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-w0euwkph because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-wgywjpl6 because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
…. Continue retrying
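
For reference, the parallel layout implied by the config above works out as follows. This is only a sketch with the numbers taken from the pasted config and log, not NeMo code, but it explains the single 16-rank model-parallel group and the "setting number of micro-batches to constant 128" line:

# Sketch of the parallel layout implied by the config above (not NeMo code).
num_nodes = 2
gpus_per_node = 8
world_size = num_nodes * gpus_per_node                     # 16, matching "MEMBER: x/16"

tensor_parallel = 4                                        # tensor_model_parallel_size
pipeline_parallel = 4                                      # pipeline_model_parallel_size
data_parallel = world_size // (tensor_parallel * pipeline_parallel)   # = 1

global_batch_size = 128
micro_batch_size = 1
# With data_parallel == 1, each global batch is accumulated from 128 micro-batches,
# which matches the microbatches.py log line above.
num_micro_batches = global_batch_size // (micro_batch_size * data_parallel)

assert world_size == tensor_parallel * pipeline_parallel * data_parallel
print(world_size, data_parallel, num_micro_batches)        # 16 1 128

So all 16 ranks sit in one model-parallel group, data parallelism is 1, and the run then stalls at the c10d rendezvous with the clients repeatedly logging that no socket is listening on port 12312.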

PurvangL commented on July 4, 2024

@ericharper @maanug-nv, following up on the issue posted above. Please let me know if any other information is needed.

maanug-nv commented on July 4, 2024

Hi @PurvangL, I see you've closed this issue. Were you able to resolve it?
I haven't had time to reproduce this issue with SFT, but I have encountered long init times with pretraining that can look like hangs before training eventually starts.
Sorry for the lack of response; if I can get around to reproducing this specific case, I'll let you know. We are also looking into these long init times.
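
One quick way to separate a genuinely unreachable rendezvous endpoint from slow initialization is to confirm that rank 0's TCPStore port can be reached from the other node before launching. Below is a minimal sketch using only the Python standard library (not NeMo or PyTorch API); MASTER_ADDR and MASTER_PORT are assumed to be exported by the launcher, and the 12312 default simply mirrors the port in the log above. Setting NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL can also help make slow-but-progressing initialization visible.

# Hypothetical pre-flight check: run on the second node to verify that
# rank 0's rendezvous endpoint (MASTER_ADDR:MASTER_PORT) is reachable.
import os
import socket
import sys

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    # Attempt a plain TCP connection to the rendezvous endpoint.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as err:
        print(f"cannot connect to {host}:{port}: {err}", file=sys.stderr)
        return False

if __name__ == "__main__":
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "12312"))
    print(f"{host}:{port} reachable: {can_reach(host, port)}")

If the check fails from the second node, the repeated "No socket on (x.x.x.x, 12312) is listening yet, will retry" messages point to a networking or address-resolution problem rather than slow initialization.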
