Comments (4)
@maanug-nv, could you look at this one?
from nemo.
@ericharper, @maanug-nv:
I also tried running on a Slurm cluster; please find the logs below.
[NeMo I 2024-05-23 11:06:21 megatron_gpt_sft:176]
name: megatron_gpt_sft
trainer:
  devices: 8
  accelerator: gpu
  num_nodes: 2
  precision: bf16
  logger: false
  enable_checkpointing: false
  use_distributed_sampler: false
  max_epochs: 9999
  max_steps: 50
  log_every_n_steps: 10
  val_check_interval: 1.0
  gradient_clip_val: 1.0
exp_manager:
  explicit_log_dir: /workspace/result
  exp_dir: null
  name: ${name}
  create_wandb_logger: false
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: validation_loss
    save_top_k: 2
    mode: max
    save_nemo_on_train_end: true
    filename: megatron_gpt_sft--{${exp_manager.checkpoint_callback_params.monitor}:.3f}-{step}-{consumed_samples}
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: false
model:
  seed: 1234
  tensor_model_parallel_size: 4
  pipeline_model_parallel_size: 4
  global_batch_size: 128
  micro_batch_size: 1
  restore_from_path: /workspace/llama27b.nemo
  resume_from_checkpoint: null
  save_nemo_on_validation_end: true
  sync_batch_comm: false
  megatron_amp_O2: true
  sequence_parallel: true
  activations_checkpoint_granularity: selective
  activations_checkpoint_method: uniform
  activations_checkpoint_num_layers: null
  activations_checkpoint_layers_per_pipeline: null
  answer_only_loss: true
  gradient_as_bucket_view: false
  seq_len_interpolation_factor: null
  use_flash_attention: null
  hidden_dropout: 0.0
  attention_dropout: 0.0
  ffn_dropout: 0.0
  data:
    chat: false
    chat_prompt_tokens:
      system_turn_start: <extra_id_0>
      turn_start: <extra_id_1>
      label_start: <extra_id_2>
      end_of_turn: "\n"
      end_of_name: "\n"
    train_ds:
      file_names:
      - /workspace/self_instruct_data/training.jsonl
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: true
      num_workers: 0
      memmap_workers: null
      pin_memory: true
      max_seq_length: 512
      min_seq_length: 1
      drop_last: true
      concat_sampling_probabilities:
      - 1
      label_key: output
      add_eos: true
      add_sep: false
      add_bos: false
      truncation_field: input
      index_mapping_dir: null
      prompt_template: '{input} {output}'
      hf_dataset: false
      truncation_method: right
    validation_ds:
      file_names:
      - /workspace/self_instruct_data/validation.jsonl
      names: null
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: false
      num_workers: 0
      memmap_workers: ${model.data.train_ds.memmap_workers}
      pin_memory: true
      max_seq_length: 512
      min_seq_length: 1
      drop_last: false
      label_key: ${model.data.train_ds.label_key}
      add_eos: ${model.data.train_ds.add_eos}
      add_sep: ${model.data.train_ds.add_sep}
      add_bos: ${model.data.train_ds.add_bos}
      write_predictions_to_file: false
      output_file_path_prefix: null
      truncation_field: ${model.data.train_ds.truncation_field}
      index_mapping_dir: null
      prompt_template: ${model.data.train_ds.prompt_template}
      tokens_to_generate: 32
      hf_dataset: false
      truncation_method: right
      metric:
        name: loss
        average: null
        num_classes: null
    test_ds:
      file_names:
      - /workspace/self_instruct_data/test.jsonl
      names: null
      global_batch_size: 256
      micro_batch_size: 1
      shuffle: false
      num_workers: 0
      memmap_workers: ${model.data.train_ds.memmap_workers}
      pin_memory: true
      max_seq_length: ${model.data.train_ds.max_seq_length}
      min_seq_length: 1
      drop_last: false
      label_key: ${model.data.train_ds.label_key}
      add_eos: ${model.data.train_ds.add_eos}
      add_sep: ${model.data.train_ds.add_sep}
      add_bos: ${model.data.train_ds.add_bos}
      write_predictions_to_file: false
      output_file_path_prefix: null
      truncation_field: ${model.data.train_ds.truncation_field}
      index_mapping_dir: null
      prompt_template: ${model.data.train_ds.prompt_template}
      tokens_to_generate: 32
      hf_dataset: false
      truncation_method: right
      metric:
        name: loss
        average: null
        num_classes: null
  optim:
    name: distributed_fused_adam
    lr: 5.0e-06
    weight_decay: 0.01
    betas:
    - 0.9
    - 0.98
inference:
  greedy: true
  top_k: 0
  top_p: 0.9
  temperature: 1.0
  all_probs: false
  repetition_penalty: 1.2
  min_tokens_to_generate: 0
  compute_logprob: false
  compute_attention_mask: true
cluster_type: BCP
[NeMo W 2024-05-23 11:06:21 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/lightning_fabric/connector.py:554: UserWarning: bf16 is supported for historical reasons but its usage is discouraged. Please set your precision to bf16-mixed instead!
rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
[NeMo E 2024-05-23 11:06:21 exp_manager:556] You are running multi-node training without SLURM handling the processes. Please note that this is not tested in NeMo and could result in errors.
[NeMo W 2024-05-23 11:06:21 exp_manager:708] Exp_manager is logging to /workspace/result, but it already exists.
[NeMo W 2024-05-23 11:06:21 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/result/checkpoints. Training from scratch.
[NeMo I 2024-05-23 11:06:21 exp_manager:396] Experiments will be logged at /workspace/result
[NeMo I 2024-05-23 11:06:21 exp_manager:856] TensorboardLogger has been set up
[NeMo W 2024-05-23 11:06:21 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo I 2024-05-23 11:06:21 megatron_gpt_sft:213] Resuming training from checkpoint: None
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[NeMo I 2024-05-23 11:06:28 megatron_init:253] Rank 0 has data parallel group : [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:259] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:264] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:267] Ranks 0 has data parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:284] Rank 0 has context parallel group: [0]
[NeMo I 2024-05-23 11:06:28 megatron_init:287] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:288] Ranks 0 has context parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:299] Rank 0 has model parallel group: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[NeMo I 2024-05-23 11:06:28 megatron_init:300] All model parallel group ranks: [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:310] Rank 0 has tensor model parallel group: [0, 1, 2, 3]
[NeMo I 2024-05-23 11:06:28 megatron_init:314] All tensor model parallel group ranks: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:315] Rank 0 has tensor model parallel rank: 0
[NeMo I 2024-05-23 11:06:28 megatron_init:344] Rank 0 has pipeline model parallel group: [0, 4, 8, 12]
[NeMo I 2024-05-23 11:06:28 megatron_init:356] Rank 0 has embedding group: [0, 12]
[NeMo I 2024-05-23 11:06:28 megatron_init:362] All pipeline model parallel group ranks: [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:363] Rank 0 has pipeline model parallel rank 0
[NeMo I 2024-05-23 11:06:28 megatron_init:364] All embedding group ranks: [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
[NeMo I 2024-05-23 11:06:28 megatron_init:365] Rank 0 has embedding rank: 0
24-05-23 11:06:28 - PID:154683 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 128
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-05-23 11:06:28 tokenizer_utils:185] Getting SentencePiece with model: /tmp/tmpyuitwp3o/a290efe8ded54b8da6a27eb8ecea4895_tokenizer.model
[NeMo I 2024-05-23 11:06:28 megatron_base_model:574] Padded vocab_size: 32256, original vocab_size: 32000, dummy tokens: 256.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_num_layers in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: _cpu_offloading_context in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_activations in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: cpu_offloading_weights in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:1139] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:489] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: add_qkv_bias in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: rotary_interleaved in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: window_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: memory_efficient_layer_norm in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-05-23 11:06:28 megatron_base_model:546] The model: MegatronGPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:12312.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:12312.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 12312.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/16
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/16
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/16
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
Matplotlib created a temporary cache directory at /tmp/matplotlib-518pojrm because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-qze86_xq because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-rrzai_1n because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-449boyjo because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-9w_wgl4h because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-d5pwia0k because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-w0euwkph because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Matplotlib created a temporary cache directory at /tmp/matplotlib-wgywjpl6 because the default path (/root/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/16
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (x.x.x.x, 12312).
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
[I socket.cpp:813] [c10d] No socket on (x.x.x.x, 12312) is listening yet, will retry.
… (the retries continue)
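The parallel layout reported in the log above follows from the config: 2 nodes × 8 devices give 16 ranks, and tensor parallel 4 × pipeline parallel 4 uses all 16, which leaves a data-parallel size of 1 and 128 micro-batches per step. The sketch below reproduces that arithmetic and the group lists. `parallel_layout` is a hypothetical helper, not a NeMo or Megatron API, and it assumes the default rank ordering with no context parallelism.

```python
def parallel_layout(world_size, tp, pp, global_batch_size, micro_batch_size):
    """Hypothetical helper: derive parallel sizes and rank groups.

    Assumes tensor-parallel ranks are consecutive and pipeline-parallel
    ranks are strided by world_size // pp (no context parallelism).
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    dp = world_size // model_parallel

    samples_per_microstep = micro_batch_size * dp
    if global_batch_size % samples_per_microstep != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * dp")
    num_microbatches = global_batch_size // samples_per_microstep

    # Tensor-parallel groups: blocks of `tp` consecutive ranks.
    tp_groups = [list(range(start, start + tp)) for start in range(0, world_size, tp)]

    # Pipeline-parallel groups: ranks spaced world_size // pp apart.
    stride = world_size // pp
    pp_groups = [list(range(first, world_size, stride)) for first in range(stride)]

    return {
        "data_parallel_size": dp,
        "num_microbatches": num_microbatches,
        "tp_groups": tp_groups,
        "pp_groups": pp_groups,
    }


# Values taken from the config in this comment: 2 nodes x 8 devices, TP=4, PP=4.
layout = parallel_layout(world_size=16, tp=4, pp=4,
                         global_batch_size=128, micro_batch_size=1)
for key, value in layout.items():
    print(key, value)
```

With these inputs the helper returns the same tensor and pipeline groups that `megatron_init` logs for rank 0, so the config itself is internally consistent for 16 ranks.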
@ericharper @maanug-nv, following up on the issue posted above. Please let me know if any other information is needed.
Hi @PurvangL, I see you've closed this issue. Were you able to resolve it?
I haven't had time to reproduce this issue with SFT, but I've encountered long init times with pretraining that can look like hangs, although training eventually starts.
Sorry for the lack of response. If I get around to reproducing this specific case, I'll let you know. We are also looking into these long init times.
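One way to tell a slow init apart from an unreachable rendezvous endpoint is to probe the TCPStore address from a node other than rank 0 while the job is waiting. The following is a minimal sketch using only the Python standard library. `can_connect` is a hypothetical helper, and reading the address from `MASTER_ADDR` and `MASTER_PORT` is an assumption about how the job was launched; 12312 is the port that appears in the log above.

```python
import os
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: the launcher exports MASTER_ADDR / MASTER_PORT for the job.
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "12312"))
    status = "reachable" if can_connect(host, port) else "not reachable"
    print(f"{host}:{port} is {status}")
```

If the probe keeps failing from the second node while rank 0 reports that its server socket is listening, the cause is more likely networking (firewall, wrong interface, or wrong address) than a slow model init.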