I am receiving an error during 3KG pretraining that occurs only during distributed (multi-GPU) training; single-GPU training works fine.
> fairseq-hydra-train task.data=./manifest/total_all_patient_ids --config-dir ./fairseq-signals/examples/3kg/config/pretraining/ecg_transformer --config-name 3kg
hydra_main - cfg
{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': json, 'log_file': None, 'wandb_project': None, 'wandb_entity': None, 'seed': 1, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'profile': False, 'reset_logging': False, 'suppress_crashes': False}, 'common_eval': {'_name': None, 'path': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 3, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': 12355, 'device_id': 0, 'ddp_comm_hook': none, 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'fp16': '${common.fp16}', 'memory_efficient_fp16': '${common.memory_efficient_fp16}'}, 'dataset': {'_name': None, 'num_workers': 6, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 64, 'required_batch_size_multiple': 8, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': '', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': True, 'max_tokens_valid': '${dataset.max_tokens}', 'batch_size_valid': '${dataset.batch_size}', 'max_valid_steps': None, 'curriculum': 0, 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 200, 'max_update': 0, 'lr': [5e-05], 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'update_freq': [1], 'stop_min_lr': -1.0}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': 1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False}, 'model': {'_name': 'ecg_transformer', 'apply_mask': False, 'dropout_input': 0.1, 'dropout_features': 0.1, 'feature_grad_mult': 0.1, 'encoder_embed_dim': 768, 'in_d': 12}, 'task': {'_name': 'ecg_pretraining', 'data': './manifest/total_all_patient_ids', 'normalize': False, 'enable_padding': True, 'inferred_3kg_config': {'angle': 45, 'scale': 1.5, 'mask_ratio': 0.5}}, 'criterion': {'_name': '3kg'}, 'lr_scheduler': {'_name': 'fixed', 'warmup_updates': 0}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.01}}
2023-07-24 10:06:59 | INFO | fairseq_signals.distributed.utils | disrtibuted init (rank 2): tcp://localhost:18583
2023-07-24 10:06:59 | INFO | fairseq_signals.distributed.utils | disrtibuted init (rank 0): tcp://localhost:18583
2023-07-24 10:06:59 | INFO | fairseq_signals.distributed.utils | disrtibuted init (rank 1): tcp://localhost:18583
2023-07-24 10:06:59 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
2023-07-24 10:07:00 | INFO | fairseq_signals.distributed.utils | initialized host node136.uhnh4h.cluster as rank 0
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
2023-07-24 10:07:00 | INFO | fairseq_signals.distributed.utils | initialized host node136.uhnh4h.cluster as rank 1
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
2023-07-24 10:07:00 | INFO | fairseq_signals.distributed.utils | initialized host node136.uhnh4h.cluster as rank 2
[2023-07-24 10:07:09,327][fairseq_cli.train][INFO] - {'_name': None,
'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': 1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False},
'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'json', 'log_file': None, 'wandb_project': None, 'wandb_entity': None, 'seed': 1, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'profile': False, 'reset_logging': False, 'suppress_crashes': False},
'common_eval': {'_name': None, 'path': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None},
'criterion': {'_name': '3kg', 'temp': 0.1, 'eps': 1e-08},
'dataset': {'_name': None, 'num_workers': 6, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 64, 'required_batch_size_multiple': 8, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': '', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': True, 'max_tokens_valid': None, 'batch_size_valid': 64, 'max_valid_steps': None, 'curriculum': 0, 'num_shards': 1, 'shard_id': 0},
'distributed_training': {'_name': None, 'distributed_world_size': 3, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://localhost:18583', 'distributed_port': 12355, 'device_id': 0, 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'fp16': False, 'memory_efficient_fp16': False},
'job_logging_cfg': {'version': 1, 'formatters': {'simple': {'format': '[%(asctime)s][%(name)s][%(levelname)s] - %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'stream': 'ext://sys.stdout'}, 'file': {'class': 'logging.FileHandler', 'formatter': 'simple', 'filename': 'hydra_train.log'}}, 'root': {'level': 'INFO', 'handlers': ['console', 'file']}, 'disable_existing_loggers': False},
'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [5e-05]},
'model': {'_name': 'ecg_transformer', 'normalize': False, 'filter': False, 'data': './manifest/total_all_patient_ids', 'args': None, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'layer_norm_first': False, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.1, 'dropout_features': 0.1, 'apply_mask': False, 'mask_length': 10, 'mask_prob': 0.0, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'extractor_mode': 'default', 'conv_feature_layers': '[(256, 2, 2)] * 4', 'in_d': 12, 'conv_bias': False, 'feature_grad_mult': 0.1, 'conv_pos': 128, 'conv_pos_groups': 16},
'optimization': {'_name': None, 'max_epoch': 200, 'max_update': 0, 'lr': [5e-05], 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'update_freq': [1], 'stop_min_lr': -1.0},
'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.01, 'use_old_adam': False, 'lr': [5e-05]},
'task': {'_name': 'ecg_pretraining', 'data': './manifest/total_all_patient_ids', 'leads_to_load': None, 'leads_bucket': None, 'bucket_selection': 'uniform', 'sample_rate': None, 'filter': False, 'normalize': False, 'mean_path': None, 'std_path': None, 'enable_padding': True, 'enable_padding_leads': False, 'max_sample_size': None, 'min_sample_size': None, 'num_batch_buckets': 0, 'precompute_mask_indices': False, 'perturbation_mode': None, 'p': [1.0], 'max_amplitude': 0.1, 'min_amplitude': 0.0, 'dependency': True, 'shift_ratio': 0.2, 'num_segment': 1, 'max_freq': 0.2, 'min_freq': 0.01, 'k': 3, 'mask_leads_selection': 'random', 'mask_leads_prob': 0.5, 'mask_leads_condition': [4, 5], 'inferred_w2v_config': None, 'inferred_3kg_config': {'angle': 45, 'scale': 1.5, 'mask_ratio': 0.5}, 'criterion_name': '3kg', 'model_name': None, 'clocs_mode': None}}
[2023-07-24 10:07:11,099][fairseq_cli.train][INFO] - ECGTransformerModel(
(dropout_input): Dropout(p=0.1, inplace=False)
(dropout_features): Dropout(p=0.1, inplace=False)
(encoder): TransformerEncoder(
(layers): ModuleList(
(0-11): 12 x TransformerEncoderLayer(
(self_attn): MultiHeadAttention(
(dropout): Dropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(feature_extractor): ConvFeatureExtraction(
(conv_layers): ModuleList(
(0): Sequential(
(0): Conv1d(12, 256, kernel_size=(2,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): Fp32GroupNorm(256, 256, eps=1e-05, affine=True)
(3): GELU(approximate='none')
)
(1-3): 3 x Sequential(
(0): Conv1d(256, 256, kernel_size=(2,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): GELU(approximate='none')
)
)
)
(post_extract_proj): Linear(in_features=256, out_features=768, bias=True)
(conv_pos): ConvPositionalEncoding(
(pos_conv): Sequential(
(0): Conv1d(768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16)
(1): SamePad()
(2): GELU(approximate='none')
)
)
(layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
[2023-07-24 10:07:11,101][fairseq_cli.train][INFO] - task: ECGPretrainingTask
[2023-07-24 10:07:11,101][fairseq_cli.train][INFO] - model: ECGTransformerModel
[2023-07-24 10:07:11,101][fairseq_cli.train][INFO] - criterion: ThreeKGCriterion
[2023-07-24 10:07:11,103][fairseq_cli.train][INFO] - num. shared model params: 90,373,248 (num. trained: 90,373,248)
[2023-07-24 10:07:11,104][fairseq_cli.train][INFO] - num. expert model params: 0 (num. trained: 0)
[2023-07-24 10:07:11,175][fairseq_signals.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.1.0.bias
[2023-07-24 10:07:11,176][fairseq_signals.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.2.0.bias
[2023-07-24 10:07:11,176][fairseq_signals.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.3.0.bias
[2023-07-24 10:07:11,177][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 0
[2023-07-24 10:07:11,228][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 3 nodes.
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - ***********************CUDA enviroments for all 3 workers***********************
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - rank 0: capabilities = 8.0 ; total memory = 79.199 GB ; name = NVIDIA A100 80GB PCIe
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - rank 1: capabilities = 8.0 ; total memory = 79.199 GB ; name = NVIDIA A100 80GB PCIe
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - rank 2: capabilities = 8.0 ; total memory = 79.199 GB ; name = NVIDIA A100 80GB PCIe
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - ***********************CUDA enviroments for all 3 workers***********************
[2023-07-24 10:07:11,478][fairseq_cli.train][INFO] - training on 3 devices (GPUs)
[2023-07-24 10:07:11,479][fairseq_cli.train][INFO] - max tokens per device = None and signals per device = 64
[2023-07-24 10:07:11,479][fairseq_signals.trainer][INFO] - Preparing to load checkpoint checkpoints/checkpoint_last.pt
[2023-07-24 10:07:11,479][fairseq_signals.trainer][INFO] - No existing checkpoint found ()
[2023-07-24 10:07:11,479][fairseq_signals.trainer][INFO] - loading train data for epoch 1
[2023-07-24 10:07:12,133][fairseq_signals.data.ecg.raw_ecg_dataset][INFO] - loaded 760412, skipped 0 samples
[2023-07-24 10:07:23,789][fairseq_signals.trainer][INFO] - NOTE: your device may support faster training with --fp16
[2023-07-24 10:07:23,799][fairseq_signals.trainer][INFO] - begin training epoch 1
[2023-07-24 10:07:23,800][fairseq_cli.train][INFO] - Start iterating over samples
Traceback (most recent call last):
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
lambda: hydra.run(
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 107, in run
return run_job(
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/hydra/core/utils.py", line 129, in run_job
ret.return_value = task_function(task_cfg)
File "/home/fairseq-signals/fairseq_cli/hydra_train.py", line 50, in hydra_main
distributed_utils.call_main(cfg, pre_main)
File "/home/fairseq-signals/fairseq_signals/distributed/utils.py", line 126, in call_main
torch.multiprocessing.spawn(
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/fairseq-signals/fairseq_signals/distributed/utils.py", line 112, in distributed_main
main(cfg, **kwargs)
File "/home/fairseq-signals/fairseq_cli/train.py", line 160, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/home/anaconda3/envs/ecg_env/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/fairseq-signals/fairseq_cli/train.py", line 264, in train
log_output = trainer.train_step(samples)
File "/home/anaconda3/envs/ecg_env/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/fairseq-signals/fairseq_signals/trainer.py", line 569, in train_step
raise e
File "/home/fairseq-signals/fairseq_signals/trainer.py", line 537, in train_step
loss, sample_size_i, logging_output = self.task.train_step(
File "/home/fairseq-signals/fairseq_signals/tasks/task.py", line 333, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fairseq-signals/fairseq_signals/criterions/3kg_criterion.py", line 73, in forward
pos_mask = torch.masked_select(
RuntimeError: The size of tensor a (128) must match the size of tensor b (384) at non-singleton dimension 1
Again, the training works fine when using a single GPU. The mismatch consistently follows dim(b) = dim(a) × n_gpu: here a = 128 (presumably 2 views × batch_size 64) and b = 384 = 128 × 3 GPUs, and the same relationship holds with 2 GPUs and with other batch sizes. So perhaps the loss calculation is not adapted to run under a DistributedDataParallel setup? A minimal sketch of what I suspect is happening is below.
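For reference, here is a hypothetical standalone repro (my own reconstruction, not the actual `fairseq_signals/criterions/3kg_criterion.py` code). The assumption is that patient IDs get all-gathered across ranks while the similarity matrix stays local; the names `sim`, `global_ids`, and `n_views` are mine:

```python
# Hypothetical repro sketch -- NOT the actual 3kg_criterion.py code.
import torch

world_size = 3    # distributed_world_size from the config above
local_batch = 64  # dataset.batch_size (per device)
n_views = 2       # assumed: two augmented views per ECG, since 2 * 64 = 128

# Each rank scores similarities only over its local batch: (128, 128).
local_feats = torch.randn(n_views * local_batch, 768)
sim = local_feats @ local_feats.t()

# If patient IDs are all-gathered across ranks before the positive mask is
# built, the mask spans the global batch: 2 * 64 * 3 = 384.
global_ids = torch.randint(0, 1000, (n_views * local_batch * world_size,))
pos_mask = global_ids.unsqueeze(0) == global_ids.unsqueeze(1)  # (384, 384)

# Broadcasting (128, 128) against (384, 384) fails like the traceback above.
try:
    torch.masked_select(sim, pos_mask)
except RuntimeError as e:
    print(e)  # "The size of tensor a (128) must match ... (384) ..."
```

If that is indeed the cause, one common remedy in contrastive DDP code is to all-gather the features as well, so the similarity matrix also covers the global batch. This is only a sketch of that pattern under the same assumption, not a patch against fairseq-signals, and `gather_with_grad` is a name I made up:

```python
import torch
import torch.distributed as dist

def gather_with_grad(local_feats: torch.Tensor) -> torch.Tensor:
    """All-gather features so every rank scores against the global batch.

    Plain all_gather returns tensors detached from autograd, so the local
    slice is substituted back in to keep gradients flowing on this rank.
    """
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_feats) for _ in range(world_size)]
    dist.all_gather(gathered, local_feats)
    gathered[dist.get_rank()] = local_feats  # restore grad-tracking tensor
    return torch.cat(gathered, dim=0)        # shape: (world_size * 2B, D)
```

With features gathered this way, `sim` would be (384, 384) on every rank and match the mask. Is the 3KG criterion intended to contrast against the global batch under DDP, or should the patient-ID gathering be restricted to the local batch instead?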