I am receiving an error during 3KG pretraining that occurs only during distributed (multi-GPU) training; single-GPU training works fine.
> fairseq-hydra-train task.data=./manifest/total_all_patient_ids --config-dir ./fairseq-signals/examples/3kg/config/pretraining/ecg_transformer --config-name 3kg
hydra_main - cfg
{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': json, 'log_file': None, 'wandb_project': None, 'wandb_entity': None, 'seed': 1, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'profile': False, 'reset_logging': False, 'suppress_crashes': False}, 'common_eval': {'_name': None, 'path': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 3, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': 12355, 'device_id': 0, 'ddp_comm_hook': none, 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'fp16': '${common.fp16}', 'memory_efficient_fp16': '${common.memory_efficient_fp16}'}, 'dataset': {'_name': None, 'num_workers': 6, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 64, 'required_batch_size_multiple': 8, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': '', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': True, 'max_tokens_valid': '${dataset.max_tokens}', 'batch_size_valid': '${dataset.batch_size}', 'max_valid_steps': None, 'curriculum': 0, 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 200, 'max_update': 0, 'lr': [5e-05], 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'update_freq': [1], 'stop_min_lr': -1.0}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': 1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False}, 'model': {'_name': 'ecg_transformer', 'apply_mask': False, 'dropout_input': 0.1, 'dropout_features': 0.1, 'feature_grad_mult': 0.1, 'encoder_embed_dim': 768, 'in_d': 12}, 'task': {'_name': 'ecg_pretraining', 'data': './manifest/total_all_patient_ids', 'normalize': False, 'enable_padding': True, 'inferred_3kg_config': {'angle': 45, 'scale': 1.5, 'mask_ratio': 0.5}}, 'criterion': {'_name': '3kg'}, 'lr_scheduler': {'_name': 'fixed', 'warmup_updates': 0}, 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.01}}
2023-07-24 10:06:59 | INFO | fairseq_signals.distributed.utils | disrtibuted init (rank 2): tcp://localhost:18583
2023-07-24 10:06:59 | INFO | fairseq_signals.distributed.utils | disrtibuted init (rank 0): tcp://localhost:18583
2023-07-24 10:06:59 | INFO | fairseq_signals.distributed.utils | disrtibuted init (rank 1): tcp://localhost:18583
2023-07-24 10:06:59 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 1
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 2
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
2023-07-24 10:07:00 | INFO | fairseq_signals.distributed.utils | initialized host node136.uhnh4h.cluster as rank 0
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
2023-07-24 10:07:00 | INFO | fairseq_signals.distributed.utils | initialized host node136.uhnh4h.cluster as rank 1
2023-07-24 10:07:00 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
2023-07-24 10:07:00 | INFO | fairseq_signals.distributed.utils | initialized host node136.uhnh4h.cluster as rank 2
[2023-07-24 10:07:09,327][fairseq_cli.train][INFO] - {'_name': None,
'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_interval_updates_pattern': -1, 'keep_last_epochs': 1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False},
'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'json', 'log_file': None, 'wandb_project': None, 'wandb_entity': None, 'seed': 1, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'profile': False, 'reset_logging': False, 'suppress_crashes': False},
'common_eval': {'_name': None, 'path': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None},
'criterion': {'_name': '3kg', 'temp': 0.1, 'eps': 1e-08},
'dataset': {'_name': None, 'num_workers': 6, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 64, 'required_batch_size_multiple': 8, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': '', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': False, 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': True, 'max_tokens_valid': None, 'batch_size_valid': 64, 'max_valid_steps': None, 'curriculum': 0, 'num_shards': 1, 'shard_id': 0},
'distributed_training': {'_name': None, 'distributed_world_size': 3, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://localhost:18583', 'distributed_port': 12355, 'device_id': 0, 'ddp_comm_hook': 'none', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'fp16': False, 'memory_efficient_fp16': False},
'job_logging_cfg': {'version': 1, 'formatters': {'simple': {'format': '[%(asctime)s][%(name)s][%(levelname)s] - %(message)s'}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'simple', 'stream': 'ext://sys.stdout'}, 'file': {'class': 'logging.FileHandler', 'formatter': 'simple', 'filename': 'hydra_train.log'}}, 'root': {'level': 'INFO', 'handlers': ['console', 'file']}, 'disable_existing_loggers': False},
'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [5e-05]},
'model': {'_name': 'ecg_transformer', 'normalize': False, 'filter': False, 'data': './manifest/total_all_patient_ids', 'args': None, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'layer_norm_first': False, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.1, 'dropout_features': 0.1, 'apply_mask': False, 'mask_length': 10, 'mask_prob': 0.0, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'extractor_mode': 'default', 'conv_feature_layers': '[(256, 2, 2)] * 4', 'in_d': 12, 'conv_bias': False, 'feature_grad_mult': 0.1, 'conv_pos': 128, 'conv_pos_groups': 16},
'optimization': {'_name': None, 'max_epoch': 200, 'max_update': 0, 'lr': [5e-05], 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'update_freq': [1], 'stop_min_lr': -1.0},
'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.01, 'use_old_adam': False, 'lr': [5e-05]},
'task': {'_name': 'ecg_pretraining', 'data': './manifest/total_all_patient_ids', 'leads_to_load': None, 'leads_bucket': None, 'bucket_selection': 'uniform', 'sample_rate': None, 'filter': False, 'normalize': False, 'mean_path': None, 'std_path': None, 'enable_padding': True, 'enable_padding_leads': False, 'max_sample_size': None, 'min_sample_size': None, 'num_batch_buckets': 0, 'precompute_mask_indices': False, 'perturbation_mode': None, 'p': [1.0], 'max_amplitude': 0.1, 'min_amplitude': 0.0, 'dependency': True, 'shift_ratio': 0.2, 'num_segment': 1, 'max_freq': 0.2, 'min_freq': 0.01, 'k': 3, 'mask_leads_selection': 'random', 'mask_leads_prob': 0.5, 'mask_leads_condition': [4, 5], 'inferred_w2v_config': None, 'inferred_3kg_config': {'angle': 45, 'scale': 1.5, 'mask_ratio': 0.5}, 'criterion_name': '3kg', 'model_name': None, 'clocs_mode': None}}
[2023-07-24 10:07:11,099][fairseq_cli.train][INFO] - ECGTransformerModel(
(dropout_input): Dropout(p=0.1, inplace=False)
(dropout_features): Dropout(p=0.1, inplace=False)
(encoder): TransformerEncoder(
(layers): ModuleList(
(0-11): 12 x TransformerEncoderLayer(
(self_attn): MultiHeadAttention(
(dropout): Dropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(feature_extractor): ConvFeatureExtraction(
(conv_layers): ModuleList(
(0): Sequential(
(0): Conv1d(12, 256, kernel_size=(2,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): Fp32GroupNorm(256, 256, eps=1e-05, affine=True)
(3): GELU(approximate='none')
)
(1-3): 3 x Sequential(
(0): Conv1d(256, 256, kernel_size=(2,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): GELU(approximate='none')
)
)
)
(post_extract_proj): Linear(in_features=256, out_features=768, bias=True)
(conv_pos): ConvPositionalEncoding(
(pos_conv): Sequential(
(0): Conv1d(768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16)
(1): SamePad()
(2): GELU(approximate='none')
)
)
(layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
[2023-07-24 10:07:11,101][fairseq_cli.train][INFO] - task: ECGPretrainingTask
[2023-07-24 10:07:11,101][fairseq_cli.train][INFO] - model: ECGTransformerModel
[2023-07-24 10:07:11,101][fairseq_cli.train][INFO] - criterion: ThreeKGCriterion
[2023-07-24 10:07:11,103][fairseq_cli.train][INFO] - num. shared model params: 90,373,248 (num. trained: 90,373,248)
[2023-07-24 10:07:11,104][fairseq_cli.train][INFO] - num. expert model params: 0 (num. trained: 0)
[2023-07-24 10:07:11,175][fairseq_signals.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.1.0.bias
[2023-07-24 10:07:11,176][fairseq_signals.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.2.0.bias
[2023-07-24 10:07:11,176][fairseq_signals.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.3.0.bias
[2023-07-24 10:07:11,177][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 0
[2023-07-24 10:07:11,228][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 3 nodes.
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - ***********************CUDA enviroments for all 3 workers***********************
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - rank 0: capabilities = 8.0 ; total memory = 79.199 GB ; name = NVIDIA A100 80GB PCIe
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - rank 1: capabilities = 8.0 ; total memory = 79.199 GB ; name = NVIDIA A100 80GB PCIe
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - rank 2: capabilities = 8.0 ; total memory = 79.199 GB ; name = NVIDIA A100 80GB PCIe
[2023-07-24 10:07:11,478][fairseq_signals.utils.utils][INFO] - ***********************CUDA enviroments for all 3 workers***********************
[2023-07-24 10:07:11,478][fairseq_cli.train][INFO] - training on 3 devices (GPUs)
[2023-07-24 10:07:11,479][fairseq_cli.train][INFO] - max tokens per device = None and signals per device = 64
[2023-07-24 10:07:11,479][fairseq_signals.trainer][INFO] - Preparing to load checkpoint checkpoints/checkpoint_last.pt
[2023-07-24 10:07:11,479][fairseq_signals.trainer][INFO] - No existing checkpoint found ()
[2023-07-24 10:07:11,479][fairseq_signals.trainer][INFO] - loading train data for epoch 1
[2023-07-24 10:07:12,133][fairseq_signals.data.ecg.raw_ecg_dataset][INFO] - loaded 760412, skipped 0 samples
[2023-07-24 10:07:23,789][fairseq_signals.trainer][INFO] - NOTE: your device may support faster training with --fp16
[2023-07-24 10:07:23,799][fairseq_signals.trainer][INFO] - begin training epoch 1
[2023-07-24 10:07:23,800][fairseq_cli.train][INFO] - Start iterating over samples
Traceback (most recent call last):
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
lambda: hydra.run(
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 107, in run
return run_job(
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/hydra/core/utils.py", line 129, in run_job
ret.return_value = task_function(task_cfg)
File "/home/fairseq-signals/fairseq_cli/hydra_train.py", line 50, in hydra_main
distributed_utils.call_main(cfg, pre_main)
File "/home/fairseq-signals/fairseq_signals/distributed/utils.py", line 126, in call_main
torch.multiprocessing.spawn(
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/fairseq-signals/fairseq_signals/distributed/utils.py", line 112, in distributed_main
main(cfg, **kwargs)
File "/home/fairseq-signals/fairseq_cli/train.py", line 160, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/home/anaconda3/envs/ecg_env/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/fairseq-signals/fairseq_cli/train.py", line 264, in train
log_output = trainer.train_step(samples)
File "/home/anaconda3/envs/ecg_env/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/fairseq-signals/fairseq_signals/trainer.py", line 569, in train_step
raise e
File "/home/fairseq-signals/fairseq_signals/trainer.py", line 537, in train_step
loss, sample_size_i, logging_output = self.task.train_step(
File "/home/fairseq-signals/fairseq_signals/tasks/task.py", line 333, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/home/anaconda3/envs/ecg_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fairseq-signals/fairseq_signals/criterions/3kg_criterion.py", line 73, in forward
pos_mask = torch.masked_select(
RuntimeError: The size of tensor a (128) must match the size of tensor b (384) at non-singleton dimension 1
Again, the training works fine when using a single GPU. The mismatch consistently follows dim(b) = dim(a) × n_gpu: here a = 128 (presumably 2 views × batch_size 64) and b = 384 = 128 × 3 GPUs, and the same relationship holds with 2 GPUs and with other batch sizes. So perhaps the loss calculation is not adapted to run under a DistributedDataParallel setup? A minimal sketch of what I suspect is happening is below.
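For reference, here is a hypothetical standalone repro (my own reconstruction, not the actual `fairseq_signals/criterions/3kg_criterion.py` code). The assumption is that patient IDs get all-gathered across ranks while the similarity matrix stays local; the names `sim`, `global_ids`, and `n_views` are mine:

```python
# Hypothetical repro sketch -- NOT the actual 3kg_criterion.py code.
import torch

world_size = 3    # distributed_world_size from the config above
local_batch = 64  # dataset.batch_size (per device)
n_views = 2       # assumed: two augmented views per ECG, since 2 * 64 = 128

# Each rank scores similarities only over its local batch: (128, 128).
local_feats = torch.randn(n_views * local_batch, 768)
sim = local_feats @ local_feats.t()

# If patient IDs are all-gathered across ranks before the positive mask is
# built, the mask spans the global batch: 2 * 64 * 3 = 384.
global_ids = torch.randint(0, 1000, (n_views * local_batch * world_size,))
pos_mask = global_ids.unsqueeze(0) == global_ids.unsqueeze(1)  # (384, 384)

# Broadcasting (128, 128) against (384, 384) fails like the traceback above.
try:
    torch.masked_select(sim, pos_mask)
except RuntimeError as e:
    print(e)  # "The size of tensor a (128) must match ... (384) ..."
```

If that is indeed the cause, one common remedy in contrastive DDP code is to all-gather the features as well, so the similarity matrix also covers the global batch. This is only a sketch of that pattern under the same assumption, not a patch against fairseq-signals, and `gather_with_grad` is a name I made up:

```python
import torch
import torch.distributed as dist

def gather_with_grad(local_feats: torch.Tensor) -> torch.Tensor:
    """All-gather features so every rank scores against the global batch.

    Plain all_gather returns tensors detached from autograd, so the local
    slice is substituted back in to keep gradients flowing on this rank.
    """
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_feats) for _ in range(world_size)]
    dist.all_gather(gathered, local_feats)
    gathered[dist.get_rank()] = local_feats  # restore grad-tracking tensor
    return torch.cat(gathered, dim=0)        # shape: (world_size * 2B, D)
```

With features gathered this way, `sim` would be (384, 384) on every rank and match the mask. Is the 3KG criterion intended to contrast against the global batch under DDP, or should the patient-ID gathering be restricted to the local batch instead?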