Comments (6)

nglehuy commented on June 4, 2024

@liuyibox Can you share the config?

liuyibox commented on June 4, 2024

> @liuyibox Can you share the config?

Below is my config. By the way, I removed the MirroredStrategy from the code because I could not make it run with it, so that is probably the cause of the error: the original code compiles the model inside strategy.scope(), and I removed those strategy parts from the train.py file.

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /home/liuyi/TensorFlowASR/vocabularies/librispeech/librispeech_train_4_1030.subwords
  target_vocab_size: 1000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv
    tfrecords_dir: null
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/dev-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    num_epochs: 1
    checkpoint:
      filepath: /home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}
      save_best_only: False
      save_weights_only: False
      save_freq: epoch
      verbose: 1
    states_dir: /home/liuyi/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /home/liuyi/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
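
For context, the train.py later in this thread passes this checkpoint: block directly as keyword arguments to Keras's ModelCheckpoint, so save_weights_only: False asks Keras to serialize the entire model (architecture plus weights) at every epoch end, which is the code path behind the reported error. A minimal sketch of the equivalent call, with values copied from the config above:

import tensorflow as tf

# Equivalent of the checkpoint: block above. With save_weights_only=False,
# Keras saves the full model at each epoch end instead of just the weights.
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="/home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}",
    save_best_only=False,
    save_weights_only=False,
    save_freq="epoch",
    verbose=1,
)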

nglehuy commented on June 4, 2024

@liuyibox There are still some issues when using "save_weights_only: False" (I'm working on this), so you should set "save_weights_only: True" to store only the weights in the checkpoints.

The MirroredStrategy can work with 1 GPU. If you have multiple GPUs, you can pass --devices=[0,1] to use only GPUs 0 and 1, or --devices=[0] or --devices=[1] to use the single corresponding GPU.
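
Under the hood, --devices=[0,1] amounts to restricting which GPUs TensorFlow can see before the strategy is created. A minimal sketch using the plain tf.distribute API (not the repo's exact helper):

import tensorflow as tf

# Expose only physical GPUs 0 and 1 to TensorFlow; this must run before
# any other GPU initialization happens in the process.
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_visible_devices([gpus[0], gpus[1]], "GPU")

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Build and compile the model inside the scope so its variables are
    # mirrored across the selected devices.
    ...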

liuyibox commented on June 4, 2024

> @liuyibox There are still some issues when using "save_weights_only: False" (I'm working on this), so you should set "save_weights_only: True" to store only the weights in the checkpoints.
>
> The MirroredStrategy can work with 1 GPU. If you have multiple GPUs, you can pass --devices=[0,1] to use only GPUs 0 and 1, or --devices=[0] or --devices=[1] to use the single corresponding GPU.

Thanks @usimarit
Here is my train.py. With the MirroredStrategy, the training process waits forever after loading the cuDNN library three times (once per GPU card) and never reaches the progress bar. During this wait, GPU memory and utilization are fully saturated even though I installed NCCL, so I think the GPUs are busy with something else. Because of this I had to remove the strategy. Any hints on why the MirroredStrategy keeps waiting and never proceeds with training? The current train.py can run on only the first GPU card, i.e., [0].

import os
import fire
import math
from tensorflow_asr.utils import env_util

logger = env_util.setup_environment()
import tensorflow as tf

from tensorflow_asr.configs.config import Config
from tensorflow_asr.helpers import featurizer_helpers, dataset_helpers
from tensorflow_asr.models.transducer.conformer import Conformer
from tensorflow_asr.optimizers.schedules import TransformerSchedule


DEFAULT_YAML = os.path.join(os.path.abspath(os.path.dirname(__file__)), "config.yml")


def main(
    config: str = DEFAULT_YAML,
    tfrecords: bool = False,
    sentence_piece: bool = False,
    subwords: bool = True,
    bs: int = None,
    spx: int = 1,
    metadata: str = None,
    static_length: bool = False,
    devices: list = [0,1,2],
    mxp: bool = True,
    pretrained: str = None,
):
    tf.keras.backend.clear_session()
#    tf.config.optimizer.set_experimental_options({"auto_mixed_precision": mxp})

    config = Config(config)

    speech_featurizer, text_featurizer = featurizer_helpers.prepare_featurizers(
        config=config,
        subwords=subwords,
        sentence_piece=sentence_piece,
    )

    train_dataset, eval_dataset = dataset_helpers.prepare_training_datasets(
        config=config,
        speech_featurizer=speech_featurizer,
        text_featurizer=text_featurizer,
        tfrecords=tfrecords,
        metadata=metadata,
    )

    if not static_length:
        speech_featurizer.reset_length()
        text_featurizer.reset_length()

    train_data_loader, eval_data_loader, global_batch_size = dataset_helpers.prepare_training_data_loaders(
        config=config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        batch_size=bs,
    )

    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(speech_featurizer.shape, prediction_shape=text_featurizer.prepand_shape, batch_size=global_batch_size)
    if pretrained:
        conformer.load_weights(pretrained, by_name=True, skip_mismatch=True)
    conformer.summary(line_length=100)
    optimizer = tf.keras.optimizers.Adam(
        TransformerSchedule(
            d_model=conformer.dmodel,
            warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
            max_lr=(0.05 / math.sqrt(conformer.dmodel)),
        ),
        **config.learning_config.optimizer_config
    )
    conformer.compile(
        optimizer=optimizer,
        experimental_steps_per_execution=spx,
        global_batch_size=global_batch_size,
        blank=text_featurizer.blank,
        run_eagerly=True,
    )

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(**config.learning_config.running_config.checkpoint),
        tf.keras.callbacks.experimental.BackupAndRestore(config.learning_config.running_config.states_dir),
        tf.keras.callbacks.TensorBoard(**config.learning_config.running_config.tensorboard),
    ]

    conformer.fit(
        train_data_loader,
        epochs=config.learning_config.running_config.num_epochs,
        validation_data=eval_data_loader,
        callbacks=callbacks,
        steps_per_epoch=train_dataset.total_steps,
        validation_steps=eval_dataset.total_steps if eval_data_loader else None,
    )


if __name__ == "__main__":
    # NOTE: this exposes only physical GPU 1 to TensorFlow, and calling
    # main() directly (instead of fire.Fire(main)) means CLI flags such as
    # --devices are ignored here.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    main()
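
For what it's worth, a common culprit for this kind of multi-GPU stall is the NCCL all-reduce itself. A hedged sketch of re-adding the removed strategy block with a different cross-device op (standard tf.distribute API, not TensorFlowASR-specific; conformer, optimizer, and the featurizers are the names from the script above):

# Hypothetical re-insertion of the removed strategy block. If NCCL hangs,
# HierarchicalCopyAllReduce is a drop-in alternative cross-device op.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)
with strategy.scope():
    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(speech_featurizer.shape, prediction_shape=text_featurizer.prepand_shape, batch_size=global_batch_size)
    conformer.compile(optimizer=optimizer, global_batch_size=global_batch_size, blank=text_featurizer.blank)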

nglehuy commented on June 4, 2024

@liuyibox Did the training get to the stage where the model's summary is printed (when the mirrored strategy is applied)?
Also, run_eagerly=True means the training step is not wrapped in tf.function, which slows down training; eager mode should be used for debugging only.
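
Concretely, dropping the flag (or setting it to False, the Keras default) lets the train step be traced into a graph. A minimal adjustment to the compile call in the script above:

# run_eagerly defaults to False; with it, the train/test steps are traced
# into tf.function graphs instead of executing op-by-op in Python.
conformer.compile(
    optimizer=optimizer,
    experimental_steps_per_execution=spx,
    global_batch_size=global_batch_size,
    blank=text_featurizer.blank,
    run_eagerly=False,  # set True only while debugging; eager mode is much slower
)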

liuyibox commented on June 4, 2024

This issue is solved when I use "save_weights_only: True". I will open another issue for the MirroredStrategy training problem. Thank you.
