Comments (6)

nglehuy commented on June 4, 2024

@liuyibox Can you share the config?

liuyibox commented on June 4, 2024

> @liuyibox Can you share the config?

Below is my config. By the way, I removed the MirroredStrategy from the code because I could not make it run with it, so that is probably the cause of the error: the original code compiles the model inside strategy.scope(), and I removed those strategy parts from the train.py file.

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /home/liuyi/TensorFlowASR/vocabularies/librispeech/librispeech_train_4_1030.subwords
  target_vocab_size: 1000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv
    tfrecords_dir: null
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/dev-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    num_epochs: 1
    checkpoint:
      filepath: /home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}
      save_best_only: False
      save_weights_only: False
      save_freq: epoch
      verbose: 1
    states_dir: /home/liuyi/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /home/liuyi/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
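
For context, the train.py later in this thread passes this checkpoint: block directly as keyword arguments to Keras's ModelCheckpoint, so save_weights_only: False asks Keras to serialize the entire model (architecture plus weights) at every epoch end, which is the code path behind the reported error. A minimal sketch of the equivalent call, with values copied from the config above:

import tensorflow as tf

# Equivalent of the checkpoint: block above. With save_weights_only=False,
# Keras saves the full model at each epoch end instead of just the weights.
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="/home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}",
    save_best_only=False,
    save_weights_only=False,
    save_freq="epoch",
    verbose=1,
)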

nglehuy commented on June 4, 2024

@liuyibox There are still some issues when using "save_weights_only: False" (I'm working on this), so you should set "save_weights_only: True" to store only the weights in the checkpoints.

The MirroredStrategy can work with 1 GPU. If you have multiple GPUs, you can pass --devices=[0,1] to use only GPUs 0 and 1, or --devices=[0] or --devices=[1] to use the single corresponding GPU.
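
Under the hood, --devices=[0,1] amounts to restricting which GPUs TensorFlow can see before the strategy is created. A minimal sketch using the plain tf.distribute API (not the repo's exact helper):

import tensorflow as tf

# Expose only physical GPUs 0 and 1 to TensorFlow; this must run before
# any other GPU initialization happens in the process.
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_visible_devices([gpus[0], gpus[1]], "GPU")

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Build and compile the model inside the scope so its variables are
    # mirrored across the selected devices.
    ...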

liuyibox commented on June 4, 2024

> @liuyibox There are still some issues when using "save_weights_only: False" (I'm working on this), so you should set "save_weights_only: True" to store only the weights in the checkpoints.
>
> The MirroredStrategy can work with 1 GPU. If you have multiple GPUs, you can pass --devices=[0,1] to use only GPUs 0 and 1, or --devices=[0] or --devices=[1] to use the single corresponding GPU.

Thanks @usimarit
Here is my train.py. With the MirroredStrategy, the training process waits forever after loading the cuDNN library three times (once per GPU card) and never reaches the progress bar. During this wait, GPU memory and utilization are fully saturated even though I installed NCCL, so I think the GPUs are busy with something else. Because of this I had to remove the strategy. Any hints on why the MirroredStrategy keeps waiting and never proceeds with training? The current train.py can run on only the first GPU card, i.e., [0].

import os
import fire
import math
from tensorflow_asr.utils import env_util

logger = env_util.setup_environment()
import tensorflow as tf

from tensorflow_asr.configs.config import Config
from tensorflow_asr.helpers import featurizer_helpers, dataset_helpers
from tensorflow_asr.models.transducer.conformer import Conformer
from tensorflow_asr.optimizers.schedules import TransformerSchedule


DEFAULT_YAML = os.path.join(os.path.abspath(os.path.dirname(__file__)), "config.yml")


def main(
    config: str = DEFAULT_YAML,
    tfrecords: bool = False,
    sentence_piece: bool = False,
    subwords: bool = True,
    bs: int = None,
    spx: int = 1,
    metadata: str = None,
    static_length: bool = False,
    devices: list = [0,1,2],
    mxp: bool = True,
    pretrained: str = None,
):
    tf.keras.backend.clear_session()
#    tf.config.optimizer.set_experimental_options({"auto_mixed_precision": mxp})

    config = Config(config)

    speech_featurizer, text_featurizer = featurizer_helpers.prepare_featurizers(
        config=config,
        subwords=subwords,
        sentence_piece=sentence_piece,
    )

    train_dataset, eval_dataset = dataset_helpers.prepare_training_datasets(
        config=config,
        speech_featurizer=speech_featurizer,
        text_featurizer=text_featurizer,
        tfrecords=tfrecords,
        metadata=metadata,
    )

    if not static_length:
        speech_featurizer.reset_length()
        text_featurizer.reset_length()

    train_data_loader, eval_data_loader, global_batch_size = dataset_helpers.prepare_training_data_loaders(
        config=config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        batch_size=bs,
    )

    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(speech_featurizer.shape, prediction_shape=text_featurizer.prepand_shape, batch_size=global_batch_size)
    if pretrained:
        conformer.load_weights(pretrained, by_name=True, skip_mismatch=True)
    conformer.summary(line_length=100)
    optimizer = tf.keras.optimizers.Adam(
        TransformerSchedule(
            d_model=conformer.dmodel,
            warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
            max_lr=(0.05 / math.sqrt(conformer.dmodel)),
        ),
        **config.learning_config.optimizer_config
    )
    conformer.compile(
        optimizer=optimizer,
        experimental_steps_per_execution=spx,
        global_batch_size=global_batch_size,
        blank=text_featurizer.blank,
        run_eagerly=True,
    )

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(**config.learning_config.running_config.checkpoint),
        tf.keras.callbacks.experimental.BackupAndRestore(config.learning_config.running_config.states_dir),
        tf.keras.callbacks.TensorBoard(**config.learning_config.running_config.tensorboard),
    ]

    conformer.fit(
        train_data_loader,
        epochs=config.learning_config.running_config.num_epochs,
        validation_data=eval_data_loader,
        callbacks=callbacks,
        steps_per_epoch=train_dataset.total_steps,
        validation_steps=eval_dataset.total_steps if eval_data_loader else None,
    )


if __name__ == "__main__":
    # NOTE: this exposes only physical GPU 1 to TensorFlow, and calling
    # main() directly (instead of fire.Fire(main)) means CLI flags such as
    # --devices are ignored here.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    main()
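
For what it's worth, a common culprit for this kind of multi-GPU stall is the NCCL all-reduce itself. A hedged sketch of re-adding the removed strategy block with a different cross-device op (standard tf.distribute API, not TensorFlowASR-specific; conformer, optimizer, and the featurizers are the names from the script above):

# Hypothetical re-insertion of the removed strategy block. If NCCL hangs,
# HierarchicalCopyAllReduce is a drop-in alternative cross-device op.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)
with strategy.scope():
    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(speech_featurizer.shape, prediction_shape=text_featurizer.prepand_shape, batch_size=global_batch_size)
    conformer.compile(optimizer=optimizer, global_batch_size=global_batch_size, blank=text_featurizer.blank)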

nglehuy commented on June 4, 2024

@liuyibox Did the training get to the stage where the model's summary is printed (when the mirrored strategy is applied)?
Also, run_eagerly=True means the training step is not wrapped in tf.function, which slows down training; eager mode should be used for debugging only.
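
Concretely, dropping the flag (or setting it to False, the Keras default) lets the train step be traced into a graph. A minimal adjustment to the compile call in the script above:

# run_eagerly defaults to False; with it, the train/test steps are traced
# into tf.function graphs instead of executing op-by-op in Python.
conformer.compile(
    optimizer=optimizer,
    experimental_steps_per_execution=spx,
    global_batch_size=global_batch_size,
    blank=text_featurizer.blank,
    run_eagerly=False,  # set True only while debugging; eager mode is much slower
)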

liuyibox commented on June 4, 2024

This issue is solved when I use "save_weights_only: True". I will open another issue for the MirroredStrategy training problem. Thank you.
