Comments (6)
@liuyibox Can you share the config?
Below is my config. By the way, I removed the MirroredStrategy from the code because I could not get it to run with the strategy enabled. This is probably the cause of the error, since the original code compiles the graph inside "with strategy.scope():".
I removed these strategy parts from the train.py file.
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /home/liuyi/TensorFlowASR/vocabularies/librispeech/librispeech_train_4_1030.subwords
  target_vocab_size: 1000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/train-clean-100/transcripts.tsv
    tfrecords_dir: null
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/dev-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /home/liuyi/TensorFlowASR/dataset/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    num_epochs: 1
    checkpoint:
      filepath: /home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}
      save_best_only: False
      save_weights_only: False
      save_freq: epoch
      verbose: 1
    states_dir: /home/liuyi/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /home/liuyi/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
@liuyibox There are still some issues when using "save_weights_only: False" (I'm working on this), so you should use "save_weights_only: True" to store only the weights in the checkpoints.
The MirroredStrategy also works with 1 GPU. If you have multiple GPUs, you can pass --devices=[0,1] to use only GPUs 0 and 1, or --devices=[0] or --devices=[1] to use the corresponding single GPU.
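As a minimal sketch of the workaround above, the checkpoint settings can be copied and forced to weights-only before they reach the callback. The helper name weights_only_checkpoint_kwargs is hypothetical and not part of TensorFlowASR; the dictionary values are taken from the running_config.checkpoint section of the posted config.

```python
def weights_only_checkpoint_kwargs(checkpoint_cfg):
    """Return a copy of the checkpoint settings that stores weights only.

    Hypothetical helper for illustration; not part of TensorFlowASR.
    """
    kwargs = dict(checkpoint_cfg)  # copy so the loaded config is not mutated
    kwargs["save_weights_only"] = True
    return kwargs


# Values copied from the running_config.checkpoint section of the posted config.
checkpoint_cfg = {
    "filepath": "/home/liuyi/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}",
    "save_best_only": False,
    "save_weights_only": False,
    "save_freq": "epoch",
    "verbose": 1,
}
checkpoint_kwargs = weights_only_checkpoint_kwargs(checkpoint_cfg)
# In train.py these could then be passed on as
# tf.keras.callbacks.ModelCheckpoint(**checkpoint_kwargs).
```

Editing "save_weights_only: True" directly in the YAML file has the same effect; the helper is only useful if the config file should stay unchanged.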
Thanks @usimarit
Here is my train.py. With the MirroredStrategy, the training process waits forever after loading the cuDNN library 3 times (once for each of the 3 GPU cards) and never reaches the horizontal progress bar. While it waits, GPU memory and utilization are fully saturated, even though I installed NCCL, so I think the GPUs are busy with something else. That is why I had to remove the strategy. Any hints on why the MirroredStrategy keeps waiting and does not proceed with training? The current train.py can run with only the first GPU card, i.e., [0].
import os
import fire
import math
from tensorflow_asr.utils import env_util

logger = env_util.setup_environment()
import tensorflow as tf

from tensorflow_asr.configs.config import Config
from tensorflow_asr.helpers import featurizer_helpers, dataset_helpers
from tensorflow_asr.models.transducer.conformer import Conformer
from tensorflow_asr.optimizers.schedules import TransformerSchedule

DEFAULT_YAML = os.path.join(os.path.abspath(os.path.dirname(__file__)), "config.yml")


def main(
    config: str = DEFAULT_YAML,
    tfrecords: bool = False,
    sentence_piece: bool = False,
    subwords: bool = True,
    bs: int = None,
    spx: int = 1,
    metadata: str = None,
    static_length: bool = False,
    devices: list = [0, 1, 2],
    mxp: bool = True,
    pretrained: str = None,
):
    tf.keras.backend.clear_session()
    # tf.config.optimizer.set_experimental_options({"auto_mixed_precision": mxp})

    config = Config(config)

    speech_featurizer, text_featurizer = featurizer_helpers.prepare_featurizers(
        config=config,
        subwords=subwords,
        sentence_piece=sentence_piece,
    )

    train_dataset, eval_dataset = dataset_helpers.prepare_training_datasets(
        config=config,
        speech_featurizer=speech_featurizer,
        text_featurizer=text_featurizer,
        tfrecords=tfrecords,
        metadata=metadata,
    )

    if not static_length:
        speech_featurizer.reset_length()
        text_featurizer.reset_length()

    train_data_loader, eval_data_loader, global_batch_size = dataset_helpers.prepare_training_data_loaders(
        config=config,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        batch_size=bs,
    )

    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(speech_featurizer.shape, prediction_shape=text_featurizer.prepand_shape, batch_size=global_batch_size)
    if pretrained:
        conformer.load_weights(pretrained, by_name=True, skip_mismatch=True)
    conformer.summary(line_length=100)

    optimizer = tf.keras.optimizers.Adam(
        TransformerSchedule(
            d_model=conformer.dmodel,
            warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
            max_lr=(0.05 / math.sqrt(conformer.dmodel)),
        ),
        **config.learning_config.optimizer_config
    )

    conformer.compile(
        optimizer=optimizer,
        experimental_steps_per_execution=spx,
        global_batch_size=global_batch_size,
        blank=text_featurizer.blank,
        run_eagerly=True,
    )

    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(**config.learning_config.running_config.checkpoint),
        tf.keras.callbacks.experimental.BackupAndRestore(config.learning_config.running_config.states_dir),
        tf.keras.callbacks.TensorBoard(**config.learning_config.running_config.tensorboard),
    ]

    conformer.fit(
        train_data_loader,
        epochs=config.learning_config.running_config.num_epochs,
        validation_data=eval_data_loader,
        callbacks=callbacks,
        steps_per_epoch=train_dataset.total_steps,
        validation_steps=eval_dataset.total_steps if eval_data_loader else None,
    )


if __name__ == "__main__":
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    main()
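Note that the script above sets CUDA_VISIBLE_DEVICES to "1", which in CUDA's numbering refers to the device with index 1 rather than index 0, and it does so after TensorFlow has already been imported. A common pattern, sketched below, is to set the variable before the first TensorFlow import so that it is in place before any GPU is initialized. The helper name restrict_visible_gpus is hypothetical and not part of TensorFlowASR.

```python
import os


def restrict_visible_gpus(devices):
    """Expose only the given GPU indices to CUDA and return the value that was set.

    Hypothetical helper for illustration; not part of TensorFlowASR.
    """
    value = ",".join(str(d) for d in devices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before the first `import tensorflow`, so the setting is in place
# before any GPU is initialized.
visible = restrict_visible_gpus([0])
```

Inside the process, the remaining visible devices are renumbered from 0, so a device list passed to the training script has to refer to those new indices.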
@liuyibox Did the training reach the stage where the model's summary is printed (when the MirroredStrategy is applied)?
With run_eagerly=True, the model's training step is not wrapped in tf.function, which slows down training. Eager execution should be used for debugging only.
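One way to follow this advice is to tie run_eagerly to an explicit debug switch, so that eager execution is never left on by accident. The sketch below is only an illustration; the helper name compile_options and the debug flag are hypothetical and not part of TensorFlowASR.

```python
def compile_options(debug=False):
    """Build the run_eagerly argument for Model.compile from a debug switch.

    Hypothetical helper for illustration; not part of TensorFlowASR.
    """
    # Eager execution skips tf.function tracing, which makes stepping through
    # the training step easier but slows training down.
    return {"run_eagerly": bool(debug)}


training_options = compile_options()             # normal training: graph mode
debugging_options = compile_options(debug=True)  # debugging only
```

In train.py the result could be unpacked into the existing call, for example conformer.compile(optimizer=optimizer, ..., **compile_options(debug=False)), in place of the hard-coded run_eagerly=True.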
This issue is solved when I use "save_weights_only: True". I will open another issue for the mirrorstrategy training issue. Thank you.