athena-team / athena

An open-source implementation of a sequence-to-sequence based speech processing engine

Home Page: https://athena-team.readthedocs.io

License: Apache License 2.0

Python 28.24% Makefile 0.07% C++ 70.75% Shell 0.48% Dockerfile 0.07% CMake 0.16% C 0.23%
speech-recognition asr transformer tensorflow ctc unsupervised-learning sequence-to-sequence deployment wfst speaker-recognition

athena's People

Contributors

chenguoguo, cookingbear, dependabot[bot], garygao99, huang17, hyx100e, jianweisun007, leeyouxie, leixiaoning, neneluo, shuaijiang, shuaijiangke, some-random, studyself, teapoly, tjadamlee, trellixvulnteam, zouwei02

athena's Issues

error from kenlm in "pip install -r requirements.txt"

Hi, thanks for your earlier suggestion. I have deleted "horovod" from the requirements.txt file,
but I have now run into another problem. The detailed error output:


Building wheel for kenlm (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/luxy/demo/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"'; file='"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-vn7hxo83
cwd: /tmp/pip-install-zq8e3scd/kenlm/
Complete output (12 lines):
running bdist_wheel
running build
running build_ext
building 'kenlm' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/util
creating build/temp.linux-x86_64-3.6/lm
creating build/temp.linux-x86_64-3.6/util/double-conversion
creating build/temp.linux-x86_64-3.6/python
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I. -I/home/luxy/demo/venv_athena/include -I/usr/include/python3.6m -c util/exception.cc -o build/temp.linux-x86_64-3.6/util/exception.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -std=c++11
x86_64-linux-gnu-gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

ERROR: Failed building wheel for kenlm
Running setup.py clean for kenlm
Failed to build kenlm
Installing collected packages: kenlm, jieba, pytz, python-dateutil, pandas
Running setup.py install for kenlm ... error
ERROR: Command errored out with exit status 1:
command: /home/luxy/demo/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"'; file='"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-ahtuow06/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxy/demo/venv_athena/include/site/python3.6/kenlm
cwd: /tmp/pip-install-zq8e3scd/kenlm/
Complete output (12 lines):
running install
running build
running build_ext
building 'kenlm' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/util
creating build/temp.linux-x86_64-3.6/lm
creating build/temp.linux-x86_64-3.6/util/double-conversion
creating build/temp.linux-x86_64-3.6/python
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I. -I/home/luxy/demo/venv_athena/include -I/usr/include/python3.6m -c util/exception.cc -o build/temp.linux-x86_64-3.6/util/exception.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -std=c++11
x86_64-linux-gnu-gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /home/luxy/demo/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"'; file='"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-ahtuow06/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxy/demo/venv_athena/include/site/python3.6/kenlm Check the logs for full command output.


How can I solve this? Thank you very much.
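
For anyone hitting the same error: the line "error trying to exec 'cc1plus'" in the log usually means the GCC driver is present but its C++ front end is not installed (on Debian/Ubuntu it is typically provided by the g++ package). Below is a minimal sketch for probing this before rerunning pip. It assumes gcc is the compiler driver, and the function name can_compile_cpp is made up for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def can_compile_cpp(compiler="gcc"):
    """Return True if `compiler` can build a trivial C++ source file.

    The driver picks the C++ front end (cc1plus) from the .cc extension,
    so this fails in the same way the kenlm build does when it is missing.
    """
    if shutil.which(compiler) is None:
        return False  # the driver itself is not on PATH
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "probe.cc"
        obj = Path(tmp) / "probe.o"
        src.write_text("int main() { return 0; }\n")
        result = subprocess.run(
            [compiler, "-c", str(src), "-o", str(obj)],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        return result.returncode == 0


if __name__ == "__main__":
    if can_compile_cpp():
        print("C++ compilation works; the kenlm build should get past this step.")
    else:
        print("C++ compilation failed; install the C++ compiler and retry.")
```

If the probe fails, installing the distribution's C++ compiler package and rerunning "pip install -r requirements.txt" is the usual next step.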

incorrect lm_path in aishell example config

lm_path in decode_config is incorrect in examples/asr/aishell/configs/mtl_transformer_sp.json.

examples/asr/aishell/rnnlm.json => examples/asr/aishell/configs/rnnlm.json
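
A minimal sketch of applying this correction to a loaded config, assuming lm_path sits directly under decode_config as the report describes. The helper name fix_lm_path is hypothetical and not part of athena.

```python
import copy

OLD_LM_PATH = "examples/asr/aishell/rnnlm.json"
NEW_LM_PATH = "examples/asr/aishell/configs/rnnlm.json"


def fix_lm_path(config, old=OLD_LM_PATH, new=NEW_LM_PATH):
    """Return a copy of `config` with decode_config.lm_path corrected.

    The input dict is left untouched; configs without a matching
    lm_path are returned unchanged.
    """
    patched = copy.deepcopy(config)
    decode_config = patched.get("decode_config")
    if isinstance(decode_config, dict) and decode_config.get("lm_path") == old:
        decode_config["lm_path"] = new
    return patched


if __name__ == "__main__":
    # Stand-in for json.load() on mtl_transformer_sp.json.
    config = {"decode_config": {"lm_path": OLD_LM_PATH}}
    print(fix_lm_path(config)["decode_config"]["lm_path"])
```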

An installation issue

Hi,

When I was running the installation step:
pip3.7 install -r requirements.txt

I got the errors shown below (the full output is too long to paste, so I have included excerpts here). I would be very grateful if you could have a look. Is it because I didn't configure the horovod environment correctly, or because I didn't successfully install MPI?

Requirement already satisfied: pyasn1>=0.1.3 in /home/pc21/venv_athena/lib/python3.7/site-packages (from rsa<4.1,>=3.1.4->google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.4.8)
Building wheels for collected packages: horovod, librosa, kenlm, jieba, psutil, pyyaml, audioread, resampy
Building wheel for horovod (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/pc21/venv_athena/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xp_is8ug/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-xp_is8ug/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-4e373ylc
cwd: /tmp/pip-install-xp_is8ug/horovod/

After many lines:

x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fdebug-prefix-map=/build/python3.7-1t2gIN/python3.7-3.7.0~b3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.7/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.7/test_compile/test_link_flags.so
INFO: Cannot find CMake, will skip compiling Horovod with Gloo.
Traceback (most recent call last):
  File "/tmp/pip-install-xp_is8ug/horovod/setup.py", line 341, in get_mpi_flags
    shlex.split(show_command), universal_newlines=True).strip()

Then:

  File "/usr/lib/python3.7/subprocess.py", line 453, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.7/subprocess.py", line 756, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.7/subprocess.py", line 1499, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'

INFO: Cannot find MPI compilation flags, will skip compiling with MPI.

raise RuntimeError('One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.')
RuntimeError: One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.

then

ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Building wheel for librosa (setup.py) ... done
Created wheel for librosa: filename=librosa-0.7.2-py3-none-any.whl size=1612883 sha256=862bd06e9c89bd1f80e6e702b637d5ddf048093d4fd402073d13aebbacdc0799
Stored in directory: /tmp/pip-ephem-wheel-cache-x3z9ygim/wheels/18/9e/42/3224f85730f92fa2925f0b4fb6ef7f9c5431a64dfc77b95b39
Building wheel for kenlm (setup.py) ... error
ERROR: Command errored out with exit status 1:

call() got an unexpected keyword argument 'training'

[1,0]:WARNING:tensorflow:Entity <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7fda720b4b70>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7fda720b4b70>>: AssertionError: Bad argument number for Name: 3, expecting 4
[1,0]:2020-04-05 01:25:48.780970: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
[1,1]:WARNING:tensorflow:Entity <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>>: AssertionError: Bad argument number for Name: 3, expecting 4
[1,1]:WARNING:tensorflow:Entity <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>>: AssertionError: Bad argument number for Name: 3, expecting 4
[1,1]:2020-04-05 01:25:48.903736: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
[1,1]:Traceback (most recent call last):
[1,1]:  File "athena/horovod_main.py", line 42, in <module>
[1,1]:    train(json_file, HorovodSolver, hvd.size(), hvd.rank())
[1,1]:  File "/qssd/athena/athena/main.py", line 117, in train
[1,1]:    p, model, optimizer, checkpointer = build_model_from_jsonfile(jsonfile)
[1,1]:  File "/qssd/athena/athena/main.py", line 105, in build_model_from_jsonfile
[1,1]:    solver.evaluate_step(model.prepare_samples(iter(dataset).next()))
[1,1]:  File "/qssd/athena/athena/solver.py", line 96, in evaluate_step
[1,1]:    logits = self.model(samples, training=False)
[1,1]:  File "/home/quinnqiu/foundation/Q_athena/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
[1,1]:    outputs = self.call(inputs, *args, **kwargs)
[1,1]:  File "/qssd/athena/athena/models/mtl_seq2seq.py", line 69, in call
[1,1]:    self.ctc_logits = self.decoder(encoder_output, training=training)
[1,1]:  File "/home/quinnqiu/foundation/Q_athena/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
[1,1]:    outputs = self.call(inputs, *args, **kwargs)
[1,1]:TypeError: call() got an unexpected keyword argument 'training'
[1,0]:Traceback (most recent call last):
[1,0]:  File "athena/horovod_main.py", line 42, in <module>
[1,0]:    train(json_file, HorovodSolver, hvd.size(), hvd.rank())
[1,0]:  File "/qssd/athena/athena/main.py", line 117, in train
[1,0]:    p, model, optimizer, checkpointer = build_model_from_jsonfile(jsonfile)
[1,0]:  File "/qssd/athena/athena/main.py", line 105, in build_model_from_jsonfile
[1,0]:    solver.evaluate_step(model.prepare_samples(iter(dataset).next()))
[1,0]:  File "/qssd/athena/athena/solver.py", line 96, in evaluate_step
[1,0]:    logits = self.model(samples, training=False)
[1,0]:  File "/home/quinnqiu/foundation/Q_athena/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
[1,0]:    outputs = self.call(inputs, *args, **kwargs)
[1,0]:  File "/qssd/athena/athena/models/mtl_seq2seq.py", line 69, in call
[1,0]:    self.ctc_logits = self.decoder(encoder_output, training=training)
[1,0]:  File "/home/quinnqiu/foundation/Q_athena/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
[1,0]:    outputs = self.call(inputs, *args, **kwargs)
[1,0]:TypeError: call() got an unexpected keyword argument 'training'

The installation itself completed successfully. Environment:
tensorflow 2.0.0b0
CUDA 10.0.0
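
The final TypeError in the traceback arises because the base layer's __call__ forwards every keyword argument to the subclass's call, so a call method that does not declare training rejects it. A TensorFlow-free sketch of that mechanism follows. The classes Layer, OldStyleDecoder and Decoder are stand-ins invented for illustration, not athena or Keras code. Note also that the reported TensorFlow version (2.0.0b0) is a beta, while another log on this page shows requirements.txt pinning tensorflow-gpu==2.0.1, so a version mismatch may be involved; that is a guess, not a confirmed diagnosis.

```python
import inspect


class Layer:
    """Stand-in for a Keras-style base layer: __call__ forwards to call()."""

    def __call__(self, inputs, *args, **kwargs):
        return self.call(inputs, *args, **kwargs)


class OldStyleDecoder(Layer):
    # call() does not declare `training`, so passing it raises TypeError.
    def call(self, inputs):
        return inputs


class Decoder(Layer):
    # call() declares `training`, so the forwarded keyword is accepted.
    def call(self, inputs, training=None):
        return inputs


def accepts_training(layer):
    """Return True if layer.call() has a parameter named `training`."""
    return "training" in inspect.signature(layer.call).parameters


if __name__ == "__main__":
    for layer in (OldStyleDecoder(), Decoder()):
        print(type(layer).__name__, "accepts training:", accepts_training(layer))
    try:
        OldStyleDecoder()("features", training=False)
    except TypeError as err:
        print("TypeError:", err)
```

A check like accepts_training can help locate which layer in a model rejects the keyword before digging into version differences.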

Accelerate decoding

Beam search with CTC joint decoding is really slow, and we need to accelerate it. Two solutions off the top of my head:

1. Split the test set into smaller pieces, then divide and conquer with Horovod.
2. Rewrite the CTC joint decoding part with tf.function-compatible code.

@cookingbear please look into this.
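
The first option can be sketched as follows, assuming each worker knows its rank and the total worker count (with Horovod these would come from hvd.rank() and hvd.size(), the same calls that appear in the horovod_main.py traceback elsewhere on this page). The helper shard_for_rank is hypothetical and only illustrates the splitting step, not athena's decoder.

```python
def shard_for_rank(utterances, rank, size):
    """Return the slice of `utterances` that worker `rank` of `size` decodes.

    Strided slicing gives every worker a disjoint subset, and shard sizes
    differ by at most one utterance.
    """
    if size < 1 or not 0 <= rank < size:
        raise ValueError("need size >= 1 and 0 <= rank < size")
    return utterances[rank::size]


if __name__ == "__main__":
    test_set = ["utt%d" % i for i in range(7)]
    for rank in range(3):
        print(rank, shard_for_rank(test_set, rank, 3))
    # 0 ['utt0', 'utt3', 'utt6']
    # 1 ['utt1', 'utt4']
    # 2 ['utt2', 'utt5']
```

Each worker would decode only its own shard; the per-shard hypotheses then need to be gathered and put back in the original order before scoring.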

install error: Running setup.py install for horovod ... error

Hi,
My installation environment:
VMware 15.0, Ubuntu 18.04, Python 3.6.9

Error:
Running setup.py install for horovod ... error
ERROR: Command errored out with exit status 1:
command: /home/luxury/luxy/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-uelars4k/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxury/luxy/venv_athena/include/site/python3.6/horovod
cwd: /tmp/pip-install-s_f3vxnr/horovod/
Complete output (190 lines):
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/horovod
copying horovod/init.py -> build/lib.linux-x86_64-3.6/horovod
creating build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/init.py -> build/lib.linux-x86_64-3.6/horovod/keras
creating build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/init.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
creating build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/run_task.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/init.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/run.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/task_fn.py -> build/lib.linux-x86_64-3.6/horovod/run
creating build/lib.linux-x86_64-3.6/horovod/spark
copying horovod/spark/init.py -> build/lib.linux-x86_64-3.6/horovod/spark
creating build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/util.py -> build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/init.py -> build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/basics.py -> build/lib.linux-x86_64-3.6/horovod/common
creating build/lib.linux-x86_64-3.6/horovod/mxnet
copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
copying horovod/mxnet/init.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
creating build/lib.linux-x86_64-3.6/horovod/_keras
copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/_keras
copying horovod/_keras/init.py -> build/lib.linux-x86_64-3.6/horovod/_keras
creating build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/init.py -> build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.6/horovod/torch
creating build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
copying horovod/tensorflow/keras/init.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
creating build/lib.linux-x86_64-3.6/horovod/run/task
copying horovod/run/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/run/task
copying horovod/run/task/init.py -> build/lib.linux-x86_64-3.6/horovod/run/task
creating build/lib.linux-x86_64-3.6/horovod/run/http
copying horovod/run/http/http_client.py -> build/lib.linux-x86_64-3.6/horovod/run/http
copying horovod/run/http/init.py -> build/lib.linux-x86_64-3.6/horovod/run/http
copying horovod/run/http/http_server.py -> build/lib.linux-x86_64-3.6/horovod/run/http
creating build/lib.linux-x86_64-3.6/horovod/run/common
copying horovod/run/common/init.py -> build/lib.linux-x86_64-3.6/horovod/run/common
creating build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/network.py -> build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/init.py -> build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/cache.py -> build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/threads.py -> build/lib.linux-x86_64-3.6/horovod/run/util
creating build/lib.linux-x86_64-3.6/horovod/run/driver
copying horovod/run/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/run/driver
copying horovod/run/driver/init.py -> build/lib.linux-x86_64-3.6/horovod/run/driver
creating build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/timeout.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/config_parser.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/secret.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/network.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/settings.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/codec.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/init.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/host_hash.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/env.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
creating build/lib.linux-x86_64-3.6/horovod/run/common/service
copying horovod/run/common/service/task_service.py -> build/lib.linux-x86_64-3.6/horovod/run/common/service
copying horovod/run/common/service/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/run/common/service
copying horovod/run/common/service/init.py -> build/lib.linux-x86_64-3.6/horovod/run/common/service
creating build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
creating build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
creating build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
creating build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
creating build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/init.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/init.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
running build_ext
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -std=c++11 -fPIC -O2 -Wall -fassociative-math -ffast-math -ftree-vectorize -funsafe-math-optimizations -mf16c -mavx -mfma -I/home/luxury/luxy/venv_athena/include -I/usr/include/python3.6m -c build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.cc -o build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.o
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.o -o build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.so
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/luxury/luxy/venv_athena/include -I/usr/include/python3.6m -c build/temp.linux-x86_64-3.6/test_compile/test_link_flags.cc -o build/temp.linux-x86_64-3.6/test_compile/test_link_flags.o
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.6/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.6/test_compile/test_link_flags.so
INFO: Cannot find CMake, will skip compiling Horovod with Gloo.
Traceback (most recent call last):
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 341, in get_mpi_flags
    shlex.split(show_command), universal_newlines=True).strip()
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 622, in get_common_options
    mpi_flags = get_mpi_flags()
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 354, in get_mpi_flags
    '%s' % (show_command, traceback.format_exc()))
distutils.errors.DistutilsPlatformError: mpicxx -show failed (see error below), is MPI in $PATH?
Note: If your version of MPI has a custom command to show compilation flags, please specify it with the HOROVOD_MPICXX_SHOW environment variable.

Traceback (most recent call last):
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 341, in get_mpi_flags
    shlex.split(show_command), universal_newlines=True).strip()
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'


INFO: Cannot find MPI compilation flags, will skip compiling with MPI.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 1566, in <module>
    scripts=['bin/horovodrun'])
  File "/home/luxury/luxy/venv_athena/lib/python3.6/site-packages/setuptools/__init__.py", line 144, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/luxury/luxy/venv_athena/lib/python3.6/site-packages/setuptools/command/install.py", line 61, in run
    return orig.install.run(self)
  File "/usr/lib/python3.6/distutils/command/install.py", line 589, in run
    self.run_command('build')
  File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
    self.run_command(cmd_name)
  File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/luxury/luxy/venv_athena/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 87, in run
    _build_ext.run(self)
  File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
    self.build_extensions()
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 1457, in build_extensions
    options = get_common_options(self)
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 635, in get_common_options
    raise RuntimeError('One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.')
RuntimeError: One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.
----------------------------------------
ERROR: Command errored out with exit status 1: /home/luxury/luxy/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-uelars4k/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxury/luxy/venv_athena/include/site/python3.6/horovod Check the logs for full command output.

How can I solve it?
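
The log shows both probes failing: "Cannot find CMake" (so Gloo is skipped) and mpicxx missing from $PATH (so MPI is skipped), which leaves the Horovod build with neither of the two options it requires. Below is a minimal sketch for checking the two tools before rerunning pip. The function name horovod_build_backends is made up for illustration, and it only checks that the executables are discoverable on PATH, not that they work.

```python
import shutil


def horovod_build_backends(which=shutil.which):
    """Report which build prerequisites named in the log are on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return {
        "gloo": which("cmake") is not None,   # log: "Cannot find CMake"
        "mpi": which("mpicxx") is not None,   # log: "mpicxx -show failed"
    }


if __name__ == "__main__":
    backends = horovod_build_backends()
    if not any(backends.values()):
        print("Neither cmake nor mpicxx was found; the Horovod build will fail.")
    else:
        found = [name for name, ok in backends.items() if ok]
        print("Prerequisites found for:", ", ".join(found))
```

On Ubuntu, CMake and an MPI implementation such as Open MPI are available from the system package manager; installing either one before rerunning pip should let the probe for that option succeed.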

How to Transfer Learning?

Hi, I wonder how to use a pretrained model with a new dataset. The new data is totally different from the original.
Am I supposed to use the MPC model? If so, do I need to recalculate the CMVN for the new dataset?
Or can I skip the MPC stage and fine-tune the later models, i.e. SpeechTransformer and RNNLM, with the new data?
Please give me some guidance. Thanks a lot.

A little bug in decode code while restoring ckpt

Checkpoints are saved starting from index 1 by default, whereas the restoring code reads checkpoint files starting from index 0. In athena/decode_main.py, line 55 could be changed to: ckpt_path = p.ckpt + 'ckpt-' + str(idx + 1)
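
A sketch of the proposed one-based lookup, assuming checkpoints are named ckpt-1, ckpt-2, and so on inside the checkpoint directory, as the report implies. The helper checkpoint_path is hypothetical, and it uses os.path.join rather than plain string concatenation so that a directory given without a trailing slash still yields a valid path.

```python
import os


def checkpoint_path(ckpt_dir, idx):
    """Map a zero-based loop index to the one-based checkpoint file name."""
    if idx < 0:
        raise ValueError("idx must be non-negative")
    return os.path.join(ckpt_dir, "ckpt-" + str(idx + 1))


if __name__ == "__main__":
    # Directory taken from the 'ckpt' hparam printed in the training log on this page.
    for idx in range(3):
        print(checkpoint_path("examples/asr/aishell/ckpts/mpc", idx))
```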

Mix-precisioned training

Hi. We want to use mixed-precision training, since our GPUs don't have much memory and we want to speed up training.

I changed the code to use the mixed-precision training feature in TF2. It works for MPC (stage 1).
But in the fine-tuning stage, the loss becomes NaN at the very beginning.
While debugging, I found that the PositionalEncoding in speech_transformer.py always returns NaN.

        input_labels = layers.Input(shape=data_descriptions.sample_shape["output"], dtype=tf.int32)
        inner = layers.Embedding(self.num_class, d_model)(input_labels)
        inner = PositionalEncoding(d_model, scale=True)(inner) #it returns NaN
        inner = layers.Dropout(self.hparams.rate)(inner)
        self.y_net = tf.keras.Model(inputs=input_labels, outputs=inner, name="y_net")

Could anyone help? Thanks a lot.

horovod terminated.

Pretraining
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:13 2020[0]<stderr>:================================= Parse Parameter ==============================
Wed Apr  8 12:05:13 2020[0]<stderr>:
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:hparams: [('batch_size', 64), ('ckpt', 'examples/asr/aishell/ckpts/mpc'), ('cls', 'main'), ('dataset_builder', 'speech_dataset'), ('decode_config', None), ('devset_config', {'data_csv': 'examples/asr/aishell/data/dev.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('model', 'mpc'), ('model_config', {'return_encoder_output': False, 'num_filters': 512, 'd_model': 512, 'num_heads': 8, 'num_encoder_layers': 12, 'dff': 1280, 'rate': 0.1, 'chunk_size': 1, 'keep_probability': 0.8}), ('num_classes', 40), ('num_data_threads', 1), ('num_epochs', 250), ('optimizer', 'warmup_adam'), ('optimizer_config', {'d_model': 512, 'warmup_steps': 2000, 'k': 0.3}), ('pretrained_model', None), ('solver_config', {'clip_norm': 100, 'log_interval': 10, 'enable_tf_function': True}), ('solver_gpu', [0]), ('sorta_epoch', 1), ('summary_dir', 'examples/asr/aishell/ckpts/mpc/event'), ('testset_config', {'data_csv': 'examples/asr/aishell/data/test.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('trainset_config', {'data_csv': 'examples/asr/aishell/data/train.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]})]
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:13 2020[0]<stderr>:================================= HorovodSolver Init ==============================
Wed Apr  8 12:05:13 2020[0]<stderr>:
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:13 2020[0]<stderr>:================================= Start Train ==============================
Wed Apr  8 12:05:13 2020[0]<stderr>:
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:13 2020[0]<stderr>:======================= build model from json file ======================
Wed Apr  8 12:05:13 2020[0]<stderr>:
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:hparams: [('batch_size', 64), ('ckpt', 'examples/asr/aishell/ckpts/mpc'), ('cls', 'main'), ('dataset_builder', 'speech_dataset'), ('decode_config', None), ('devset_config', {'data_csv': 'examples/asr/aishell/data/dev.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('model', 'mpc'), ('model_config', {'return_encoder_output': False, 'num_filters': 512, 'd_model': 512, 'num_heads': 8, 'num_encoder_layers': 12, 'dff': 1280, 'rate': 0.1, 'chunk_size': 1, 'keep_probability': 0.8}), ('num_classes', 40), ('num_data_threads', 1), ('num_epochs', 250), ('optimizer', 'warmup_adam'), ('optimizer_config', {'d_model': 512, 'warmup_steps': 2000, 'k': 0.3}), ('pretrained_model', None), ('solver_config', {'clip_norm': 100, 'log_interval': 10, 'enable_tf_function': True}), ('solver_gpu', [0]), ('sorta_epoch', 1), ('summary_dir', 'examples/asr/aishell/ckpts/mpc/event'), ('testset_config', {'data_csv': 'examples/asr/aishell/data/test.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('trainset_config', {'data_csv': 'examples/asr/aishell/data/train.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]})]
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:hparams: [('audio_config', {'type': 'Fbank', 'filterbank_channel_count': 40}), ('cls', <class 'athena.data.datasets.speech_set.SpeechDatasetBuilder'>), ('cmvn_file', 'examples/asr/aishell/data/cmvn'), ('data_csv', 'examples/asr/aishell/data/train.csv'), ('input_length_range', [10, 8000])]
Wed Apr  8 12:05:13 2020[0]<stdout>:Fbank params:  [('channel', 1), ('cls', <class 'athena.transform.feats.fbank.Fbank'>), ('delta_delta', False), ('dither', 0.0), ('filterbank_channel_count', 40), ('frame_length', 0.01), ('global_mean', [0.0]), ('global_variance', [1.000001]), ('is_fbank', True), ('local_cmvn', False), ('lower_frequency_limit', 60), ('order', 2), ('output_type', 1), ('preEph_coeff', 0.97), ('raw_energy', 1), ('remove_dc_offset', True), ('snip_edges', 1), ('type', 'Fbank'), ('upper_frequency_limit', 0), ('window', 2), ('window_length', 0.025), ('window_type', 'povey')]
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:Successfully load cmvn file examples/asr/aishell/data/cmvn
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:Loading data from examples/asr/aishell/data/train.csv
Wed Apr  8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.813653: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Wed Apr  8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.828699: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2399915000 Hz
Wed Apr  8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.833653: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55aa3912c090 executing computations on platform Host. Devices:
Wed Apr  8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.833693: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
Wed Apr  8 12:05:15 2020[0]<stdout>:Model: "x_net"
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:Layer (type)                 Output Shape              Param #
Wed Apr  8 12:05:15 2020[0]<stdout>:=================================================================
Wed Apr  8 12:05:15 2020[0]<stdout>:input_1 (InputLayer)         [(None, None, 40, 1)]     0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:conv2d (Conv2D)              (None, None, 20, 512)     4608
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:batch_normalization (BatchNo (None, None, 20, 512)     2048
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:tf_op_layer_Relu6 (TensorFlo [(None, None, 20, 512)]   0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:conv2d_1 (Conv2D)            (None, None, 10, 512)     2359296
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:batch_normalization_1 (Batch (None, None, 10, 512)     2048
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:tf_op_layer_Relu6_1 (TensorF [(None, None, 10, 512)]   0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:reshape (Reshape)            (None, None, 5120)        0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:dense (Dense)                (None, None, 512)         2621952
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:positional_encoding (Positio (None, None, 512)         0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:dropout (Dropout)            (None, None, 512)         0
Wed Apr  8 12:05:15 2020[0]<stdout>:=================================================================
Wed Apr  8 12:05:15 2020[0]<stdout>:Total params: 4,989,952
Wed Apr  8 12:05:15 2020[0]<stdout>:Trainable params: 4,987,904
Wed Apr  8 12:05:15 2020[0]<stdout>:Non-trainable params: 2,048
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:None
Wed Apr  8 12:05:16 2020[0]<stderr>:INFO:absl:trying to restore from : examples/asr/aishell/ckpts/mpc
Wed Apr  8 12:05:20 2020[0]<stderr>:INFO:absl:hparams: [('audio_config', {'type': 'Fbank', 'filterbank_channel_count': 40}), ('cls', <class 'athena.data.datasets.speech_set.SpeechDatasetBuilder'>), ('cmvn_file', 'examples/asr/aishell/data/cmvn'), ('data_csv', 'examples/asr/aishell/data/train.csv'), ('input_length_range', [10, 8000])]
Wed Apr  8 12:05:20 2020[0]<stdout>:Fbank params:  [('channel', 1), ('cls', <class 'athena.transform.feats.fbank.Fbank'>), ('delta_delta', False), ('dither', 0.0), ('filterbank_channel_count', 40), ('frame_length', 0.01), ('global_mean', [0.0]), ('global_variance', [1.000001]), ('is_fbank', True), ('local_cmvn', False), ('lower_frequency_limit', 60), ('order', 2), ('output_type', 1), ('preEph_coeff', 0.97), ('raw_energy', 1), ('remove_dc_offset', True), ('snip_edges', 1), ('type', 'Fbank'), ('upper_frequency_limit', 0), ('window', 2), ('window_length', 0.025), ('window_type', 'povey')]
Wed Apr  8 12:05:21 2020[0]<stderr>:INFO:absl:Successfully load cmvn file examples/asr/aishell/data/cmvn
Wed Apr  8 12:05:21 2020[0]<stderr>:INFO:absl:Loading data from examples/asr/aishell/data/train.csv
Wed Apr  8 12:05:22 2020[0]<stderr>:INFO:absl:Creates the sub-dataset which is the 0 part of 1
Wed Apr  8 12:05:22 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:22 2020[0]<stderr>:>>>>> start training in epoch 0===============================
Wed Apr  8 12:05:22 2020[0]<stderr>:
Wed Apr  8 12:05:22 2020[0]<stderr>:INFO:absl:please be patient, enable tf.function, it takes time ...
Process 0 exit with status code 249.
Traceback (most recent call last):
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/bin/horovodrun", line 21, in <module>
    run_commandline()
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/run.py", line 876, in run_commandline
    _run(args)
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/run.py", line 844, in _run
    _launch_job(args, remote_host_names, settings, common_intfs, command)
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/run.py", line 867, in _launch_job
    gloo_run(settings, remote_host_names, common_intfs, env, driver_ip, command)
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 287, in gloo_run
    _launch_jobs(settings, env, host_alloc_plan, remote_host_names, run_command)
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 259, in _launch_jobs
    .format(name=name, code=exit_code))
RuntimeError: Gloo job detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 0
Exit code: 249

redefine SpeechTransformer2

@leixiaoning using the new approach to implement the SpeechTransformer2 call function, we forward twice: the first forward pass generates the predicted target, and the second forward pass integrates the ground truth, so that we can enable scheduled sampling.
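The two-pass idea described above can be sketched roughly as follows. This is an illustration only, not Athena's actual SpeechTransformer2 code: `mix_decoder_inputs`, `two_pass_forward`, `decode_fn` and `predicted_prob` are hypothetical names, the decoder is represented by any callable, and the usual one-step shift between decoder inputs and outputs is ignored for brevity.

```python
import random


def mix_decoder_inputs(ground_truth, predicted, predicted_prob, rng=None):
    """Per position, keep the model's prediction with probability
    `predicted_prob`; otherwise keep the ground-truth token."""
    if len(ground_truth) != len(predicted):
        raise ValueError("ground_truth and predicted must have the same length")
    rng = rng or random.Random()
    return [
        pred if rng.random() < predicted_prob else truth
        for truth, pred in zip(ground_truth, predicted)
    ]


def two_pass_forward(decode_fn, encoder_out, ground_truth, predicted_prob, rng=None):
    """Hypothetical two-pass forward for scheduled sampling."""
    # Pass 1: teacher-forced forward, used only to obtain predicted tokens.
    predicted = decode_fn(encoder_out, ground_truth)
    # Blend the predictions with the ground truth, position by position.
    mixed = mix_decoder_inputs(ground_truth, predicted, predicted_prob, rng)
    # Pass 2: forward again on the blended inputs; this output would feed the loss.
    return decode_fn(encoder_out, mixed)
```

With `predicted_prob` set to 0.0 the second pass sees only the ground truth, so the sketch reduces to plain teacher forcing; raising it exposes the decoder to more of its own predictions.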

Problem: Ran out of memory

Hi all,

I used commit b7b2d91 and trained the aishell example. GPU memory usage kept growing with global_steps until it finally ran out of memory.

os: ubuntu 16.04
tensorflow: 2.0.1

Thank you in advance.

a little bug in aishell example

In the file examples/asr/aishell/local/prepare_data.py, line 47 reads: if not gfile.Exists(os.path.join(dataset_dir, subset)). I think the variable dataset_dir should be replaced with audio_dir.

thchs30 decode error

When I run thchs30 (following the aishell example), I get an error during decoding, although the fine-tuning and language model training steps succeed.
Error message:

ERROR:tensorflow:
Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f77941ae910>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "athena/decode_main.py", line 89, in <module>
    decode(jsonfile, n=5, log_file='nohup_thchs30.out')
  File "athena/decode_main.py", line 73, in decode
    solver.decode(dataset_builder.as_dataset(batch_size=1))
  File "/raid/BH/mitom/athena/athena/solver.py", line 149, in decode
    predictions = self.model.decode(samples, self.hparams, lm_model=self.lm_model)
  File "/raid/BH/mitom/athena/athena/models/mtl_seq2seq.py", line 109, in decode
    history_predictions.write(0, last_predictions)
  File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/util/tf_should_use.py", line 237, in wrapped
    error_in_function=error_in_function)

decoding output

We have to do a couple of things regarding the decoding output:

  1. Add configuration to write file to disk;

  2. Compare WER/CER with standard tools.

Assigning tasks to myself for now.
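As a reference point for the WER/CER comparison, a minimal edit-distance sketch is given below. It is not Athena's scoring code; `edit_distance`, `error_rate`, `wer` and `cer` are hypothetical helper names, and the tokenization (whitespace split for words, characters with spaces removed for CER) is an assumption that may differ from what a standard tool does.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edits divided by total reference length over a corpus."""
    total_edits = 0
    total_ref = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize(ref)
        total_edits += edit_distance(ref_tokens, tokenize(hyp))
        total_ref += len(ref_tokens)
    if total_ref == 0:
        raise ValueError("references contain no tokens")
    return total_edits / total_ref


def wer(refs, hyps):
    # Assumption: words are separated by whitespace.
    return error_rate(refs, hyps, str.split)


def cer(refs, hyps):
    # Assumption: spaces are ignored when counting characters.
    return error_rate(refs, hyps, lambda s: list(s.replace(" ", "")))
```

Scores from a sketch like this are only comparable with an external tool when both apply the same text normalization.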

TIMIT open wav: wave.Error: file does not start with RIFF id

When I run prepare_data.py for TIMIT, the function get_wave_file_length(wav_file) raises the error below:
Traceback (most recent call last):
  File " - examples/asr/timit/local/prepare_data.py", line 116, in <module>
    processor(DATASET_DIR, SUBSET, True, OUTPUT_DIR)
  File " - examples/asr/timit/local/prepare_data.py", line 100, in processor
    convert_audio_and_split_transcript(dataset_dir, subset, subset_csv)
  File "- /examples/asr/timit/local/prepare_data.py", line 59, in convert_audio_and_split_transcript
    files_size_dict[wav_file] = get_wave_file_length(wav_file)
  File "- /athena/utils/misc.py", line 103, in get_wave_file_length
    with wave.open(wave_file) as wav_file:
  File "/usr/lib/python3.5/wave.py", line 499, in open
    return Wave_read(f)
  File "/usr/lib/python3.5/wave.py", line 163, in __init__
    self.initfp(f)
  File "/usr/lib/python3.5/wave.py", line 130, in initfp
    raise Error('file does not start with RIFF id')
wave.Error: file does not start with RIFF id

Does it not support RIFF?
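One plausible cause, assuming the files come straight from the original TIMIT distribution: those .WAV files are stored in NIST SPHERE format, whose header begins with `NIST_1A`, whereas Python's `wave` module only reads RIFF WAVE files. The sketch below checks the first bytes of a file to tell the two apart; `sniff_audio_container` is a hypothetical helper and not part of Athena.

```python
def sniff_audio_container(path):
    """Return 'riff', 'sphere', or 'unknown' based on the file's first bytes."""
    with open(path, "rb") as f:
        magic = f.read(8)
    if magic[:4] == b"RIFF":
        # Standard WAV container; wave.open() can read this header.
        return "riff"
    if magic[:7] == b"NIST_1A":
        # NIST SPHERE header; wave.open() rejects it with the RIFF error.
        return "sphere"
    return "unknown"
```

If a file is reported as `sphere`, it would need to be converted to RIFF WAV (for example with an audio tool that understands SPHERE) before `get_wave_file_length` can open it.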

CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Hi, I use mtl_transformer_sp.json for the fine-tuning stage. Epoch 0 finished, but an error occurs at some point in epoch 1.
BTW, when I run it without "speed_permutation": [0.9, 1.0, 1.1], it works.

Here's the command:
horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/asr/seewo/configs/mtl_transformer_sp.json
Here's the config:
`
{
  "batch_size":24,
  "num_epochs":50,
  "sorta_epoch":1,
  "ckpt":"examples/asr/mine/ckpts/mtl_transformer_ctc_sp/",
  "summary_dir":"examples/asr/mine/ckpts/mtl_transformer_ctc_sp/event",

  "solver_gpu":[0],
  "solver_config":{
    "clip_norm":100,
    "log_interval":10,
    "enable_tf_function":true
  },

  "model":"mtl_transformer_ctc",
  "num_classes": null,
  "pretrained_model": "examples/asr/mine/configs/mpc.json",
  "model_config":{
    "model":"speech_transformer",
    "model_config":{
      "return_encoder_output":true,
      "num_filters":512,
      "d_model":512,
      "num_heads":8,
      "num_encoder_layers":12,
      "num_decoder_layers":6,
      "dff":1280,
      "rate":0.1,
      "label_smoothing_rate":0.0,
      "schedual_sampling_rate":0.9
    },
    "mtl_weight":0.5
  },

  "decode_config":{
    "beam_search":true,
    "beam_size":10,
    "ctc_weight":0.5,
    "lm_weight":0.7,
    "lm_type": "rnn",
    "lm_path":"examples/asr/mine/configs/rnnlm.json"
  },

  "optimizer":"warmup_adam",
  "optimizer_config":{
    "d_model":512,
    "warmup_steps":25000,
    "k":1.0
  },

  "dataset_builder": "speech_recognition_dataset",
  "num_data_threads": 1,
  "trainset_config":{
    "data_csv": "examples/asr/mine/data/train.csv",
    "audio_config":{"type":"Fbank", "filterbank_channel_count":40},
    "cmvn_file":"examples/asr/mine/data/cmvn",
    "text_config": {"type":"vocab", "model":"examples/asr/mine/data/vocab"},
    "speed_permutation": [0.9, 1.0, 1.1],
    "input_length_range":[10, 8000]
  },
  "devset_config":{
    "data_csv": "examples/asr/mine/data/dev.csv",
    "audio_config":{"type":"Fbank", "filterbank_channel_count":40},
    "cmvn_file":"examples/asr/mine/data/cmvn",
    "text_config": {"type":"vocab", "model":"examples/asr/mine/data/vocab"},
    "input_length_range":[10, 8000]
  },
  "testset_config":{
    "data_csv": "examples/asr/mine/data/dev.csv",
    "audio_config":{"type":"Fbank", "filterbank_channel_count":40},
    "cmvn_file":"examples/asr/mine/data/cmvn",
    "text_config": {"type":"vocab", "model":"examples/asr/mine/data/vocab"}
  }
}

`
And here's the log:

[1,0]:INFO:absl:global_steps: 15314 learning_rate: 1.7122e-04 loss: 3.6860 Accuracy: 0.8822 CTCAccuracy: 0.7879 sec/iter: 0.7007
[1,0]:INFO:absl:global_steps: 15324 learning_rate: 1.7133e-04 loss: 9.4535 Accuracy: 0.8708 CTCAccuracy: 0.7742 sec/iter: 0.6727
[1,0]:INFO:absl:global_steps: 15334 learning_rate: 1.7144e-04 loss: 5.3108 Accuracy: 0.8724 CTCAccuracy: 0.7839 sec/iter: 0.6807
[1,0]:INFO:absl:global_steps: 15344 learning_rate: 1.7155e-04 loss: 21.0516 Accuracy: 0.8447 CTCAccuracy: 0.7498 sec/iter: 0.6419
[1,0]:INFO:absl:global_steps: 15354 learning_rate: 1.7166e-04 loss: 11.5386 Accuracy: 0.8252 CTCAccuracy: 0.7421 sec/iter: 0.6913
[1,0]:INFO:absl:global_steps: 15364 learning_rate: 1.7177e-04 loss: 9.2761 Accuracy: 0.8529 CTCAccuracy: 0.7660 sec/iter: 0.7714
[1,0]:INFO:absl:global_steps: 15374 learning_rate: 1.7189e-04 loss: 6.6673 Accuracy: 0.8661 CTCAccuracy: 0.7800 sec/iter: 0.8484
[1,0]:INFO:absl:global_steps: 15384 learning_rate: 1.7200e-04 loss: 8.5238 Accuracy: 0.8668 CTCAccuracy: 0.7932 sec/iter: 0.8024
[1,0]:INFO:absl:global_steps: 15394 learning_rate: 1.7211e-04 loss: 10.3194 Accuracy: 0.8655 CTCAccuracy: 0.7854 sec/iter: 0.6500
[1,0]:INFO:absl:global_steps: 15404 learning_rate: 1.7222e-04 loss: 7.6797 Accuracy: 0.8471 CTCAccuracy: 0.7623 sec/iter: 0.6240
[1,1]:2020-03-28 02:04:52.312873: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[1,1]:2020-03-28 02:04:52.312922: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[1,1]:[ef53b0d505e4:40836] *** Process received signal ***
[1,1]:[ef53b0d505e4:40836] Signal: Aborted (6)
[1,1]:[ef53b0d505e4:40836] Signal code: (-6)
[1,1]:[ef53b0d505e4:40836] [ 0] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f223a69ef20]
[1,1]:[ef53b0d505e4:40836] [ 1] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f223a69ee97]
[1,1]:[ef53b0d505e4:40836] [ 2] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f223a6a0801]
[1,1]:[ef53b0d505e4:40836] [ 3] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x88d59b4)[0x7f2187b219b4]
[1,1]:[ef53b0d505e4:40836] [ 4] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f2187a8d357]
[1,1]:[ef53b0d505e4:40836] [ 5] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f2187a8dbef]
[1,1]:[ef53b0d505e4:40836] [ 6] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f217e5718b1]
[1,1]:[ef53b0d505e4:40836] [ 7] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f217e56efa8]
[1,1]:[ef53b0d505e4:40836] [ 8] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2(+0x167b7cf)[0x7f217ebc87cf]
[1,1]:[ef53b0d505e4:40836] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f223a4486db]
[1,1]:[ef53b0d505e4:40836] [10] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f223a78188f]
[1,1]:[ef53b0d505e4:40836] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 1 with PID 0 on node ef53b0d505e4 exited on signal 6 (Aborted).

Error when run in pure CPU machine

Traceback (most recent call last):
  File "athena/main.py", line 171, in <module>
    BaseSolver.initialize_devices(p.solver_gpu)
  File "/media/runyu/D/works/algorithm/ASR/athena/athena/solver.py", line 54, in initialize_devices
    assert len(gpus) > len(visible_gpu_idx)
AssertionError

The error occurs in the following code:
@staticmethod
def initialize_devices(visible_gpu_idx=None):
    """ initialize hvd devices, should be called firstly """
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus is not None:
        assert len(gpus) > len(visible_gpu_idx)
        for idx in visible_gpu_idx:
            tf.config.experimental.set_visible_devices(gpus[idx], "GPU")

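The reason is that the value of gpus is "[]", not "None", when running on a CPU-only machine, so the check "if gpus is not None" passes and "assert len(gpus) > len(visible_gpu_idx)" still runs.
With no GPUs, len(gpus) is 0 and the assertion fails; if visible_gpu_idx were left as "None", len() would fail on it as well.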
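A possible guard is sketched below as a plain-Python helper, so that it can be checked without TensorFlow installed. `select_visible_gpus` is a hypothetical name and not part of Athena, and treating `visible_gpu_idx=None` as "keep every detected GPU" is an assumption about the intended behavior.

```python
def select_visible_gpus(gpus, visible_gpu_idx=None):
    """Return the devices from `gpus` to expose; [] means run on CPU."""
    if not gpus:
        # CPU-only machine: nothing to select, so skip the GPU checks entirely.
        return []
    if visible_gpu_idx is None:
        # Assumption: no explicit choice means every detected GPU stays visible.
        return list(gpus)
    out_of_range = [idx for idx in visible_gpu_idx if not 0 <= idx < len(gpus)]
    if out_of_range:
        raise ValueError(
            "GPU indices {} are out of range for {} detected GPU(s)".format(
                out_of_range, len(gpus)))
    return [gpus[idx] for idx in visible_gpu_idx]
```

Inside initialize_devices, the returned list could then be passed to tf.config.experimental.set_visible_devices(selected, "GPU") only when it is non-empty, which leaves CPU-only machines untouched.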

error from: pip install -r requirements

environment: GCC=8.3.0, python=3.7.4
Error:
(venv_athena) (base) root@12e6d012d4d5:~/luxy/athena# pip install -r requirements.txt
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: tensorflow-gpu==2.0.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 1)) (2.0.1)
Requirement already satisfied: sox in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 2)) (1.3.7)
Requirement already satisfied: absl-py in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 3)) (0.9.0)
Requirement already satisfied: yapf in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 4)) (0.29.0)
Requirement already satisfied: pylint in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 5)) (2.4.4)
Requirement already satisfied: flake8 in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 6)) (3.7.9)
Collecting horovod
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/c0/31/dae1f224a284ccaf0fd700565a53658bfba9c3d5964719305953e72a11e0/horovod-0.19.1.tar.gz (2.9 MB)
Collecting tqdm
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/4a/1c/6359be64e8301b84160f6f6f7936bbfaaa5e9a4eab6cbc681db07600b949/tqdm-4.45.0-py2.py3-none-any.whl (60 kB)
Collecting sentencepiece
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/11/e0/1264990c559fb945cfb6664742001608e1ed8359eeec6722830ae085062b/sentencepiece-0.1.85-cp37-cp37m-manylinux1_x86_64.whl (1.0 MB)
Processing /root/.cache/pip/wheels/6e/d3/47/7582e7e63ee9127f4773adeb8dcd8490771c063e2607354ba0/librosa-0.7.2-py3-none-any.whl
Collecting kenlm
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/57/54/0cc492b8d7aceb17a9164c6e6b9c9afc2c73706bb39324e8f6fa02f7134a/kenlm-0.tar.gz (1.4 MB)
Processing /root/.cache/pip/wheels/95/1a/6d/75355e7a5c76ed48e2d6cde3b95c4828e83274b93f5392ac96/jieba-0.42.1-py3-none-any.whl
Collecting pandas
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/4a/6a/94b219b8ea0f2d580169e85ed1edc0163743f55aaeca8a44c2e8fc1e344e/pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0 MB)
Requirement already satisfied: keras-applications>=1.0.8 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.0.8)
Requirement already satisfied: protobuf>=3.6.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.11.3)
Requirement already satisfied: tensorflow-estimator<2.1.0,>=2.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.0.1)
Requirement already satisfied: termcolor>=1.1.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.1.0)
Requirement already satisfied: tensorboard<2.1.0,>=2.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.0.2)
Requirement already satisfied: google-pasta>=0.1.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.2.0)
Requirement already satisfied: wrapt>=1.11.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.12.1)
Requirement already satisfied: grpcio>=1.8.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.28.1)
Requirement already satisfied: gast==0.2.2 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.2.2)
Requirement already satisfied: numpy<2.0,>=1.16.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.18.2)
Requirement already satisfied: six>=1.10.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.14.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.2.0)
Requirement already satisfied: astor>=0.6.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.8.1)
Requirement already satisfied: wheel>=0.26; python_version >= "3" in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.34.2)
Requirement already satisfied: keras-preprocessing>=1.0.5 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.1.0)
Requirement already satisfied: mccabe<0.7,>=0.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pylint->-r requirements.txt (line 5)) (0.6.1)
Requirement already satisfied: isort<5,>=4.2.5 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pylint->-r requirements.txt (line 5)) (4.3.21)
Requirement already satisfied: astroid<2.4,>=2.3.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pylint->-r requirements.txt (line 5)) (2.3.3)
Requirement already satisfied: pycodestyle<2.6.0,>=2.5.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from flake8->-r requirements.txt (line 6)) (2.5.0)
Requirement already satisfied: pyflakes<2.2.0,>=2.1.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from flake8->-r requirements.txt (line 6)) (2.1.1)
Requirement already satisfied: entrypoints<0.4.0,>=0.3.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from flake8->-r requirements.txt (line 6)) (0.3)
Requirement already satisfied: cloudpickle in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (1.3.0)
Requirement already satisfied: psutil in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (5.7.0)
Requirement already satisfied: pyyaml in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (5.3.1)
Requirement already satisfied: cffi>=1.4.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (1.14.0)
Processing /root/.cache/pip/wheels/ad/c3/72/f5733d5e4abc9a637c9f6834a1a29429b4cd57b30a4585f91a/resampy-0.2.2-py3-none-any.whl
Processing /root/.cache/pip/wheels/0a/af/f6/aa7eefaad4a35a4f78adbfa0c2a99c53fda489e48132b037e4/audioread-2.1.8-py3-none-any.whl
Collecting decorator>=3.0.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/ed/1b/72a1821152d07cf1d8b6fce298aeb06a7eb90f4d6d41acec9861e7cc6df0/decorator-4.4.2-py2.py3-none-any.whl (9.2 kB)
Collecting numba>=0.43.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a6/91/3af4fcbe6f9c05f5d04d08b955f635fc9e3388b751a7f0af18e71809e10a/numba-0.48.0-cp37-cp37m-manylinux1_x86_64.whl (2.5 MB)
Collecting soundfile>=0.9.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/eb/f2/3cbbbf3b96fb9fa91582c438b574cff3f45b29c772f94c400e2c99ef5db9/SoundFile-0.10.3.post1-py2.py3-none-any.whl (21 kB)
Collecting scikit-learn!=0.19.0,>=0.14.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/41/b6/126263db075fbcc79107749f906ec1c7639f69d2d017807c6574792e517e/scikit_learn-0.22.2.post1-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting scipy>=1.0.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/dd/82/c1fe128f3526b128cfd185580ba40d01371c5d299fcf7f77968e22dfcc2e/scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1 MB)
Collecting joblib>=0.12
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/28/5c/cf6a2b65a321c4a209efcdf64c2689efae2cb62661f8f6f4bb28547cf1bf/joblib-0.14.1-py2.py3-none-any.whl (294 kB)
Collecting pytz>=2017.2
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl (509 kB)
Collecting python-dateutil>=2.6.1
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Requirement already satisfied: h5py in /root/luxy/venv_athena/lib/python3.7/site-packages (from keras-applications>=1.0.8->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.10.0)
Requirement already satisfied: setuptools in /root/luxy/venv_athena/lib/python3.7/site-packages (from protobuf>=3.6.1->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (40.8.0)
Requirement already satisfied: requests<3,>=2.21.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.23.0)
Requirement already satisfied: google-auth<2,>=1.6.3 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.13.1)
Requirement already satisfied: markdown>=2.6.8 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.2.1)
Requirement already satisfied: werkzeug>=0.11.15 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.0.1)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.4.1)
Requirement already satisfied: lazy-object-proxy==1.4.* in /root/luxy/venv_athena/lib/python3.7/site-packages (from astroid<2.4,>=2.3.0->pylint->-r requirements.txt (line 5)) (1.4.3)
Requirement already satisfied: typed-ast<1.5,>=1.4.0; implementation_name == "cpython" and python_version < "3.8" in /root/luxy/venv_athena/lib/python3.7/site-packages (from astroid<2.4,>=2.3.0->pylint->-r requirements.txt (line 5)) (1.4.1)
Requirement already satisfied: pycparser in /root/luxy/venv_athena/lib/python3.7/site-packages (from cffi>=1.4.0->horovod->-r requirements.txt (line 7)) (2.20)
Collecting llvmlite<0.32.0,>=0.31.0dev0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a0/10/d02c0ac683fc47ecda3426249509cf771d748b6a2c0e9d5ebbee76a7b80a/llvmlite-0.31.0-cp37-cp37m-manylinux1_x86_64.whl (20.2 MB)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.25.8)
Requirement already satisfied: idna<3,>=2.5 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.9)
Requirement already satisfied: certifi>=2017.4.17 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2020.4.5.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.2.8)
Requirement already satisfied: rsa<4.1,>=3.1.4 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (4.0)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (4.0.0)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.3.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.1.0)
Building wheels for collected packages: horovod, kenlm
Building wheel for horovod (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /root/luxy/venv_athena/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-j394y9lx/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-j394y9lx/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-f6z4f15k
cwd: /tmp/pip-install-j394y9lx/horovod/
Complete output (190 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/horovod
copying horovod/init.py -> build/lib.linux-x86_64-3.7/horovod
creating build/lib.linux-x86_64-3.7/horovod/mxnet
copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
copying horovod/mxnet/init.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
creating build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/init.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
creating build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/util.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/basics.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/init.py -> build/lib.linux-x86_64-3.7/horovod/common
creating build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/task_fn.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/run.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/gloo_run.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/mpi_run.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/run_task.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/init.py -> build/lib.linux-x86_64-3.7/horovod/run
creating build/lib.linux-x86_64-3.7/horovod/spark
copying horovod/spark/init.py -> build/lib.linux-x86_64-3.7/horovod/spark
creating build/lib.linux-x86_64-3.7/horovod/_keras
copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/_keras
copying horovod/_keras/init.py -> build/lib.linux-x86_64-3.7/horovod/_keras
creating build/lib.linux-x86_64-3.7/horovod/keras
copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/keras
copying horovod/keras/init.py -> build/lib.linux-x86_64-3.7/horovod/keras
creating build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/init.py -> build/lib.linux-x86_64-3.7/horovod/torch
creating build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/init.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
creating build/lib.linux-x86_64-3.7/horovod/run/task
copying horovod/run/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/run/task
copying horovod/run/task/init.py -> build/lib.linux-x86_64-3.7/horovod/run/task
creating build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/cache.py -> build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/network.py -> build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/threads.py -> build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/init.py -> build/lib.linux-x86_64-3.7/horovod/run/util
creating build/lib.linux-x86_64-3.7/horovod/run/common
copying horovod/run/common/init.py -> build/lib.linux-x86_64-3.7/horovod/run/common
creating build/lib.linux-x86_64-3.7/horovod/run/http
copying horovod/run/http/http_server.py -> build/lib.linux-x86_64-3.7/horovod/run/http
copying horovod/run/http/init.py -> build/lib.linux-x86_64-3.7/horovod/run/http
copying horovod/run/http/http_client.py -> build/lib.linux-x86_64-3.7/horovod/run/http
creating build/lib.linux-x86_64-3.7/horovod/run/driver
copying horovod/run/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/run/driver
copying horovod/run/driver/init.py -> build/lib.linux-x86_64-3.7/horovod/run/driver
creating build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/network.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/timeout.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/config_parser.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/codec.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/host_hash.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/env.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/secret.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/settings.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/init.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
creating build/lib.linux-x86_64-3.7/horovod/run/common/service
copying horovod/run/common/service/task_service.py -> build/lib.linux-x86_64-3.7/horovod/run/common/service
copying horovod/run/common/service/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/run/common/service
copying horovod/run/common/service/init.py -> build/lib.linux-x86_64-3.7/horovod/run/common/service
creating build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
creating build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
creating build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
creating build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
creating build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/init.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/init.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
running build_ext
gcc -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/include -fPIC -std=c++11 -fPIC -O2 -Wall -fassociative-math -ffast-math -ftree-vectorize -funsafe-math-optimizations -mf16c -mavx -mfma -I/root/luxy/venv_athena/include -I/root/anaconda3/include/python3.7m -c build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.cc -o build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /root/anaconda3/compiler_compat -L/root/anaconda3/lib -Wl,-rpath=/root/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ -Wl,-rpath,/lib -L/lib -fPIC -I/include build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.o -o build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.so
gcc -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/include -fPIC -I/root/luxy/venv_athena/include -I/root/anaconda3/include/python3.7m -c build/temp.linux-x86_64-3.7/test_compile/test_link_flags.cc -o build/temp.linux-x86_64-3.7/test_compile/test_link_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /root/anaconda3/compiler_compat -L/root/anaconda3/lib -Wl,-rpath=/root/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ -Wl,-rpath,/lib -L/lib -fPIC -I/include -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.7/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.7/test_compile/test_link_flags.so
INFO: Cannot find CMake, will skip compiling Horovod with Gloo.
Traceback (most recent call last):
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
File "/root/anaconda3/lib/python3.7/subprocess.py", line 395, in check_output
**kwargs).stdout
File "/root/anaconda3/lib/python3.7/subprocess.py", line 472, in run
with Popen(*popenargs, **kwargs) as process:
File "/root/anaconda3/lib/python3.7/subprocess.py", line 775, in init
restore_signals, start_new_session)
File "/root/anaconda3/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 622, in get_common_options
mpi_flags = get_mpi_flags()
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 354, in get_mpi_flags
'%s' % (show_command, traceback.format_exc()))
distutils.errors.DistutilsPlatformError: mpicxx -show failed (see error below), is MPI in $PATH?
Note: If your version of MPI has a custom command to show compilation flags, please specify it with the HOROVOD_MPICXX_SHOW environment variable.

Traceback (most recent call last):
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
File "/root/anaconda3/lib/python3.7/subprocess.py", line 395, in check_output
**kwargs).stdout
File "/root/anaconda3/lib/python3.7/subprocess.py", line 472, in run
with Popen(*popenargs, **kwargs) as process:
File "/root/anaconda3/lib/python3.7/subprocess.py", line 775, in init
restore_signals, start_new_session)
File "/root/anaconda3/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'

INFO: Cannot find MPI compilation flags, will skip compiling with MPI.
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 1566, in
scripts=['bin/horovodrun'])
File "/root/luxy/venv_athena/lib/python3.7/site-packages/setuptools/init.py", line 145, in setup
return distutils.core.setup(**attrs)
File "/root/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/root/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/root/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/luxy/venv_athena/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 223, in run
self.run_command('build')
File "/root/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/anaconda3/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/root/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/luxy/venv_athena/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 78, in run
_build_ext.run(self)
File "/root/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 1457, in build_extensions
options = get_common_options(self)
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 635, in get_common_options
raise RuntimeError('One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.')
RuntimeError: One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.

ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Building wheel for kenlm (setup.py) ... \

How can I solve this?
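
The log above points to two missing build tools: `mpicxx` is not on `$PATH`, and CMake was not found, so Horovod could be built with neither MPI nor Gloo. A minimal sketch for checking both before re-running `pip install`, using only the Python standard library (the helper name `missing_build_tools` is hypothetical and not part of Athena or Horovod):

```python
import shutil


def missing_build_tools(tools=("mpicxx", "cmake")):
    """Return the names of the given executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_build_tools()
    if missing:
        # Per the log, Horovod needs MPI (mpicxx) or Gloo (which needs CMake).
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("mpicxx and cmake are both on PATH")
```

If either tool is reported missing, installing an MPI distribution (so that `mpicxx` is available) or CMake (so that Gloo can be built) should address the `RuntimeError` at the end of the log; which one is appropriate depends on the system.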

asr decoding is too slow.

The current decoding is much too slow: it takes almost 5 seconds to decode a single utterance. I saw that there is an ongoing pull request for decoding optimization. When will it be merged into the master branch?

Validation scripts

We need scripts to validate things such as the data directory structure. Otherwise we won't know if a certain step fails. I'll assign it to myself for now, but it may take some time before I get back to this.
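
As a rough sketch of what such a check could look like, the function below verifies that a data directory contains a set of expected files and reports the ones that are missing. The names in `REQUIRED_FILES` are placeholders borrowed from file names that appear in the example configs (`train.csv`, `cmvn`, `vocab`); the real list would depend on each example.

```python
import os

# Placeholder list; which files each example actually needs is an assumption.
REQUIRED_FILES = ("train.csv", "cmvn", "vocab")


def find_missing_files(data_dir, required=REQUIRED_FILES):
    """Return the required file names that do not exist under data_dir."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


def validate_data_dir(data_dir, required=REQUIRED_FILES):
    """Raise FileNotFoundError listing every missing file, so a failed step is visible."""
    missing = find_missing_files(data_dir, required)
    if missing:
        raise FileNotFoundError(
            "%s is missing: %s" % (data_dir, ", ".join(missing))
        )
```

Calling `validate_data_dir` at the start of each stage would make a failed preparation step stop the pipeline early instead of surfacing later as an unrelated error.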

Memory usage optimization

I had a discussion with @tjadamlee. We think the GPU memory usage of our default examples is far too high, which is blocking a lot of users. We should tune the parameters (e.g., batch size) as well as the model structures to keep memory usage under 8 GB for the default example setups.

@Some-random could you please take the lead and cut the memory usage of all examples to under 8 GB, while preserving as much performance as possible?

cuDNN failed to initialize

806c16f

Hello, genius developers,
I am a Shenlan student. I installed Athena according to the instructions, and no errors were reported along the way. I used the following code to verify that the translation model training is correct.
However, when I train the ASR model in the examples/asr/aishell_sub/ directory and reach the fine-tuning stage, an error is reported.

Traceback (most recent call last):
File "athena/main.py", line 172, in
train(json_file, BaseSolver, 1, 0)
File "athena/main.py", line 117, in train
p, model, optimizer, checkpointer = build_model_from_jsonfile(jsonfile)
File "athena/main.py", line 105, in build_model_from_jsonfile
solver.evaluate_step(model.prepare_samples(iter(dataset).next()))
File "/home/hanzl/work/learn/athena/athena/solver.py", line 95, in evaluate_step
logits = self.model(samples, training=False)
...
...
File "/home/hanzl/work/learn/athena/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]

The message tells me to look at the log output above, where I found the following:

2020-04-06 14:48:17.829495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10283 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:b1:00.0, compute capability: 7.5)
INFO:absl:trying to restore from : examples/asr/aishell/ckpts/mtl_transformer_ctc/
2020-04-06 14:48:21.230124: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-06 14:48:22.942435: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.5.0 but source was compiled with: 7.6.0. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

Can anyone tell me why this happens and how I should solve it? I look forward to your reply. Thank you very much!
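
The last log line states the rule being applied: the cuDNN library loaded at runtime (7.5.0) must have the same major version as the one TensorFlow was compiled against (7.6.0) and an equal or higher minor version. A small sketch of that comparison, written from the wording of the message rather than from TensorFlow's source (the function name is hypothetical):

```python
def cudnn_runtime_is_compatible(loaded, compiled):
    """Apply the rule quoted in the log for cuDNN 7.0 or later:
    same major version, and loaded minor >= compiled minor."""
    loaded_major, loaded_minor = (int(x) for x in loaded.split(".")[:2])
    compiled_major, compiled_minor = (int(x) for x in compiled.split(".")[:2])
    return loaded_major == compiled_major and loaded_minor >= compiled_minor


# Versions taken from the log above.
print(cudnn_runtime_is_compatible("7.5.0", "7.6.0"))  # False
print(cudnn_runtime_is_compatible("7.6.0", "7.6.0"))  # True
```

The check fails here because 7.5 is older than 7.6, which matches the message's advice to upgrade the cuDNN library when using a binary install.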

save log files during training and decoding

Currently users have to save logs with shell redirection, which is easy to forget, yet the log is critical because it is used to extract the n-best checkpoints during decoding. A mechanism to save logs automatically may be needed.
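
One possible shape for this, sketched with the standard `logging` module: attach a file handler next to the console handler so every message is also written to disk. The function name, logger name, and log path are illustrative assumptions, and Athena's own logging setup may differ.

```python
import logging
import os


def setup_logging(log_path, level=logging.INFO):
    """Send log records to both the console and log_path."""
    log_dir = os.path.dirname(log_path)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger("athena_example")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

With something like this called once at startup, the training log would be available for checkpoint selection even when the user forgets to redirect output.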

aishell scripts

Update the aishell examples.

  1. We should start with downloading data in prepare_data.py
  2. Running run.sh should produce some results; please record those results in the README
  3. ...

A bug in rnnlm.json in librispeech example

"dataset_builder": "language_dataset",
"num_data_threads": 1,
"trainset_config":{
"data_csv":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/librispeech/data/train-speaker-id.trans.csv",
"input_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"},
"output_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"}
},
"devset_config":{
"data_csv":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/librispeech/data/test-clean-speaker-id.trans.csv",
"input_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"},
"output_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"}
}
}
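
The issue does not spell out the bug, but one inconsistency is visible in the config: the `data_csv` entries point into `examples/asr/librispeech/`, while every `model` path points into `examples/asr/aishell/`. Assuming that mismatch is what is meant, a sketch like the following could flag such paths in a config. The function name and the `examples/asr/<name>/` convention it checks are assumptions, and the config used here is a shortened stand-in with illustrative paths.

```python
import json
import re

EXAMPLE_DIR = re.compile(r"examples/asr/([^/]+)/")


def paths_outside_example(config, expected):
    """Collect string values whose examples/asr/<name>/ part is not `expected`."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str):
            match = EXAMPLE_DIR.search(node)
            if match and match.group(1) != expected:
                found.append(node)

    walk(config)
    return found


# Shortened stand-in for the config above; the paths are illustrative.
config = json.loads("""
{
  "trainset_config": {
    "data_csv": "examples/asr/librispeech/data/train.csv",
    "input_text_config": {"type": "vocab", "model": "examples/asr/aishell/data/vocab"}
  }
}
""")
suspicious = paths_outside_example(config, "librispeech")
print(suspicious)
```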

python script encoding suggestion

The Python script examples/asr/aishell/local/prepare_data.py fails to run in a Chinese-encoding environment.
I suggest adding the following lines at the top of the script to avoid the problem:

 #!/usr/bin/python
 # -*- coding: utf-8 -*-
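
The `# -*- coding: utf-8 -*-` header tells Python how to decode the source file itself. On Python 3, reading and writing data files is a separate matter: `open()` falls back to the locale's preferred encoding unless one is given, which is a common cause of failures in non-UTF-8 environments. Since the issue does not include the actual error, the following is only a sketch of passing the encoding explicitly; the helper and file name are hypothetical.

```python
import os
import tempfile


def read_transcripts(path):
    """Read a transcript file as UTF-8 regardless of the system locale."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Round-trip a line of Chinese text through a temporary file.
tmp_path = os.path.join(tempfile.mkdtemp(), "transcript.txt")
with open(tmp_path, "w", encoding="utf-8") as f:
    f.write("你好 世界\n")
lines = read_transcripts(tmp_path)
```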

Deadlock seems to occur when using multiple threads

Here's the mtl_transformer.json for the fine-tuning stage. I set num_data_threads to 32, and training hangs at the end of the first training epoch.
horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/asr/hkust/configs/mtl_transformer.json

{
"batch_size":24,
"num_epochs":50,
"sorta_epoch":1,
"ckpt":"examples/asr/hkust/ckpts/mtl_transformer_ctc/",
"summary_dir":"examples/asr/hkust/ckpts/mtl_transformer_ctc/event",

"solver_gpu":[0],
"solver_config":{
"clip_norm":100,
"log_interval":10,
"enable_tf_function":true
},

"model":"mtl_transformer_ctc",
"num_classes": null,
"pretrained_model": "examples/asr/hkust/configs/mpc.json",
"model_config":{
"model":"speech_transformer",
"model_config":{
"return_encoder_output":true,
"num_filters":512,
"d_model":512,
"num_heads":8,
"num_encoder_layers":12,
"num_decoder_layers":6,
"dff":1280,
"rate":0.1,
"label_smoothing_rate":0.0,
"schedual_sampling_rate":0.9
},
"mtl_weight":0.5
},

"decode_config":{
"beam_search":true,
"beam_size":10,
"ctc_weight":0.5,
"lm_type":"ngram",
"lm_weight":0.3,
"lm_path":"examples/asr/hkust/data/5gram.arpa"
},

"optimizer":"warmup_adam",
"optimizer_config":{
"d_model":512,
"warmup_steps":25000,
"k":1.0
},

"dataset_builder": "speech_recognition_dataset",
"num_data_threads": 12,
"trainset_config":{
"data_csv": "examples/asr/hkust/data/train.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/hkust/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"},
"input_length_range":[10, 8000]
},
"devset_config":{
"data_csv": "examples/asr/hkust/data/dev.mini.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/hkust/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"},
"input_length_range":[10, 8000]
},
"testset_config":{
"data_csv": "examples/asr/hkust/data/dev.mini.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/hkust/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"}
}
}

Here's part of the strace log of one of the processes:

restart_syscall(<... resuming interrupted futex ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=448417000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=453560000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=458726000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=463892000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=469063000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=474251000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=479418000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=484583000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=489748000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=494910000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa745ac, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=500123000}, 0xffffffff) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=500123000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=505351000}, 0xffffffff) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=505351000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=510428000}, 0xffffffff) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=510428000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=515504000}, 0xffffffff) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=515504000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=520612000}, 0xffffffff) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=520612000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=525693000}, 0xffffffff) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=525693000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=530785000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=535930000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=541074000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=546212000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=551297000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=556410000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=561516000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=566631000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=571733000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=576836000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=581946000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=587075000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=592220000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=597350000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=602524000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=607693000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=612850000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=617997000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=623167000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=628321000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=633475000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=638630000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=643783000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=648937000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=654094000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=659251000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=664403000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=669553000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=674704000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=679851000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=684998000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=690118000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=695268000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=700402000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=705515000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=710628000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=715747000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=720839000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=725960000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=731070000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=736199000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=741361000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=746488000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=751589000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=756712000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=761827000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=766948000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=772084000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0

Error of decoding stage

Hi,
When I ran the decoding stage, I got the following error message:

<<<
Traceback (most recent call last):
File "athena/decode_main.py", line 87, in
decode(jsonfile, n=5, log_file='nohup.out')
File "athena/decode_main.py", line 65, in decode
v = tf.reduce_mean(tf.concat(v,axis=0),axis=0)
File "/8T_raid/user/venv_athena/lib/python3.5/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/8T_raid/user/venv_athena/lib/python3.5/site-packages/tensorflow_core/python/ops/array_ops.py", line 1431, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/8T_raid/user/venv_athena/lib/python3.5/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1249, in concat_v2
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in
NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "GPU" constraint { name: "T" allowed_values
{ list { type: DT_INT32 } } } host_memory_arg: "values" host_memory_arg: "axis" host_memory_arg: "output"' [Op:ConcatV2] name: concat

horovod.tensorflow

When I ran import horovod.tensorflow, the following error occurred:
Traceback (most recent call last):
File "", line 1, in
File "/adddisk/zhangjin/projects/horovod/horovod/tensorflow/init.py", line 25, in
check_extension('horovod.tensorflow', 'HOROVOD_WITH_TENSORFLOW', file, 'mpi_lib')
File "/adddisk/zhangjin/projects/horovod/horovod/common/util.py", line 51, in check_extension
'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.tensorflow has not been built. If this is not expected, reinstall Horovod with HOROVOD_WITH_TENSORFLOW=1 to debug the build error.

aishell decode error: tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2'....

When I run the aishell decoding scripts independently, I get the error message below:


......
None
best_wer_checkpoint:
[]
Traceback (most recent call last):
File "athena/decode_main.py", line 87, in
decode(jsonfile, n=5, log_file='nohup.out')
File "athena/decode_main.py", line 65, in decode
v = tf.reduce_mean(tf.concat(v,axis=0),axis=0)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 1517, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1118, in concat_v2
_ops.raise_from_not_ok_status(e, name)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "GPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "values" host_memory_arg: "axis" host_memory_arg: "output"' [Op:ConcatV2] name: concat
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
......
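
The N=0 in the NodeDef, together with the empty best_wer_checkpoint: [] printed just before the traceback, suggests that tf.concat was called on an empty list, i.e. no checkpoints were found to average. Below is a minimal sketch of a guard. The helper name average_values is hypothetical and only stands in for the averaging step in decode_main.py; plain Python lists are used instead of tensors so the example is self-contained.

```python
def average_values(values):
    """Element-wise mean over a list of equally sized float lists.

    Hypothetical stand-in for the checkpoint-averaging step; it raises a
    readable error instead of passing an empty list on to a concat op.
    """
    if not values:
        raise ValueError(
            "no checkpoints to average: the checkpoint list is empty; "
            "check that training finished and wrote checkpoints"
        )
    length = len(values[0])
    if any(len(v) != length for v in values):
        raise ValueError("all checkpoints must hold values of the same shape")
    # Sum each position across checkpoints, then divide by the count.
    return [sum(column) / len(values) for column in zip(*values)]


if __name__ == "__main__":
    print(average_values([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

With a check like this, an empty checkpoint list fails early with a message that names the likely cause, rather than surfacing as a kernel constraint error.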

docker

Building some docker environment

'python examples/asr/aishell/local/prepare_data.py' failed


$ python examples/asr/aishell/local/prepare_data.py
Traceback (most recent call last):
File "examples/asr/aishell/local/prepare_data.py", line 25, in
from athena import get_wave_file_length
File "/workspace/users/lpp/source/athena/athena/init.py", line 18, in
from .data import SpeechRecognitionDatasetBuilder
File "/workspace/users/lpp/source/athena/athena/data/init.py", line 18, in
from .datasets.speech_recognition import SpeechRecognitionDatasetBuilder
File "/workspace/users/lpp/source/athena/athena/data/datasets/speech_recognition.py", line 22, in
from athena.transform import AudioFeaturizer
File "/workspace/users/lpp/source/athena/athena/transform/init.py", line 16, in
from athena.transform import audio_featurizer
File "/workspace/users/lpp/source/athena/athena/transform/audio_featurizer.py", line 19, in
from athena.transform import feats
File "/workspace/users/lpp/source/athena/athena/transform/feats/init.py", line 16, in
from athena.transform.feats.read_wav import ReadWav
File "/workspace/users/lpp/source/athena/athena/transform/feats/read_wav.py", line 21, in
from athena.transform.feats.ops import py_x_ops
File "/workspace/users/lpp/source/athena/athena/transform/feats/ops/py_x_ops.py", line 28, in
spectrum = gen_x_ops.spectrum
AttributeError: module '5fa89fc3154996733eabb433e18fa62f' has no attribute 'spectrum'

I would appreciate any help.

assert len(gpus) > len(visible_gpu_idx)

Wed Apr 1 08:54:43 2020[0]:name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.759
Wed Apr 1 08:54:43 2020[0]:pciBusID: 0000:01:00.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.171558: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.173728: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.175027: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.175832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.178045: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.179579: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.183873: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.183987: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.184595: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.185004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
Wed Apr 1 08:54:43 2020[0]:Traceback (most recent call last):
Wed Apr 1 08:54:43 2020[0]: File "athena/main.py", line 171, in
Wed Apr 1 08:54:43 2020[0]: BaseSolver.initialize_devices(p.solver_gpu)
Wed Apr 1 08:54:43 2020[0]: File "/mnt/3T/mygits/ASR-NLP/FrameWorks/athena/athena/solver.py", line 54, in initialize_devices
Wed Apr 1 08:54:43 2020[0]: assert len(gpus) > len(visible_gpu_idx)
Wed Apr 1 08:54:43 2020[0]:AssertionError
Process 0 exit with status code 1.
Traceback (most recent call last):
File "/home/wcl/anaconda3/bin/horovodrun", line 21, in
run_commandline()
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/run.py", line 876, in run_commandline
_run(args)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/run.py", line 844, in _run
_launch_job(args, remote_host_names, settings, common_intfs, command)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/run.py", line 867, in _launch_job
gloo_run(settings, remote_host_names, common_intfs, env, driver_ip, command)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 287, in gloo_run
_launch_jobs(settings, env, host_alloc_plan, remote_host_names, run_command)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 259, in _launch_jobs
.format(name=name, code=exit_code))
RuntimeError: Gloo job detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

I'm trying a single GPU with hvd turned off.
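
The log shows TensorFlow adding one visible GPU (device 0), and the failing check is the strict comparison len(gpus) > len(visible_gpu_idx). If exactly one GPU index was requested on this one-GPU machine (an assumption, since the config is not shown), the comparison becomes 1 > 1, which is false. Below is a small sketch of that arithmetic. The helper gpus_sufficient is hypothetical and is not athena's actual code.

```python
def gpus_sufficient(num_gpus, visible_gpu_idx, strict=True):
    """Return whether the GPU-count check passes.

    strict=True mirrors the comparison shown in the traceback
    (len(gpus) > len(visible_gpu_idx)); strict=False uses >= instead.
    """
    requested = len(visible_gpu_idx)
    return num_gpus > requested if strict else num_gpus >= requested


if __name__ == "__main__":
    # One physical GPU, one requested index: the strict form fails.
    print(gpus_sufficient(1, [0]))                # False
    print(gpus_sufficient(1, [0], strict=False))  # True
```

Whether the strict comparison is intended is for the maintainers to confirm; the sketch only shows why a single-GPU setup can trip it.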

how to speed up cmvn computation

It seems the cmvn computation stage runs as a single process. It takes days on our dataset, which is around 17,000 hours of audio with most clips about 60 seconds long.

Is it possible to run it in parallel?
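
Mean and variance can be accumulated independently per shard of the data and merged afterwards, because only the per-dimension sum, sum of squares, and frame count are needed. Below is a minimal sketch under that assumption. The function names and the shard layout (each shard is a non-empty list of frames, each frame a list of floats) are hypothetical and are not athena's API.

```python
from functools import reduce


def accumulate(frames):
    """Partial statistics for one shard: (sum, sum of squares, count)."""
    dim = len(frames[0])
    total = [0.0] * dim
    total_sq = [0.0] * dim
    for frame in frames:
        for i, x in enumerate(frame):
            total[i] += x
            total_sq[i] += x * x
    return total, total_sq, len(frames)


def merge(a, b):
    """Combine two partial statistics by adding them component-wise."""
    return (
        [x + y for x, y in zip(a[0], b[0])],
        [x + y for x, y in zip(a[1], b[1])],
        a[2] + b[2],
    )


def compute_cmvn(shards, map_fn=map):
    """Global mean and variance; map_fn decides how shards are processed."""
    total, total_sq, count = reduce(merge, map_fn(accumulate, shards))
    mean = [s / count for s in total]
    # var = E[x^2] - (E[x])^2, clamped at 0 against rounding error.
    var = [max(sq / count - m * m, 0.0) for sq, m in zip(total_sq, mean)]
    return mean, var
```

With the default map_fn the shards are processed one after another. Passing the map method of a multiprocessing.Pool as map_fn would run accumulate in worker processes instead, since accumulate is a top-level function and each shard's result is small.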

Lots of read -1 errors dumped during aishell CTC training

A lot of read errors were dumped during the training process, although they seemed to have no impact on the training. Have you ever seen such errors? Is anything wrong with my training set?

....
[1,7]:INFO:absl:perform batch_wise_shuffle with batch_size 16
[1,2]:INFO:absl:please be patient, enable tf.function, it takes time ...
[1,0]:INFO:absl:please be patient, enable tf.function, it takes time ...
[1,7]:INFO:absl:please be patient, enable tf.function, it takes time ...
[1,7]:WARNING:absl:the length of logits is shorter than that of labels
[1,3]:WARNING:absl:the length of logits is shorter than that of labels
[1,0]:WARNING:absl:the length of logits is shorter than that of labels
[1,2]:WARNING:absl:the length of logits is shorter than that of labels
[1,5]:WARNING:absl:the length of logits is shorter than that of labels
[1,6]:WARNING:absl:the length of logits is shorter than that of labels
[1,1]:WARNING:absl:the length of logits is shorter than that of labels
[1,4]:WARNING:absl:the length of logits is shorter than that of labels
[1,7]:WARNING:absl:the length of logits is shorter than that of labels
[1,5]:WARNING:absl:the length of logits is shorter than that of labels
[1,2]:WARNING:absl:the length of logits is shorter than that of labels
[1,3]:WARNING:absl:the length of logits is shorter than that of labels
[1,6]:WARNING:absl:the length of logits is shorter than that of labels
[1,0]:WARNING:absl:the length of logits is shorter than that of labels
[1,4]:WARNING:absl:the length of logits is shorter than that of labels
[1,1]:WARNING:absl:the length of logits is shorter than that of labels
[1,0]:[373f86b536f2:00117] Read -1, expected 5393, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 6345, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 5873, errno = 1
[1,6]:[373f86b536f2:00123] Read -1, expected 5184, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 5184, errno = 1
[1,5]:[373f86b536f2:00122] Read -1, expected 5440, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 4864, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 4928, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 1002048, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 1002048, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 1002048, errno = 1
[1,1]:[373f86b536f2:00118] Read -1, expected 1002048, errno = 1
[1,5]:[373f86b536f2:00122] Read -1, expected 1002048, errno = 1
[1,6]:[373f86b536f2:00123] Read -1, expected 1002048, errno = 1
[1,3]:[373f86b536f2:00120] Read -1, expected 1002048, errno = 1
[1,2]:[373f86b536f2:00119] Read -1, expected 1002048, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 1002048, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 1002048, errno = 1
[1,5]:[373f86b536f2:00122] Read -1, expected 1002048, errno = 1
[1,6]:[373f86b536f2:00123] Read -1, expected 1002048, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 1002048, errno = 1
[1,3]:[373f86b536f2:00120] Read -1, expected 1002048, errno = 1
[1,1]:[373f86b536f2:00118] Read -1, expected 1002048, errno = 1
[1,2]:[373f86b536f2:00119] Read -1, expected 1002048, errno = 1
[1,3]:[373f86b536f2:00120] Read -1, expected 1002048, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 1002048, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 131072, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 20480, errno = 1
[1,0]:INFO:absl:global_steps: 38522 learning_rate: 2.2517e-04 loss: 1.5734 CTCAccuracy: 0.9167 Accuracy: 0.9323
[1,7]:[373f86b536f2:00124] Read -1, expected 5248, errno = 1
...
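
On Linux, errno = 1 is EPERM (operation not permitted), so these lines report a read that was denied rather than corrupt data. A permission error of this kind inside a container (the hostname 373f86b536f2 looks like a Docker container ID) often points at the container's security settings rather than at the training set, though the log alone does not confirm this. Below is a small sketch that tallies such lines by errno name. The line format is taken from the log above, and the function name is hypothetical.

```python
import errno
import re
from collections import Counter

# Matches e.g. "[1,0]:[373f86b536f2:00117] Read -1, expected 5393, errno = 1"
READ_ERROR = re.compile(r"Read -1, expected (\d+), errno = (\d+)")


def summarize_read_errors(lines):
    """Count failed reads per errno name (e.g. 'EPERM')."""
    counts = Counter()
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            code = int(match.group(2))
            # Fall back to the raw number if the platform has no name for it.
            counts[errno.errorcode.get(code, str(code))] += 1
    return counts
```

Running this over a full training log gives a quick view of whether every failed read shares the same cause.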

integrate https://github.com/athena-team/athena-decoder

Start a new branch to integrate athena-decoder into the athena project. @godjealous

  1. For now, we can treat athena-decoder as a separate project and install it using pip.
  2. In athena, we only use the interface provided by athena-decoder.
  3. Evaluate the decoder from two aspects: the speed and the CER.

In the future, we may update athena-decoder and athena simultaneously.
@cookingbear please help hanyang to support this. Thanks.
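
The two evaluation criteria can be measured with very little code. Below is a minimal sketch, assuming CER is computed at corpus level as total character edits divided by total reference characters. The function names are hypothetical and are not part of athena or athena-decoder.

```python
import time


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return edits / chars


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Wrapping a decode call in timed and dividing the elapsed time by the audio duration gives a real-time factor that can be compared between decoders alongside the CER.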

CMVN values gets very large, and the loss of MPC is NaN

Hi, after using our own data (10,000 h) to calculate the cmvn, I find that the var is quite large, and the loss of the MPC stage is NaN from the very beginning. Does anyone have any idea?

speaker mean var
global [41.53281, 50.375763, 53.979485, 55.0042, 55.01829, 55.294025, 55.537567, 55.66456, 55.297874, 54.5779, 54.301113, 54.04989, 53.69498, 53.33427, 53.045273, 52.622414, 52.82048, 53.421753, 53.865902, 53.84243, 53.45809, 52.944252, 53.01819, 53.3373, 53.863102, 54.33686, 54.909252, 55.299534, 55.040947, 54.846294, 54.568024, 53.713165, 52.363804, 49.36468, 45.567574, 45.262226, 44.527786, 45.601566, 45.84235, 45.581955] [-1176.7283, -1758.992, -2028.2766, -2111.5051, -2113.496, -2133.8308, -2152.609, -2164.494, -2136.68, -2079.5215, -2057.8743, -2038.9774, -2011.5557, -1983.8772, -1962.4875, -1930.8735, -1947.2842, -1993.6705, -2028.1543, -2025.6558, -1996.1725, -1956.0061, -1962.5137, -1986.1327, -2026.9241, -2065.188, -2109.8545, -2142.8325, -2122.1377, -2107.376, -2086.5679, -2019.9905, -1916.543, -1694.3958, -1440.5925, -1426.0725, -1364.4426, -1428.7397, -1445.1039, -1433.9104]
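
A variance is non-negative by definition, since var = E[x^2] - (E[x])^2 >= 0, so the negative var values above indicate a problem in how the statistics were accumulated rather than unusually loud data. Taking the square root of a negative variance during normalization yields NaN in most numeric libraries, which would be consistent with the NaN loss, although the report does not confirm that this is the cause. Below is a minimal sanity check with a hypothetical function name.

```python
import math


def invalid_variance_indices(var):
    """Indices whose variance is negative, zero, or not finite."""
    return [i for i, v in enumerate(var) if not math.isfinite(v) or v <= 0.0]


if __name__ == "__main__":
    # First three var values from the report above.
    print(invalid_variance_indices([-1176.7283, -1758.992, -2028.2766]))
    # [0, 1, 2]
```

Running a check like this right after the cmvn stage would flag the statistics before they reach training.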
