athena-team / athena
an open-source implementation of sequence-to-sequence based speech processing engine
Home Page: https://athena-team.readthedocs.io
License: Apache License 2.0
Hi, thanks for your previous suggestion. I have deleted "horovod" from the requirements.txt file,
but I have now run into another problem. The detailed error:
ERROR: Failed building wheel for kenlm
Running setup.py clean for kenlm
Failed to build kenlm
Installing collected packages: kenlm, jieba, pytz, python-dateutil, pandas
Running setup.py install for kenlm ... error
ERROR: Command errored out with exit status 1:
command: /home/luxy/demo/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"'; file='"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-ahtuow06/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxy/demo/venv_athena/include/site/python3.6/kenlm
cwd: /tmp/pip-install-zq8e3scd/kenlm/
Complete output (12 lines):
running install
running build
running build_ext
building 'kenlm' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/util
creating build/temp.linux-x86_64-3.6/lm
creating build/temp.linux-x86_64-3.6/util/double-conversion
creating build/temp.linux-x86_64-3.6/python
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I. -I/home/luxy/demo/venv_athena/include -I/usr/include/python3.6m -c util/exception.cc -o build/temp.linux-x86_64-3.6/util/exception.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -std=c++11
x86_64-linux-gnu-gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /home/luxy/demo/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"'; file='"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-ahtuow06/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxy/demo/venv_athena/include/site/python3.6/kenlm Check the logs for full command output.
How can I solve it? Thank you very much.
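The key line in the log above is "error trying to exec 'cc1plus'": gcc was found, but its C++ compiler component was not, which usually means the C++ compiler package (g++ on Ubuntu, for example via build-essential) is not installed. A minimal preflight sketch follows. It assumes that finding g++ on the PATH is a good enough proxy for cc1plus being present, and the helper name missing_build_tools is made up for illustration:

```python
import shutil


def missing_build_tools(tools=("gcc", "g++"), which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_build_tools()
    if missing:
        print("Install these before running pip install:", ", ".join(missing))
    else:
        print("C and C++ compilers found on the PATH.")
```

The `which` parameter exists only so the check can be exercised without touching the real PATH.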
The lm_path in decode_config is incorrect in examples/asr/aishell/configs/mtl_transformer_sp.json. It should be changed as follows:
examples/asr/aishell/rnnlm.json => examples/asr/aishell/configs/rnnlm.json
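A stale path like this can be caught before decoding starts. The sketch below assumes the config is a JSON object with a "decode_config" object that may contain "lm_path", as the report above implies, and that paths are relative to the directory the script is run from. The function names are hypothetical and not part of Athena:

```python
import json
import os


def missing_lm_path(config, exists=os.path.exists):
    """Return decode_config.lm_path if it is set but not on disk, else None."""
    decode_config = config.get("decode_config") or {}
    lm_path = decode_config.get("lm_path")
    if lm_path and not exists(lm_path):
        return lm_path
    return None


def check_config_file(config_file):
    """Load a JSON config file and run the lm_path check on it."""
    with open(config_file, encoding="utf-8") as f:
        return missing_lm_path(json.load(f))
```

Calling check_config_file("examples/asr/aishell/configs/mtl_transformer_sp.json") from the repository root would return the offending path, or None when the file it points to exists.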
Hi,
When I was running the installation step:
pip3.7 install -r requirements.txt
I got the errors shown below (it would be too long to paste all of them, so I pasted some screenshots here). I would be very grateful if you could have a look. Is it because I didn't configure the Horovod environment correctly, or because I didn't install MPI successfully?
Requirement already satisfied: pyasn1>=0.1.3 in /home/pc21/venv_athena/lib/python3.7/site-packages (from rsa<4.1,>=3.1.4->google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.4.8)
Building wheels for collected packages: horovod, librosa, kenlm, jieba, psutil, pyyaml, audioread, resampy
Building wheel for horovod (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/pc21/venv_athena/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xp_is8ug/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-xp_is8ug/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-4e373ylc
cwd: /tmp/pip-install-xp_is8ug/horovod/
After many lines:
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fdebug-prefix-map=/build/python3.7-1t2gIN/python3.7-3.7.0~b3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.7/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.7/test_compile/test_link_flags.so
INFO: Cannot find CMake, will skip compiling Horovod with Gloo.
Traceback (most recent call last):
File "/tmp/pip-install-xp_is8ug/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
Then:
File "/usr/lib/python3.7/subprocess.py", line 453, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/python3.7/subprocess.py", line 756, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.7/subprocess.py", line 1499, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'
INFO: Cannot find MPI compilation flags, will skip compiling with MPI.
raise RuntimeError('One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.')
RuntimeError: One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.
ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Building wheel for librosa (setup.py) ... done
Created wheel for librosa: filename=librosa-0.7.2-py3-none-any.whl size=1612883 sha256=862bd06e9c89bd1f80e6e702b637d5ddf048093d4fd402073d13aebbacdc0799
Stored in directory: /tmp/pip-ephem-wheel-cache-x3z9ygim/wheels/18/9e/42/3224f85730f92fa2925f0b4fb6ef7f9c5431a64dfc77b95b39
Building wheel for kenlm (setup.py) ... error
ERROR: Command errored out with exit status 1:
export AUTOGRAPH_VERBOSITY=10
) and attach the full output. Cause: converting <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7fda720b4b70>>: AssertionError: Bad argument number for Name: 3, expecting 4
export AUTOGRAPH_VERBOSITY=10
) and attach the full output. Cause: converting <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>>: AssertionError: Bad argument number for Name: 3, expecting 4
export AUTOGRAPH_VERBOSITY=10
) and attach the full output. Cause: converting <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>>: AssertionError: Bad argument number for Name: 3, expecting 4
installation is complete, and well done.
tensorflow 2.0.0b0
CUDA 10.0.0
Beam search with CTC joint decoding is really slow, and we need to accelerate it. Two solutions off the top of my head:
1. Split the test set into smaller pieces, then divide and conquer with Horovod.
2. Rewrite the CTC joint decoding part with tf.function-compatible code.
@cookingbear please look into this.
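The first option can be sketched in a few lines. This is only an illustration of the sharding idea, not Athena's code: the function name shard is made up, and under Horovod the rank and worker count would come from hvd.rank() and hvd.size().

```python
def shard(items, num_workers, rank):
    """Return the part of `items` that worker `rank` (0-based) should decode."""
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be in [0, num_workers)")
    # Striding keeps the shard sizes within one item of each other.
    return items[rank::num_workers]


if __name__ == "__main__":
    utterances = ["utt%d" % i for i in range(10)]
    for rank in range(4):
        print(rank, shard(utterances, 4, rank))
```

Each worker decodes its own shard independently, and the per-shard results are merged afterwards to compute the final error rate.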
Hi,
My install environment:
VMware 15.0, Ubuntu 18.04, Python 3.6.9
Error:
Running setup.py install for horovod ... error
ERROR: Command errored out with exit status 1:
command: /home/luxury/luxy/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-uelars4k/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxury/luxy/venv_athena/include/site/python3.6/horovod
cwd: /tmp/pip-install-s_f3vxnr/horovod/
Complete output (190 lines):
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/horovod
copying horovod/__init__.py -> build/lib.linux-x86_64-3.6/horovod
creating build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/keras
creating build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
creating build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/run_task.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/__init__.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/run.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/task_fn.py -> build/lib.linux-x86_64-3.6/horovod/run
creating build/lib.linux-x86_64-3.6/horovod/spark
copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark
creating build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/util.py -> build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/basics.py -> build/lib.linux-x86_64-3.6/horovod/common
creating build/lib.linux-x86_64-3.6/horovod/mxnet
copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
copying horovod/mxnet/__init__.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
creating build/lib.linux-x86_64-3.6/horovod/_keras
copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/_keras
copying horovod/_keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/_keras
creating build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.6/horovod/torch
creating build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
copying horovod/tensorflow/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
creating build/lib.linux-x86_64-3.6/horovod/run/task
copying horovod/run/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/run/task
copying horovod/run/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/run/task
creating build/lib.linux-x86_64-3.6/horovod/run/http
copying horovod/run/http/http_client.py -> build/lib.linux-x86_64-3.6/horovod/run/http
copying horovod/run/http/__init__.py -> build/lib.linux-x86_64-3.6/horovod/run/http
copying horovod/run/http/http_server.py -> build/lib.linux-x86_64-3.6/horovod/run/http
creating build/lib.linux-x86_64-3.6/horovod/run/common
copying horovod/run/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/run/common
creating build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/network.py -> build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/cache.py -> build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/threads.py -> build/lib.linux-x86_64-3.6/horovod/run/util
creating build/lib.linux-x86_64-3.6/horovod/run/driver
copying horovod/run/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/run/driver
copying horovod/run/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/run/driver
creating build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/timeout.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/config_parser.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/secret.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/network.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/settings.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/codec.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/__init__.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/host_hash.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/env.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
creating build/lib.linux-x86_64-3.6/horovod/run/common/service
copying horovod/run/common/service/task_service.py -> build/lib.linux-x86_64-3.6/horovod/run/common/service
copying horovod/run/common/service/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/run/common/service
copying horovod/run/common/service/__init__.py -> build/lib.linux-x86_64-3.6/horovod/run/common/service
creating build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
creating build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
creating build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
creating build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
creating build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/__init__.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
running build_ext
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -std=c++11 -fPIC -O2 -Wall -fassociative-math -ffast-math -ftree-vectorize -funsafe-math-optimizations -mf16c -mavx -mfma -I/home/luxury/luxy/venv_athena/include -I/usr/include/python3.6m -c build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.cc -o build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.o
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.o -o build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.so
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/luxury/luxy/venv_athena/include -I/usr/include/python3.6m -c build/temp.linux-x86_64-3.6/test_compile/test_link_flags.cc -o build/temp.linux-x86_64-3.6/test_compile/test_link_flags.o
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.6/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.6/test_compile/test_link_flags.so
INFO: Cannot find CMake, will skip compiling Horovod with Gloo.
Traceback (most recent call last):
File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
**kwargs).stdout
File "/usr/lib/python3.6/subprocess.py", line 423, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 622, in get_common_options
mpi_flags = get_mpi_flags()
File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 354, in get_mpi_flags
'%s' % (show_command, traceback.format_exc()))
distutils.errors.DistutilsPlatformError: mpicxx -show failed (see error below), is MPI in $PATH?
Note: If your version of MPI has a custom command to show compilation flags, please specify it with the HOROVOD_MPICXX_SHOW environment variable.
Traceback (most recent call last):
File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
**kwargs).stdout
File "/usr/lib/python3.6/subprocess.py", line 423, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'
INFO: Cannot find MPI compilation flags, will skip compiling with MPI.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 1566, in <module>
scripts=['bin/horovodrun'])
File "/home/luxury/luxy/venv_athena/lib/python3.6/site-packages/setuptools/__init__.py", line 144, in setup
return distutils.core.setup(**attrs)
File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/luxury/luxy/venv_athena/lib/python3.6/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/usr/lib/python3.6/distutils/command/install.py", line 589, in run
self.run_command('build')
File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/luxury/luxy/venv_athena/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 87, in run
_build_ext.run(self)
File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 1457, in build_extensions
options = get_common_options(self)
File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 635, in get_common_options
raise RuntimeError('One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.')
RuntimeError: One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.
----------------------------------------
ERROR: Command errored out with exit status 1: /home/luxury/luxy/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-uelars4k/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxury/luxy/venv_athena/include/site/python3.6/horovod Check the logs for full command output.
How can I solve it?
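The log above shows Horovod's setup.py probing for two things: CMake (needed to build with Gloo) and mpicxx (needed to build with MPI). Neither was found, hence the "One of Gloo or MPI are required" error. The sketch below runs the same kind of check before pip install. The helper available_horovod_backends is hypothetical, and finding an executable on the PATH does not guarantee that the build will succeed:

```python
import shutil

# Executable that the log above shows setup.py looking for, per backend.
BACKEND_TOOLS = {"gloo": "cmake", "mpi": "mpicxx"}


def available_horovod_backends(which=shutil.which):
    """Return the backends whose build tool can be found on the PATH."""
    return sorted(
        name for name, tool in BACKEND_TOOLS.items() if which(tool) is not None
    )


if __name__ == "__main__":
    backends = available_horovod_backends()
    if backends:
        print("Build tools found for:", ", ".join(backends))
    else:
        print("Neither cmake nor mpicxx is on the PATH; install CMake or an "
              "MPI implementation first, or remove horovod from requirements.txt.")
```

At least one of the two must be present for the Horovod wheel to build, which is why the check is "any of" rather than "all of".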
Hi, I wonder how to use a pretrained model for a new dataset. The new data is totally different.
Am I supposed to use the MPC model? If so, do I need to recalculate the CMVN for the new dataset?
Or can I skip the MPC stage and fine-tune the later models (i.e., SpeechTransformer and RNNLM) with the new data?
Please give me some guidance. Thanks a lot.
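For context on the CMVN question: global CMVN statistics are a mean and a variance per feature dimension, computed over all frames of a dataset, so they are tied to the data they were computed on and are usually recomputed when the data changes substantially. The sketch below shows the computation in plain Python. It is not Athena's own CMVN tool, and the function names are illustrative:

```python
def global_cmvn(utterances):
    """Per-dimension mean and variance over every frame of every utterance.

    `utterances` is an iterable of utterances; each utterance is a list of
    frames, and each frame is a list of floats of the same length.
    """
    count = 0
    total = None
    total_sq = None
    for frames in utterances:
        for frame in frames:
            if total is None:
                total = [0.0] * len(frame)
                total_sq = [0.0] * len(frame)
            for d, value in enumerate(frame):
                total[d] += value
                total_sq[d] += value * value
            count += 1
    if count == 0:
        raise ValueError("no frames given")
    mean = [s / count for s in total]
    # Var[x] = E[x^2] - E[x]^2, clamped at 0 against rounding error.
    variance = [max(sq / count - m * m, 0.0) for sq, m in zip(total_sq, mean)]
    return mean, variance


def apply_cmvn(frame, mean, variance, eps=1e-8):
    """Normalize one frame to zero mean and unit variance per dimension."""
    return [(v - m) / (var + eps) ** 0.5 for v, m, var in zip(frame, mean, variance)]
```

Accumulating sums and squared sums keeps memory use constant, so the whole dataset never has to be held in memory at once.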
Checkpoints are saved starting from index 1 by default, whereas the restoring code reads checkpoint files starting from index 0. In athena/decode_main.py, line 55 could be changed to 'ckpt_path = p.ckpt + 'ckpt-' + str(idx + 1)'.
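The off-by-one can be isolated in a small helper. This is a sketch of the proposed fix, not the actual code in athena/decode_main.py; the name checkpoint_path is made up, and os.path.join is used here instead of plain string concatenation so that a missing trailing slash on the checkpoint directory does not matter:

```python
import os


def checkpoint_path(ckpt_dir, idx):
    """Map a 0-based loop index to the 1-based checkpoint file prefix."""
    if idx < 0:
        raise ValueError("idx must be non-negative")
    return os.path.join(ckpt_dir, "ckpt-" + str(idx + 1))


if __name__ == "__main__":
    for idx in range(3):
        print(checkpoint_path("examples/asr/aishell/ckpts/mpc", idx))
```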
Hi. We want to use mixed-precision mode, as our GPUs don't have much memory and we want to speed up training.
I changed the code to use the mixed-precision training feature in TF2. It works for MPC (stage 1).
But in the fine-tuning stage, the loss becomes NaN at the very beginning.
I tried to debug it and found that the PositionalEncoding in speech_transformer.py always returns NaN.
input_labels = layers.Input(shape=data_descriptions.sample_shape["output"], dtype=tf.int32)
inner = layers.Embedding(self.num_class, d_model)(input_labels)
inner = PositionalEncoding(d_model, scale=True)(inner)  # it returns NaN
inner = layers.Dropout(self.hparams.rate)(inner)
self.y_net = tf.keras.Model(inputs=input_labels, outputs=inner, name="y_net")
Could anyone help? Thanks a lot.
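One way to narrow this down is to compare the layer's output with a reference computed in full precision. The sketch below builds the standard sinusoidal table from the original Transformer formulation in plain Python. Whether Athena's PositionalEncoding matches it exactly, including how scale=True is applied, is an assumption that should be checked against the Athena source:

```python
import math


def positional_encoding(length, d_model):
    """Sinusoidal table: PE[pos][2i] = sin(pos / 10000**(2i/d_model)),
    PE[pos][2i+1] = cos(pos / 10000**(2i/d_model))."""
    table = []
    for pos in range(length):
        row = []
        for j in range(d_model):
            angle = pos / (10000.0 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def has_nan(table):
    """True if any entry of a 2-D list is NaN."""
    return any(math.isnan(v) for row in table for v in row)
```

Every entry of this table lies in [-1, 1], so the table itself stays well inside the float16 range. If the layer still returns NaN under mixed precision, the embedding output fed into it or the scaling step may be worth inspecting next.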
Pretraining
Wed Apr 8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr 8 12:05:13 2020[0]<stderr>:================================= Parse Parameter ==============================
Wed Apr 8 12:05:13 2020[0]<stderr>:
Wed Apr 8 12:05:13 2020[0]<stderr>:INFO:absl:hparams: [('batch_size', 64), ('ckpt', 'examples/asr/aishell/ckpts/mpc'), ('cls', 'main'), ('dataset_builder', 'speech_dataset'), ('decode_config', None), ('devset_config', {'data_csv': 'examples/asr/aishell/data/dev.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('model', 'mpc'), ('model_config', {'return_encoder_output': False, 'num_filters': 512, 'd_model': 512, 'num_heads': 8, 'num_encoder_layers': 12, 'dff': 1280, 'rate': 0.1, 'chunk_size': 1, 'keep_probability': 0.8}), ('num_classes', 40), ('num_data_threads', 1), ('num_epochs', 250), ('optimizer', 'warmup_adam'), ('optimizer_config', {'d_model': 512, 'warmup_steps': 2000, 'k': 0.3}), ('pretrained_model', None), ('solver_config', {'clip_norm': 100, 'log_interval': 10, 'enable_tf_function': True}), ('solver_gpu', [0]), ('sorta_epoch', 1), ('summary_dir', 'examples/asr/aishell/ckpts/mpc/event'), ('testset_config', {'data_csv': 'examples/asr/aishell/data/test.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('trainset_config', {'data_csv': 'examples/asr/aishell/data/train.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]})]
Wed Apr 8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr 8 12:05:13 2020[0]<stderr>:================================= HorovodSolver Init ==============================
Wed Apr 8 12:05:13 2020[0]<stderr>:
Wed Apr 8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr 8 12:05:13 2020[0]<stderr>:================================= Start Train ==============================
Wed Apr 8 12:05:13 2020[0]<stderr>:
Wed Apr 8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr 8 12:05:13 2020[0]<stderr>:======================= build model from json file ======================
Wed Apr 8 12:05:13 2020[0]<stderr>:
Wed Apr 8 12:05:13 2020[0]<stderr>:INFO:absl:hparams: [('batch_size', 64), ('ckpt', 'examples/asr/aishell/ckpts/mpc'), ('cls', 'main'), ('dataset_builder', 'speech_dataset'), ('decode_config', None), ('devset_config', {'data_csv': 'examples/asr/aishell/data/dev.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('model', 'mpc'), ('model_config', {'return_encoder_output': False, 'num_filters': 512, 'd_model': 512, 'num_heads': 8, 'num_encoder_layers': 12, 'dff': 1280, 'rate': 0.1, 'chunk_size': 1, 'keep_probability': 0.8}), ('num_classes', 40), ('num_data_threads', 1), ('num_epochs', 250), ('optimizer', 'warmup_adam'), ('optimizer_config', {'d_model': 512, 'warmup_steps': 2000, 'k': 0.3}), ('pretrained_model', None), ('solver_config', {'clip_norm': 100, 'log_interval': 10, 'enable_tf_function': True}), ('solver_gpu', [0]), ('sorta_epoch', 1), ('summary_dir', 'examples/asr/aishell/ckpts/mpc/event'), ('testset_config', {'data_csv': 'examples/asr/aishell/data/test.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('trainset_config', {'data_csv': 'examples/asr/aishell/data/train.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]})]
Wed Apr 8 12:05:13 2020[0]<stderr>:INFO:absl:hparams: [('audio_config', {'type': 'Fbank', 'filterbank_channel_count': 40}), ('cls', <class 'athena.data.datasets.speech_set.SpeechDatasetBuilder'>), ('cmvn_file', 'examples/asr/aishell/data/cmvn'), ('data_csv', 'examples/asr/aishell/data/train.csv'), ('input_length_range', [10, 8000])]
Wed Apr 8 12:05:13 2020[0]<stdout>:Fbank params: [('channel', 1), ('cls', <class 'athena.transform.feats.fbank.Fbank'>), ('delta_delta', False), ('dither', 0.0), ('filterbank_channel_count', 40), ('frame_length', 0.01), ('global_mean', [0.0]), ('global_variance', [1.000001]), ('is_fbank', True), ('local_cmvn', False), ('lower_frequency_limit', 60), ('order', 2), ('output_type', 1), ('preEph_coeff', 0.97), ('raw_energy', 1), ('remove_dc_offset', True), ('snip_edges', 1), ('type', 'Fbank'), ('upper_frequency_limit', 0), ('window', 2), ('window_length', 0.025), ('window_type', 'povey')]
Wed Apr 8 12:05:13 2020[0]<stderr>:INFO:absl:Successfully load cmvn file examples/asr/aishell/data/cmvn
Wed Apr 8 12:05:13 2020[0]<stderr>:INFO:absl:Loading data from examples/asr/aishell/data/train.csv
Wed Apr 8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.813653: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Wed Apr 8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.828699: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2399915000 Hz
Wed Apr 8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.833653: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55aa3912c090 executing computations on platform Host. Devices:
Wed Apr 8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.833693: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
Wed Apr 8 12:05:15 2020[0]<stdout>:Model: "x_net"
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:Layer (type) Output Shape Param #
Wed Apr 8 12:05:15 2020[0]<stdout>:=================================================================
Wed Apr 8 12:05:15 2020[0]<stdout>:input_1 (InputLayer) [(None, None, 40, 1)] 0
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:conv2d (Conv2D) (None, None, 20, 512) 4608
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:batch_normalization (BatchNo (None, None, 20, 512) 2048
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:tf_op_layer_Relu6 (TensorFlo [(None, None, 20, 512)] 0
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:conv2d_1 (Conv2D) (None, None, 10, 512) 2359296
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:batch_normalization_1 (Batch (None, None, 10, 512) 2048
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:tf_op_layer_Relu6_1 (TensorF [(None, None, 10, 512)] 0
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:reshape (Reshape) (None, None, 5120) 0
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:dense (Dense) (None, None, 512) 2621952
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:positional_encoding (Positio (None, None, 512) 0
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:dropout (Dropout) (None, None, 512) 0
Wed Apr 8 12:05:15 2020[0]<stdout>:=================================================================
Wed Apr 8 12:05:15 2020[0]<stdout>:Total params: 4,989,952
Wed Apr 8 12:05:15 2020[0]<stdout>:Trainable params: 4,987,904
Wed Apr 8 12:05:15 2020[0]<stdout>:Non-trainable params: 2,048
Wed Apr 8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr 8 12:05:15 2020[0]<stdout>:None
Wed Apr 8 12:05:16 2020[0]<stderr>:INFO:absl:trying to restore from : examples/asr/aishell/ckpts/mpc
Wed Apr 8 12:05:20 2020[0]<stderr>:INFO:absl:hparams: [('audio_config', {'type': 'Fbank', 'filterbank_channel_count': 40}), ('cls', <class 'athena.data.datasets.speech_set.SpeechDatasetBuilder'>), ('cmvn_file', 'examples/asr/aishell/data/cmvn'), ('data_csv', 'examples/asr/aishell/data/train.csv'), ('input_length_range', [10, 8000])]
Wed Apr 8 12:05:20 2020[0]<stdout>:Fbank params: [('channel', 1), ('cls', <class 'athena.transform.feats.fbank.Fbank'>), ('delta_delta', False), ('dither', 0.0), ('filterbank_channel_count', 40), ('frame_length', 0.01), ('global_mean', [0.0]), ('global_variance', [1.000001]), ('is_fbank', True), ('local_cmvn', False), ('lower_frequency_limit', 60), ('order', 2), ('output_type', 1), ('preEph_coeff', 0.97), ('raw_energy', 1), ('remove_dc_offset', True), ('snip_edges', 1), ('type', 'Fbank'), ('upper_frequency_limit', 0), ('window', 2), ('window_length', 0.025), ('window_type', 'povey')]
Wed Apr 8 12:05:21 2020[0]<stderr>:INFO:absl:Successfully load cmvn file examples/asr/aishell/data/cmvn
Wed Apr 8 12:05:21 2020[0]<stderr>:INFO:absl:Loading data from examples/asr/aishell/data/train.csv
Wed Apr 8 12:05:22 2020[0]<stderr>:INFO:absl:Creates the sub-dataset which is the 0 part of 1
Wed Apr 8 12:05:22 2020[0]<stderr>:INFO:absl:
Wed Apr 8 12:05:22 2020[0]<stderr>:>>>>> start training in epoch 0===============================
Wed Apr 8 12:05:22 2020[0]<stderr>:
Wed Apr 8 12:05:22 2020[0]<stderr>:INFO:absl:please be patient, enable tf.function, it takes time ...
Process 0 exit with status code 249.
Traceback (most recent call last):
File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/bin/horovodrun", line 21, in <module>
run_commandline()
File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/run.py", line 876, in run_commandline
_run(args)
File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/run.py", line 844, in _run
_launch_job(args, remote_host_names, settings, common_intfs, command)
File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/run.py", line 867, in _launch_job
gloo_run(settings, remote_host_names, common_intfs, env, driver_ip, command)
File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 287, in gloo_run
_launch_jobs(settings, env, host_alloc_plan, remote_host_names, run_command)
File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 259, in _launch_jobs
.format(name=name, code=exit_code))
RuntimeError: Gloo job detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 0
Exit code: 249
@leixiaoning using the new approach to implement the SpeechTransformer2 call function, we forward twice: the first forward pass generates the predicted target and the second integrates the ground truth, so that we can enable scheduled sampling.
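To make the two-pass idea concrete, here is a minimal sketch in plain Python. It is not the SpeechTransformer2 code: the helper name `mix_decoder_inputs`, the `truth_prob` parameter, and the toy token ids are all assumptions for illustration. It only shows the mixing step, where each decoder input for the second pass is taken either from the ground truth or from the first-pass prediction.

```python
import random


def mix_decoder_inputs(ground_truth, predicted, truth_prob, rng):
    """Pick each decoder input token from the ground truth or the first-pass prediction.

    Hypothetical helper: truth_prob is the probability of keeping the
    ground-truth token at a given position.
    """
    if len(ground_truth) != len(predicted):
        raise ValueError("sequences must have the same length")
    mixed = []
    for truth_tok, pred_tok in zip(ground_truth, predicted):
        # rng.random() lies in [0.0, 1.0), so truth_prob=1.0 always keeps the truth
        # and truth_prob=0.0 always keeps the prediction.
        mixed.append(truth_tok if rng.random() < truth_prob else pred_tok)
    return mixed


rng = random.Random(0)
ground_truth = [11, 12, 13, 14]
first_pass = [11, 99, 13, 98]  # stand-in for the output of the first forward pass
second_pass_input = mix_decoder_inputs(ground_truth, first_pass, 0.5, rng)
print(second_pass_input)
```

In a real model the first pass would run without gradient tracking on the sampled tokens, and the mixed sequence would feed the second, trained pass.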
Hi all,
I used commit b7b2d91 and trained the aishell example. GPU memory usage kept growing as global_steps increased, and training finally ran out of memory.
os: ubuntu 16.04
tensorflow: 2.0.1
Thank you in advance.
In the file examples/asr/aishell/local/prepare_data.py, line 47 reads: if not gfile.Exists(os.path.join(dataset_dir, subset)). I think the variable dataset_dir should be replaced with audio_dir.
When I run thchs30 (following the aishell example), I get an error during decoding, although the fine-tuning and language model training steps succeed.
error message:
ERROR:tensorflow:
Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f77941ae910>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
File "athena/decode_main.py", line 89, in <module>
decode(jsonfile, n=5, log_file='nohup_thchs30.out')
File "athena/decode_main.py", line 73, in decode
solver.decode(dataset_builder.as_dataset(batch_size=1))
File "/raid/BH/mitom/athena/athena/solver.py", line 149, in decode
predictions = self.model.decode(samples, self.hparams, lm_model=self.lm_model)
File "/raid/BH/mitom/athena/athena/models/mtl_seq2seq.py", line 109, in decode
history_predictions.write(0, last_predictions)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/util/tf_should_use.py", line 237, in wrapped
error_in_function=error_in_function)
Why was TensorFlow, rather than PyTorch, selected for this project?
I don't understand why the script runs tar -zxvf on the directories aishell_enroll, aishell_major, and aishell_test. Also, the function convert_audio_and_split_transcript does not use its parameter output_dir. Is there a bug in this script?
Finish the results on HKUST, AISHELL, LIBRISPEECH, SWITCHBOARD
We have to do a couple of things regarding decoding output:
Add configuration to write file to disk;
Compare WER/CER with standard tools.
Assigning tasks to myself for now.
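For reference when comparing against standard scoring tools, here is a minimal sketch of what WER and CER compute: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names are assumptions for illustration and this is not Athena's implementation; real scoring tools also apply text normalization that this sketch leaves out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


print(wer("the cat sat", "the cat sit"))
print(cer("abcd", "abxd"))
```

Writing the decoded hypotheses to disk alongside the references would let the same files be scored both by a sketch like this and by an external tool.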
When I run prepare_data.py for TIMIT, the function get_wave_file_length(wav_file) raises the error below:
Traceback (most recent call last):
File " - examples/asr/timit/local/prepare_data.py", line 116, in <module>
processor(DATASET_DIR, SUBSET, True, OUTPUT_DIR)
File " - examples/asr/timit/local/prepare_data.py", line 100, in processor
convert_audio_and_split_transcript(dataset_dir, subset, subset_csv)
File "- /examples/asr/timit/local/prepare_data.py", line 59, in convert_audio_and_split_transcript
files_size_dict[wav_file] = get_wave_file_length(wav_file)
File "- /athena/utils/misc.py", line 103, in get_wave_file_length
with wave.open(wave_file) as wav_file:
File "/usr/lib/python3.5/wave.py", line 499, in open
return Wave_read(f)
File "/usr/lib/python3.5/wave.py", line 163, in __init__
self.initfp(f)
File "/usr/lib/python3.5/wave.py", line 130, in initfp
raise Error('file does not start with RIFF id')
wave.Error: file does not start with RIFF id
Is RIFF not supported?
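The message says the opposite of what the question assumes: Python's `wave` module reads only RIFF WAVE files, and it raises this error when a file's first four bytes are not `RIFF`. The original TIMIT distribution ships its audio as NIST SPHERE files despite the .WAV extension, so that may be what is happening here. The sketch below checks the header bytes to tell the two apart. The helper names are assumptions for illustration, not part of Athena.

```python
def sniff_audio_container(header):
    """Guess the container format from the first bytes of a file."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "riff-wave"
    if header[:7] == b"NIST_1A":
        # NIST SPHERE headers begin with the ASCII tag "NIST_1A"
        return "nist-sphere"
    return "unknown"


def sniff_audio_file(path):
    """Read the first 12 bytes of a file and classify its container."""
    with open(path, "rb") as handle:
        return sniff_audio_container(handle.read(12))


print(sniff_audio_container(b"RIFF\x24\x00\x00\x00WAVEfmt "))
print(sniff_audio_container(b"NIST_1A\n   1024\n"))
```

If the files turn out to be SPHERE, they would need converting to RIFF WAVE before `get_wave_file_length` can read them, for example with a tool such as sox.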
Hi, I use mtl_transformer_sp.json for the fine-tuning stage. Epoch 0 finished, but an error occurs at some point in epoch 1.
BTW, when I run it without "speed_permutation": [0.9, 1.0, 1.1], it works.
Here's the command:
horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/asr/seewo/configs/mtl_transformer_sp.json
Here's the config:
`
{
"batch_size":24,
"num_epochs":50,
"sorta_epoch":1,
"ckpt":"examples/asr/mine/ckpts/mtl_transformer_ctc_sp/",
"summary_dir":"examples/asr/mine/ckpts/mtl_transformer_ctc_sp/event",
"solver_gpu":[0],
"solver_config":{
"clip_norm":100,
"log_interval":10,
"enable_tf_function":true
},
"model":"mtl_transformer_ctc",
"num_classes": null,
"pretrained_model": "examples/asr/mine/configs/mpc.json",
"model_config":{
"model":"speech_transformer",
"model_config":{
"return_encoder_output":true,
"num_filters":512,
"d_model":512,
"num_heads":8,
"num_encoder_layers":12,
"num_decoder_layers":6,
"dff":1280,
"rate":0.1,
"label_smoothing_rate":0.0,
"schedual_sampling_rate":0.9
},
"mtl_weight":0.5
},
"decode_config":{
"beam_search":true,
"beam_size":10,
"ctc_weight":0.5,
"lm_weight":0.7,
"lm_type": "rnn",
"lm_path":"examples/asr/mine/configs/rnnlm.json"
},
"optimizer":"warmup_adam",
"optimizer_config":{
"d_model":512,
"warmup_steps":25000,
"k":1.0
},
"dataset_builder": "speech_recognition_dataset",
"num_data_threads": 1,
"trainset_config":{
"data_csv": "examples/asr/mine/data/train.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/mine/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/mine/data/vocab"},
"speed_permutation": [0.9, 1.0, 1.1],
"input_length_range":[10, 8000]
},
"devset_config":{
"data_csv": "examples/asr/mine/data/dev.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/mine/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/mine/data/vocab"},
"input_length_range":[10, 8000]
},
"testset_config":{
"data_csv": "examples/asr/mine/data/dev.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/mine/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/mine/data/vocab"}
}
}
`
And here's the log:
[1,0]:INFO:absl:global_steps: 15314 learning_rate: 1.7122e-04 loss: 3.6860 Accuracy: 0.8822 CTCAccuracy: 0.7879 sec/iter: 0.7007
[1,0]:INFO:absl:global_steps: 15324 learning_rate: 1.7133e-04 loss: 9.4535 Accuracy: 0.8708 CTCAccuracy: 0.7742 sec/iter: 0.6727
[1,0]:INFO:absl:global_steps: 15334 learning_rate: 1.7144e-04 loss: 5.3108 Accuracy: 0.8724 CTCAccuracy: 0.7839 sec/iter: 0.6807
[1,0]:INFO:absl:global_steps: 15344 learning_rate: 1.7155e-04 loss: 21.0516 Accuracy: 0.8447 CTCAccuracy: 0.7498 sec/iter: 0.6419
[1,0]:INFO:absl:global_steps: 15354 learning_rate: 1.7166e-04 loss: 11.5386 Accuracy: 0.8252 CTCAccuracy: 0.7421 sec/iter: 0.6913
[1,0]:INFO:absl:global_steps: 15364 learning_rate: 1.7177e-04 loss: 9.2761 Accuracy: 0.8529 CTCAccuracy: 0.7660 sec/iter: 0.7714
[1,0]:INFO:absl:global_steps: 15374 learning_rate: 1.7189e-04 loss: 6.6673 Accuracy: 0.8661 CTCAccuracy: 0.7800 sec/iter: 0.8484
[1,0]:INFO:absl:global_steps: 15384 learning_rate: 1.7200e-04 loss: 8.5238 Accuracy: 0.8668 CTCAccuracy: 0.7932 sec/iter: 0.8024
[1,0]:INFO:absl:global_steps: 15394 learning_rate: 1.7211e-04 loss: 10.3194 Accuracy: 0.8655 CTCAccuracy: 0.7854 sec/iter: 0.6500
[1,0]:INFO:absl:global_steps: 15404 learning_rate: 1.7222e-04 loss: 7.6797 Accuracy: 0.8471 CTCAccuracy: 0.7623 sec/iter: 0.6240
[1,1]:2020-03-28 02:04:52.312873: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[1,1]:2020-03-28 02:04:52.312922: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[1,1]:[ef53b0d505e4:40836] *** Process received signal ***
[1,1]:[ef53b0d505e4:40836] Signal: Aborted (6)
[1,1]:[ef53b0d505e4:40836] Signal code: (-6)
[1,1]:[ef53b0d505e4:40836] [ 0] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f223a69ef20]
[1,1]:[ef53b0d505e4:40836] [ 1] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f223a69ee97]
[1,1]:[ef53b0d505e4:40836] [ 2] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f223a6a0801]
[1,1]:[ef53b0d505e4:40836] [ 3] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x88d59b4)[0x7f2187b219b4]
[1,1]:[ef53b0d505e4:40836] [ 4] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f2187a8d357]
[1,1]:[ef53b0d505e4:40836] [ 5] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f2187a8dbef]
[1,1]:[ef53b0d505e4:40836] [ 6] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f217e5718b1]
[1,1]:[ef53b0d505e4:40836] [ 7] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f217e56efa8]
[1,1]:[ef53b0d505e4:40836] [ 8] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2(+0x167b7cf)[0x7f217ebc87cf]
[1,1]:[ef53b0d505e4:40836] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f223a4486db]
[1,1]:[ef53b0d505e4:40836] [10] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f223a78188f]
[1,1]:[ef53b0d505e4:40836] *** End of error message ***
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 1 with PID 0 on node ef53b0d505e4 exited on signal 6 (Aborted).
Using json to configure a sequence model. Please implement this in https://github.com/athena-team/athena/blob/master/athena/models/customized.py
@barius
For example:
python athena/cmvn_main.py \
examples/asr/aishell/configs/mpc.json examples/asr/aishell/data/all.csv || exit 1
Shouldn't it read/write stuff from aishell_sub?
Traceback (most recent call last):
File "athena/main.py", line 171, in <module>
BaseSolver.initialize_devices(p.solver_gpu)
File "/media/runyu/D/works/algorithm/ASR/athena/athena/solver.py", line 54, in initialize_devices
assert len(gpus) > len(visible_gpu_idx)
AssertionError
The error occurred in the following code:
@staticmethod
def initialize_devices(visible_gpu_idx=None):
    """ initialize hvd devices, should be called firstly """
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus is not None:
        assert len(gpus) > len(visible_gpu_idx)
        for idx in visible_gpu_idx:
            tf.config.experimental.set_visible_devices(gpus[idx], "GPU")
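The reason is that on a CPU-only machine the value of gpus is an empty list "[]", not "None", so the "if gpus is not None" branch is still taken.
"assert len(gpus) > len(visible_gpu_idx)" is then evaluated with len(gpus) equal to 0, which can never pass; and if visible_gpu_idx is left as "None", it has no len() at all.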
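One possible guard is sketched below as a pure-Python helper so that it runs without TensorFlow. The name `pick_visible_gpus` and the string stand-ins for devices are assumptions for illustration, not the project's actual fix. In the real method, `gpus` would come from `tf.config.experimental.list_physical_devices("GPU")` and the result would be passed to `tf.config.experimental.set_visible_devices(..., "GPU")`.

```python
def pick_visible_gpus(gpus, visible_gpu_idx=None):
    """Return the devices to expose, or an empty list on a CPU-only machine.

    Hypothetical helper: `gpus` is the list of detected GPU devices and
    `visible_gpu_idx` the indices requested in the config (may be None).
    """
    if not gpus:
        # CPU-only: nothing to select, and no assertion to trip over
        return []
    if visible_gpu_idx is None:
        # No explicit request: expose every detected GPU
        return list(gpus)
    for idx in visible_gpu_idx:
        if not 0 <= idx < len(gpus):
            raise ValueError(
                "GPU index %d requested but only %d GPU(s) found" % (idx, len(gpus))
            )
    return [gpus[idx] for idx in visible_gpu_idx]


print(pick_visible_gpus([], [0]))            # CPU-only machine
print(pick_visible_gpus(["gpu0", "gpu1"], [1]))
```

Checking `if not gpus` covers both `None` and the empty list, and the explicit index check replaces the length assertion with an error message that names the missing device.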
environment: GCC=8.3.0, python=3.7.4
Error:
(venv_athena) (base) root@12e6d012d4d5:~/luxy/athena# pip install -r requirements.txt
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: tensorflow-gpu==2.0.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 1)) (2.0.1)
Requirement already satisfied: sox in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 2)) (1.3.7)
Requirement already satisfied: absl-py in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 3)) (0.9.0)
Requirement already satisfied: yapf in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 4)) (0.29.0)
Requirement already satisfied: pylint in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 5)) (2.4.4)
Requirement already satisfied: flake8 in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 6)) (3.7.9)
Collecting horovod
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/c0/31/dae1f224a284ccaf0fd700565a53658bfba9c3d5964719305953e72a11e0/horovod-0.19.1.tar.gz (2.9 MB)
Collecting tqdm
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/4a/1c/6359be64e8301b84160f6f6f7936bbfaaa5e9a4eab6cbc681db07600b949/tqdm-4.45.0-py2.py3-none-any.whl (60 kB)
Collecting sentencepiece
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/11/e0/1264990c559fb945cfb6664742001608e1ed8359eeec6722830ae085062b/sentencepiece-0.1.85-cp37-cp37m-manylinux1_x86_64.whl (1.0 MB)
Processing /root/.cache/pip/wheels/6e/d3/47/7582e7e63ee9127f4773adeb8dcd8490771c063e2607354ba0/librosa-0.7.2-py3-none-any.whl
Collecting kenlm
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/57/54/0cc492b8d7aceb17a9164c6e6b9c9afc2c73706bb39324e8f6fa02f7134a/kenlm-0.tar.gz (1.4 MB)
Processing /root/.cache/pip/wheels/95/1a/6d/75355e7a5c76ed48e2d6cde3b95c4828e83274b93f5392ac96/jieba-0.42.1-py3-none-any.whl
Collecting pandas
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/4a/6a/94b219b8ea0f2d580169e85ed1edc0163743f55aaeca8a44c2e8fc1e344e/pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0 MB)
Requirement already satisfied: keras-applications>=1.0.8 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.0.8)
Requirement already satisfied: protobuf>=3.6.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.11.3)
Requirement already satisfied: tensorflow-estimator<2.1.0,>=2.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.0.1)
Requirement already satisfied: termcolor>=1.1.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.1.0)
Requirement already satisfied: tensorboard<2.1.0,>=2.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.0.2)
Requirement already satisfied: google-pasta>=0.1.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.2.0)
Requirement already satisfied: wrapt>=1.11.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.12.1)
Requirement already satisfied: grpcio>=1.8.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.28.1)
Requirement already satisfied: gast==0.2.2 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.2.2)
Requirement already satisfied: numpy<2.0,>=1.16.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.18.2)
Requirement already satisfied: six>=1.10.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.14.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.2.0)
Requirement already satisfied: astor>=0.6.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.8.1)
Requirement already satisfied: wheel>=0.26; python_version >= "3" in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.34.2)
Requirement already satisfied: keras-preprocessing>=1.0.5 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.1.0)
Requirement already satisfied: mccabe<0.7,>=0.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pylint->-r requirements.txt (line 5)) (0.6.1)
Requirement already satisfied: isort<5,>=4.2.5 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pylint->-r requirements.txt (line 5)) (4.3.21)
Requirement already satisfied: astroid<2.4,>=2.3.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pylint->-r requirements.txt (line 5)) (2.3.3)
Requirement already satisfied: pycodestyle<2.6.0,>=2.5.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from flake8->-r requirements.txt (line 6)) (2.5.0)
Requirement already satisfied: pyflakes<2.2.0,>=2.1.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from flake8->-r requirements.txt (line 6)) (2.1.1)
Requirement already satisfied: entrypoints<0.4.0,>=0.3.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from flake8->-r requirements.txt (line 6)) (0.3)
Requirement already satisfied: cloudpickle in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (1.3.0)
Requirement already satisfied: psutil in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (5.7.0)
Requirement already satisfied: pyyaml in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (5.3.1)
Requirement already satisfied: cffi>=1.4.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (1.14.0)
Processing /root/.cache/pip/wheels/ad/c3/72/f5733d5e4abc9a637c9f6834a1a29429b4cd57b30a4585f91a/resampy-0.2.2-py3-none-any.whl
Processing /root/.cache/pip/wheels/0a/af/f6/aa7eefaad4a35a4f78adbfa0c2a99c53fda489e48132b037e4/audioread-2.1.8-py3-none-any.whl
Collecting decorator>=3.0.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/ed/1b/72a1821152d07cf1d8b6fce298aeb06a7eb90f4d6d41acec9861e7cc6df0/decorator-4.4.2-py2.py3-none-any.whl (9.2 kB)
Collecting numba>=0.43.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a6/91/3af4fcbe6f9c05f5d04d08b955f635fc9e3388b751a7f0af18e71809e10a/numba-0.48.0-cp37-cp37m-manylinux1_x86_64.whl (2.5 MB)
Collecting soundfile>=0.9.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/eb/f2/3cbbbf3b96fb9fa91582c438b574cff3f45b29c772f94c400e2c99ef5db9/SoundFile-0.10.3.post1-py2.py3-none-any.whl (21 kB)
Collecting scikit-learn!=0.19.0,>=0.14.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/41/b6/126263db075fbcc79107749f906ec1c7639f69d2d017807c6574792e517e/scikit_learn-0.22.2.post1-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting scipy>=1.0.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/dd/82/c1fe128f3526b128cfd185580ba40d01371c5d299fcf7f77968e22dfcc2e/scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1 MB)
Collecting joblib>=0.12
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/28/5c/cf6a2b65a321c4a209efcdf64c2689efae2cb62661f8f6f4bb28547cf1bf/joblib-0.14.1-py2.py3-none-any.whl (294 kB)
Collecting pytz>=2017.2
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl (509 kB)
Collecting python-dateutil>=2.6.1
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Requirement already satisfied: h5py in /root/luxy/venv_athena/lib/python3.7/site-packages (from keras-applications>=1.0.8->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.10.0)
Requirement already satisfied: setuptools in /root/luxy/venv_athena/lib/python3.7/site-packages (from protobuf>=3.6.1->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (40.8.0)
Requirement already satisfied: requests<3,>=2.21.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.23.0)
Requirement already satisfied: google-auth<2,>=1.6.3 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.13.1)
Requirement already satisfied: markdown>=2.6.8 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.2.1)
Requirement already satisfied: werkzeug>=0.11.15 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.0.1)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.4.1)
Requirement already satisfied: lazy-object-proxy==1.4.* in /root/luxy/venv_athena/lib/python3.7/site-packages (from astroid<2.4,>=2.3.0->pylint->-r requirements.txt (line 5)) (1.4.3)
Requirement already satisfied: typed-ast<1.5,>=1.4.0; implementation_name == "cpython" and python_version < "3.8" in /root/luxy/venv_athena/lib/python3.7/site-packages (from astroid<2.4,>=2.3.0->pylint->-r requirements.txt (line 5)) (1.4.1)
Requirement already satisfied: pycparser in /root/luxy/venv_athena/lib/python3.7/site-packages (from cffi>=1.4.0->horovod->-r requirements.txt (line 7)) (2.20)
Collecting llvmlite<0.32.0,>=0.31.0dev0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a0/10/d02c0ac683fc47ecda3426249509cf771d748b6a2c0e9d5ebbee76a7b80a/llvmlite-0.31.0-cp37-cp37m-manylinux1_x86_64.whl (20.2 MB)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.25.8)
Requirement already satisfied: idna<3,>=2.5 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.9)
Requirement already satisfied: certifi>=2017.4.17 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2020.4.5.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.2.8)
Requirement already satisfied: rsa<4.1,>=3.1.4 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (4.0)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (4.0.0)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.3.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.1.0)
Building wheels for collected packages: horovod, kenlm
Building wheel for horovod (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /root/luxy/venv_athena/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-j394y9lx/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-j394y9lx/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-f6z4f15k
cwd: /tmp/pip-install-j394y9lx/horovod/
Complete output (190 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/horovod
copying horovod/__init__.py -> build/lib.linux-x86_64-3.7/horovod
creating build/lib.linux-x86_64-3.7/horovod/mxnet
copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
copying horovod/mxnet/__init__.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
creating build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
creating build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/util.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/basics.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/__init__.py -> build/lib.linux-x86_64-3.7/horovod/common
creating build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/task_fn.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/run.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/gloo_run.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/mpi_run.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/run_task.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/__init__.py -> build/lib.linux-x86_64-3.7/horovod/run
creating build/lib.linux-x86_64-3.7/horovod/spark
copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark
creating build/lib.linux-x86_64-3.7/horovod/_keras
copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/_keras
copying horovod/_keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/_keras
creating build/lib.linux-x86_64-3.7/horovod/keras
copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/keras
copying horovod/keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/keras
creating build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/__init__.py -> build/lib.linux-x86_64-3.7/horovod/torch
creating build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
creating build/lib.linux-x86_64-3.7/horovod/run/task
copying horovod/run/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/run/task
copying horovod/run/task/__init__.py -> build/lib.linux-x86_64-3.7/horovod/run/task
creating build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/cache.py -> build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/network.py -> build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/threads.py -> build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/__init__.py -> build/lib.linux-x86_64-3.7/horovod/run/util
creating build/lib.linux-x86_64-3.7/horovod/run/common
copying horovod/run/common/__init__.py -> build/lib.linux-x86_64-3.7/horovod/run/common
creating build/lib.linux-x86_64-3.7/horovod/run/http
copying horovod/run/http/http_server.py -> build/lib.linux-x86_64-3.7/horovod/run/http
copying horovod/run/http/__init__.py -> build/lib.linux-x86_64-3.7/horovod/run/http
copying horovod/run/http/http_client.py -> build/lib.linux-x86_64-3.7/horovod/run/http
creating build/lib.linux-x86_64-3.7/horovod/run/driver
copying horovod/run/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/run/driver
copying horovod/run/driver/__init__.py -> build/lib.linux-x86_64-3.7/horovod/run/driver
creating build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/network.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/timeout.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/config_parser.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/codec.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/host_hash.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/env.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/secret.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/settings.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/init.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
creating build/lib.linux-x86_64-3.7/horovod/run/common/service
copying horovod/run/common/service/task_service.py -> build/lib.linux-x86_64-3.7/horovod/run/common/service
copying horovod/run/common/service/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/run/common/service
copying horovod/run/common/service/init.py -> build/lib.linux-x86_64-3.7/horovod/run/common/service
creating build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
creating build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
creating build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
creating build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
creating build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/init.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/init.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
running build_ext
gcc -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/include -fPIC -std=c++11 -fPIC -O2 -Wall -fassociative-math -ffast-math -ftree-vectorize -funsafe-math-optimizations -mf16c -mavx -mfma -I/root/luxy/venv_athena/include -I/root/anaconda3/include/python3.7m -c build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.cc -o build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /root/anaconda3/compiler_compat -L/root/anaconda3/lib -Wl,-rpath=/root/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ -Wl,-rpath,/lib -L/lib -fPIC -I/include build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.o -o build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.so
gcc -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/include -fPIC -I/root/luxy/venv_athena/include -I/root/anaconda3/include/python3.7m -c build/temp.linux-x86_64-3.7/test_compile/test_link_flags.cc -o build/temp.linux-x86_64-3.7/test_compile/test_link_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /root/anaconda3/compiler_compat -L/root/anaconda3/lib -Wl,-rpath=/root/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ -Wl,-rpath,/lib -L/lib -fPIC -I/include -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.7/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.7/test_compile/test_link_flags.so
INFO: Cannot find CMake, will skip compiling Horovod with Gloo.
Traceback (most recent call last):
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
File "/root/anaconda3/lib/python3.7/subprocess.py", line 395, in check_output
**kwargs).stdout
File "/root/anaconda3/lib/python3.7/subprocess.py", line 472, in run
with Popen(*popenargs, **kwargs) as process:
File "/root/anaconda3/lib/python3.7/subprocess.py", line 775, in init
restore_signals, start_new_session)
File "/root/anaconda3/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 622, in get_common_options
mpi_flags = get_mpi_flags()
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 354, in get_mpi_flags
'%s' % (show_command, traceback.format_exc()))
distutils.errors.DistutilsPlatformError: mpicxx -show failed (see error below), is MPI in $PATH?
Note: If your version of MPI has a custom command to show compilation flags, please specify it with the HOROVOD_MPICXX_SHOW environment variable.
Traceback (most recent call last):
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
File "/root/anaconda3/lib/python3.7/subprocess.py", line 395, in check_output
**kwargs).stdout
File "/root/anaconda3/lib/python3.7/subprocess.py", line 472, in run
with Popen(*popenargs, **kwargs) as process:
File "/root/anaconda3/lib/python3.7/subprocess.py", line 775, in init
restore_signals, start_new_session)
File "/root/anaconda3/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'
ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Building wheel for kenlm (setup.py) ... \
How can I solve this?
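The traceback above fails at `mpicxx -show` because `mpicxx` is not on `PATH`, and the log also reports that CMake was not found. A minimal sketch for checking this before retrying the install; the tool list is an assumption taken from this log, and `missing_tools` is a hypothetical helper, not part of Horovod or Athena:

```python
import shutil

# Tools the build log above looked for; this list is an assumption drawn from that log.
REQUIRED_TOOLS = ("mpicxx", "cmake")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed build tools were found.")
```

If `mpicxx` is reported missing, installing an MPI distribution (or adding its `bin` directory to `PATH`) is the likely next step, since the error message itself asks whether MPI is in `$PATH`.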
Decoding is currently far too slow: it takes almost 5 seconds to decode a single utterance. I see there is an open pull request for decoding optimization. When will it be merged into the master branch?
We need scripts to validate things such as the data directory structure. Otherwise we won't know if a certain step fails. I'll assign this to myself for now, but it may take some time before I get back to it.
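One possible shape for such a check, as a rough sketch only: the helper name `missing_paths` and the example file list are hypothetical, and each recipe would need its own list of expected outputs.

```python
from pathlib import Path


def missing_paths(root, expected):
    """Return the relative paths in `expected` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Hypothetical layout; replace with the files a given recipe actually produces.
    expected = ["train.csv", "dev.csv", "vocab", "cmvn"]
    missing = missing_paths("examples/asr/hkust/data", expected)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Data directory looks complete.")
```

Running a check like this at the end of each preparation stage would make a failed step visible immediately rather than at training time.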
I had a discussion with @tjadamlee. We think the GPU memory usage of our default examples is way too high, which is blocking a lot of users. We should tune the parameters (e.g., batch size) as well as the model structures to keep memory usage under 8G for the default example setups.
@Some-random could you please take the lead and cut all examples' memory usage to under 8G, while preserving as much performance as possible?
@Some-random what is your number on the updated mtl_transformer_sp.json? It seems it was a couple of percent (2%?) worse than the original configuration before #19?
When I tried to run the aishell run.sh in the example directory on 1 GPU, it showed the error below.
RuntimeError: Gloo job detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 1
Exit code: 1
Hello, genius developer
I am a Shenlan student. I installed Athena according to the instructions. No errors were reported along the way, and I used the following code to verify that translation model training works correctly.
However, when I train the ASR model in the examples/asr/aishell_sub/ directory and reach the fine-tuning stage, an error is reported.
Traceback (most recent call last):
File "athena/main.py", line 172, in
train(json_file, BaseSolver, 1, 0)
File "athena/main.py", line 117, in train
p, model, optimizer, checkpointer = build_model_from_jsonfile(jsonfile)
File "athena/main.py", line 105, in build_model_from_jsonfile
solver.evaluate_step(model.prepare_samples(iter(dataset).next()))
File "/home/hanzl/work/learn/athena/athena/solver.py", line 95, in evaluate_step
logits = self.model(samples, training=False)
...
...
File "/home/hanzl/work/learn/athena/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]
The message tells me to look at the log output above, where I found the following:
2020-04-06 14:48:17.829495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10283 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:b1:00.0, compute capability: 7.5)
INFO:absl:trying to restore from : examples/asr/aishell/ckpts/mtl_transformer_ctc/
2020-04-06 14:48:21.230124: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-06 14:48:22.942435: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.5.0 but source was compiled with: 7.6.0. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Can anyone tell me why this happens and how I should solve it? I look forward to your reply, thank you very much!
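The log line above states the rule itself: the loaded cuDNN major version must match the compiled one, and the loaded minor version must not be lower. A small sketch that extracts both versions from such a line and applies that rule; `cudnn_mismatch` is a hypothetical helper, and the regular expression assumes the exact wording shown in this log:

```python
import re

# Matches the wording of the TensorFlow message quoted in the log above.
_PATTERN = re.compile(
    r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
    r"but source was compiled with: (\d+)\.(\d+)\.(\d+)"
)


def cudnn_mismatch(log_line):
    """Return (loaded, compiled) version tuples if the line reports an
    incompatible cuDNN runtime, or None otherwise."""
    match = _PATTERN.search(log_line)
    if match is None:
        return None
    numbers = tuple(int(group) for group in match.groups())
    loaded, compiled = numbers[:3], numbers[3:]
    # Rule taken from the message: same major, runtime minor not lower.
    compatible = loaded[0] == compiled[0] and loaded[1] >= compiled[1]
    return None if compatible else (loaded, compiled)
```

For the line in this report the loaded runtime is 7.5.0 and the build expects 7.6.0, so the message's own advice applies: upgrade the installed cuDNN library to at least the version the binary was compiled with.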
Currently users have to redirect output manually to save logs, which is easy to forget, and the log is needed to extract the n-best checkpoints during decoding. A mechanism to save logs automatically may therefore be needed.
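As an illustration of the idea, here is a sketch using Python's standard `logging` module to write every record to both the console and a file. The logger name and the `setup_logging` helper are hypothetical; the traces elsewhere on this page show absl logging in use, so an actual change would need to hook into that instead.

```python
import logging


def setup_logging(log_file, level=logging.INFO):
    """Send log records to both the console and `log_file`."""
    logger = logging.getLogger("athena_example")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

With something like this in the training entry point, the log file needed for selecting checkpoints would exist even when the user forgets to redirect output.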
Update the aishell examples.
"dataset_builder": "language_dataset",
"num_data_threads": 1,
"trainset_config":{
"data_csv":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/librispeech/data/train-speaker-id.trans.csv",
"input_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"},
"output_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"}
},
"devset_config":{
"data_csv":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/librispeech/data/test-clean-speaker-id.trans.csv",
"input_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"},
"output_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"}
}
}
The Python script examples/asr/aishell/local/prepare_data.py cannot run when I run it in an environment with a Chinese encoding.
So I suggest adding the following lines to avoid the problem:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Here's the mtl_transformer.json for the fine-tuning stage. I set num_data_threads to 32, and training hangs at the end of the first epoch.
horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/asr/hkust/configs/mtl_transformer.json
{
"batch_size":24,
"num_epochs":50,
"sorta_epoch":1,
"ckpt":"examples/asr/hkust/ckpts/mtl_transformer_ctc/",
"summary_dir":"examples/asr/hkust/ckpts/mtl_transformer_ctc/event","solver_gpu":[0],
"solver_config":{
"clip_norm":100,
"log_interval":10,
"enable_tf_function":true
},"model":"mtl_transformer_ctc",
"num_classes": null,
"pretrained_model": "examples/asr/hkust/configs/mpc.json",
"model_config":{
"model":"speech_transformer",
"model_config":{
"return_encoder_output":true,
"num_filters":512,
"d_model":512,
"num_heads":8,
"num_encoder_layers":12,
"num_decoder_layers":6,
"dff":1280,
"rate":0.1,
"label_smoothing_rate":0.0,
"schedual_sampling_rate":0.9
},
"mtl_weight":0.5
},"decode_config":{
"beam_search":true,
"beam_size":10,
"ctc_weight":0.5,
"lm_type":"ngram",
"lm_weight":0.3,
"lm_path":"examples/asr/hkust/data/5gram.arpa"
},"optimizer":"warmup_adam",
"optimizer_config":{
"d_model":512,
"warmup_steps":25000,
"k":1.0
},"dataset_builder": "speech_recognition_dataset",
"num_data_threads": 12,
"trainset_config":{
"data_csv": "examples/asr/hkust/data/train.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/hkust/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"},
"input_length_range":[10, 8000]
},
"devset_config":{
"data_csv": "examples/asr/hkust/data/dev.mini.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/hkust/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"},
"input_length_range":[10, 8000]
},
"testset_config":{
"data_csv": "examples/asr/hkust/data/dev.mini.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/hkust/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"}
}
}
Here's part of the strace log of one of the processes:
restart_syscall(<... resuming interrupted futex ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=448417000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=453560000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=458726000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=463892000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=469063000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=474251000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=479418000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=484583000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=489748000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=494910000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa745ac, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=500123000}, 0xffffffff) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=500123000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=505351000}, 0xffffffff) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=505351000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=510428000}, 0xffffffff) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=510428000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=515504000}, 0xffffffff) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=515504000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=520612000}, 0xffffffff) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=520612000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=525693000}, 0xffffffff) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=525693000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=530785000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=535930000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=541074000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=546212000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=551297000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=556410000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=561516000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=566631000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=571733000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=576836000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=581946000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=587075000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=592220000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=597350000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=602524000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=607693000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=612850000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=617997000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=623167000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=628321000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=633475000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=638630000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=643783000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=648937000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=654094000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=659251000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=664403000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=669553000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=674704000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=679851000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=684998000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=690118000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=695268000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=700402000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=705515000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=710628000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=715747000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=720839000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=725960000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=731070000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=736199000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=741361000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=746488000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=751589000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=756712000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=761827000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=766948000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=772084000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
Hi,
When I ran the decoding stage, I got the following error message:
<<<
Traceback (most recent call last):
File "athena/decode_main.py", line 87, in
decode(jsonfile, n=5, log_file='nohup.out')
File "athena/decode_main.py", line 65, in decode
v = tf.reduce_mean(tf.concat(v,axis=0),axis=0)
File "/8T_raid/user/venv_athena/lib/python3.5/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/8T_raid/user/venv_athena/lib/python3.5/site-packages/tensorflow_core/python/ops/array_ops.py", line 1431, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/8T_raid/user/venv_athena/lib/python3.5/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1249, in concat_v2
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in
NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "GPU" constraint { name: "T" allowed_values
{ list { type: DT_INT32 } } } host_memory_arg: "values" host_memory_arg: "axis" host_memory_arg: "output"' [Op:ConcatV2] name: concat
When I ran import horovod.tensorflow, the following error occurred:
Traceback (most recent call last):
File "", line 1, in
File "/adddisk/zhangjin/projects/horovod/horovod/tensorflow/init.py", line 25, in
check_extension('horovod.tensorflow', 'HOROVOD_WITH_TENSORFLOW', file, 'mpi_lib')
File "/adddisk/zhangjin/projects/horovod/horovod/common/util.py", line 51, in check_extension
'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.tensorflow has not been built. If this is not expected, reinstall Horovod with HOROVOD_WITH_TENSORFLOW=1 to debug the build error.
When I run the aishell decoding scripts independently, I get the error message below:
......
None
best_wer_checkpoint:
[]
Traceback (most recent call last):
File "athena/decode_main.py", line 87, in
decode(jsonfile, n=5, log_file='nohup.out')
File "athena/decode_main.py", line 65, in decode
v = tf.reduce_mean(tf.concat(v,axis=0),axis=0)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 1517, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1118, in concat_v2
_ops.raise_from_not_ok_status(e, name)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "GPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "values" host_memory_arg: "axis" host_memory_arg: "output"' [Op:ConcatV2] name: concat
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
......
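The `N=0` in the error and the empty `best_wer_checkpoint` list printed just before it suggest that `tf.concat` was handed an empty list, which would happen if no checkpoints were selected from the log file. The sketch below shows the kind of guard that would turn this into a readable error. It is plain Python rather than TensorFlow, `average_checkpoints` is a hypothetical name, and it is not Athena's actual implementation:

```python
def average_checkpoints(values_per_checkpoint):
    """Average one variable across checkpoints, element by element.

    `values_per_checkpoint` holds one flat list of floats per checkpoint.
    """
    if not values_per_checkpoint:
        # Fail early with a readable message instead of an opaque kernel error.
        raise ValueError(
            "No checkpoints to average; check that the training log "
            "passed as log_file exists and contains evaluation results."
        )
    count = len(values_per_checkpoint)
    return [sum(column) / count for column in zip(*values_per_checkpoint)]
```

In both reports the traceback shows decoding being called with `log_file='nohup.out'`, so a first thing to check is whether that file exists in the working directory and actually contains the training log.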
Building some docker environment
@Some-random add other open-source Mandarin datasets
$ python examples/asr/aishell/local/prepare_data.py
Traceback (most recent call last):
File "examples/asr/aishell/local/prepare_data.py", line 25, in <module>
from athena import get_wave_file_length
File "/workspace/users/lpp/source/athena/athena/__init__.py", line 18, in <module>
from .data import SpeechRecognitionDatasetBuilder
File "/workspace/users/lpp/source/athena/athena/data/__init__.py", line 18, in <module>
from .datasets.speech_recognition import SpeechRecognitionDatasetBuilder
File "/workspace/users/lpp/source/athena/athena/data/datasets/speech_recognition.py", line 22, in <module>
from athena.transform import AudioFeaturizer
File "/workspace/users/lpp/source/athena/athena/transform/__init__.py", line 16, in <module>
from athena.transform import audio_featurizer
File "/workspace/users/lpp/source/athena/athena/transform/audio_featurizer.py", line 19, in <module>
from athena.transform import feats
File "/workspace/users/lpp/source/athena/athena/transform/feats/__init__.py", line 16, in <module>
from athena.transform.feats.read_wav import ReadWav
File "/workspace/users/lpp/source/athena/athena/transform/feats/read_wav.py", line 21, in <module>
from athena.transform.feats.ops import py_x_ops
File "/workspace/users/lpp/source/athena/athena/transform/feats/ops/py_x_ops.py", line 28, in <module>
spectrum = gen_x_ops.spectrum
AttributeError: module '5fa89fc3154996733eabb433e18fa62f' has no attribute 'spectrum'
I would appreciate any help.
Wed Apr 1 08:54:43 2020[0]:name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.759
Wed Apr 1 08:54:43 2020[0]:pciBusID: 0000:01:00.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.171558: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.173728: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.175027: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.175832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.178045: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.179579: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.183873: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.183987: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.184595: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.185004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
Wed Apr 1 08:54:43 2020[0]:Traceback (most recent call last):
Wed Apr 1 08:54:43 2020[0]: File "athena/main.py", line 171, in <module>
Wed Apr 1 08:54:43 2020[0]: BaseSolver.initialize_devices(p.solver_gpu)
Wed Apr 1 08:54:43 2020[0]: File "/mnt/3T/mygits/ASR-NLP/FrameWorks/athena/athena/solver.py", line 54, in initialize_devices
Wed Apr 1 08:54:43 2020[0]: assert len(gpus) > len(visible_gpu_idx)
Wed Apr 1 08:54:43 2020[0]:AssertionError
Process 0 exit with status code 1.
Traceback (most recent call last):
File "/home/wcl/anaconda3/bin/horovodrun", line 21, in <module>
run_commandline()
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/run.py", line 876, in run_commandline
_run(args)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/run.py", line 844, in _run
_launch_job(args, remote_host_names, settings, common_intfs, command)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/run.py", line 867, in _launch_job
gloo_run(settings, remote_host_names, common_intfs, env, driver_ip, command)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 287, in gloo_run
_launch_jobs(settings, env, host_alloc_plan, remote_host_names, run_command)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 259, in _launch_jobs
.format(name=name, code=exit_code))
RuntimeError: Gloo job detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
I'm trying to run on a single GPU with Horovod (hvd) turned off.
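The assertion in the traceback above uses a strict `>`, which would fail whenever the number of detected GPUs equals the number of requested indices, for example one GPU with one index requested. Whether that is the configuration in this report is an assumption. A minimal sketch of a check that accepts that case follows; `check_visible_gpus` and its parameters are hypothetical names, and this is not the project's code.

```python
def check_visible_gpus(num_physical_gpus, visible_gpu_idx):
    """Validate requested GPU indices against the number of detected GPUs.

    Hypothetical helper for illustration only.
    """
    if num_physical_gpus == 0:
        # CPU-only machine: there is nothing to select.
        return []
    # Allow equality, so that requesting every detected GPU is accepted.
    if num_physical_gpus < len(visible_gpu_idx):
        raise ValueError(
            "requested %d GPUs but only %d detected"
            % (len(visible_gpu_idx), num_physical_gpus)
        )
    # Every requested index must refer to a detected device.
    for idx in visible_gpu_idx:
        if not 0 <= idx < num_physical_gpus:
            raise ValueError("GPU index %d is out of range" % idx)
    return list(visible_gpu_idx)
```

Raising a `ValueError` with a message, rather than a bare `assert`, also makes the failure easier to read in a multi-process log.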
Quite a lot of read errors were dumped during training. However, they seemed to have no impact on the training itself. Have you ever seen these errors? Is anything wrong with my training set?
....
[1,7]:INFO:absl:perform batch_wise_shuffle with batch_size 16
[1,2]:INFO:absl:please be patient, enable tf.function, it takes time ...
[1,0]:INFO:absl:please be patient, enable tf.function, it takes time ...
[1,7]:INFO:absl:please be patient, enable tf.function, it takes time ...
[1,7]:WARNING:absl:the length of logits is shorter than that of labels
[1,3]:WARNING:absl:the length of logits is shorter than that of labels
[1,0]:WARNING:absl:the length of logits is shorter than that of labels
[1,2]:WARNING:absl:the length of logits is shorter than that of labels
[1,5]:WARNING:absl:the length of logits is shorter than that of labels
[1,6]:WARNING:absl:the length of logits is shorter than that of labels
[1,1]:WARNING:absl:the length of logits is shorter than that of labels
[1,4]:WARNING:absl:the length of logits is shorter than that of labels
[1,7]:WARNING:absl:the length of logits is shorter than that of labels
[1,5]:WARNING:absl:the length of logits is shorter than that of labels
[1,2]:WARNING:absl:the length of logits is shorter than that of labels
[1,3]:WARNING:absl:the length of logits is shorter than that of labels
[1,6]:WARNING:absl:the length of logits is shorter than that of labels
[1,0]:WARNING:absl:the length of logits is shorter than that of labels
[1,4]:WARNING:absl:the length of logits is shorter than that of labels
[1,1]:WARNING:absl:the length of logits is shorter than that of labels
[1,0]:[373f86b536f2:00117] Read -1, expected 5393, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 6345, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 5873, errno = 1
[1,6]:[373f86b536f2:00123] Read -1, expected 5184, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 5184, errno = 1
[1,5]:[373f86b536f2:00122] Read -1, expected 5440, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 4864, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 4928, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 1002048, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 1002048, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 1002048, errno = 1
[1,1]:[373f86b536f2:00118] Read -1, expected 1002048, errno = 1
[1,5]:[373f86b536f2:00122] Read -1, expected 1002048, errno = 1
[1,6]:[373f86b536f2:00123] Read -1, expected 1002048, errno = 1
[1,3]:[373f86b536f2:00120] Read -1, expected 1002048, errno = 1
[1,2]:[373f86b536f2:00119] Read -1, expected 1002048, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 1002048, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 1002048, errno = 1
[1,5]:[373f86b536f2:00122] Read -1, expected 1002048, errno = 1
[1,6]:[373f86b536f2:00123] Read -1, expected 1002048, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 1002048, errno = 1
[1,3]:[373f86b536f2:00120] Read -1, expected 1002048, errno = 1
[1,1]:[373f86b536f2:00118] Read -1, expected 1002048, errno = 1
[1,2]:[373f86b536f2:00119] Read -1, expected 1002048, errno = 1
[1,3]:[373f86b536f2:00120] Read -1, expected 1002048, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 1002048, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 131072, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 20480, errno = 1
[1,0]:INFO:absl:global_steps: 38522 learning_rate: 2.2517e-04 loss: 1.5734 CTCAccuracy: 0.9167 Accuracy: 0.9323
[1,7]:[373f86b536f2:00124] Read -1, expected 5248, errno = 1
...
Start a new branch to integrate athena-decoder into the athena project. @godjealous
Hi, after using our own data (10,000 h) to calculate the CMVN, I find that the var is quite large, and the loss in the MPC stage is NaN from the very beginning. Does anyone have any idea?
speaker mean var
global [41.53281, 50.375763, 53.979485, 55.0042, 55.01829, 55.294025, 55.537567, 55.66456, 55.297874, 54.5779, 54.301113, 54.04989, 53.69498, 53.33427, 53.045273, 52.622414, 52.82048, 53.421753, 53.865902, 53.84243, 53.45809, 52.944252, 53.01819, 53.3373, 53.863102, 54.33686, 54.909252, 55.299534, 55.040947, 54.846294, 54.568024, 53.713165, 52.363804, 49.36468, 45.567574, 45.262226, 44.527786, 45.601566, 45.84235, 45.581955] [-1176.7283, -1758.992, -2028.2766, -2111.5051, -2113.496, -2133.8308, -2152.609, -2164.494, -2136.68, -2079.5215, -2057.8743, -2038.9774, -2011.5557, -1983.8772, -1962.4875, -1930.8735, -1947.2842, -1993.6705, -2028.1543, -2025.6558, -1996.1725, -1956.0061, -1962.5137, -1986.1327, -2026.9241, -2065.188, -2109.8545, -2142.8325, -2122.1377, -2107.376, -2086.5679, -2019.9905, -1916.543, -1694.3958, -1440.5925, -1426.0725, -1364.4426, -1428.7397, -1445.1039, -1433.9104]
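For reference, a variance computed as E[x²] − (E[x])² cannot be negative for real-valued features, so the negative values in the var column point to a problem in how the statistics were accumulated rather than in the data itself. The exact cause is not known from this report. Below is a minimal sketch of global mean and variance accumulation in float64, assuming features arrive as arrays of shape (frames, dim). `global_cmvn` and `feature_chunks` are hypothetical names, not the project's API.

```python
import numpy as np


def global_cmvn(feature_chunks):
    """Compute per-dimension mean and variance over all frames.

    feature_chunks: iterable of arrays with shape (frames, dim)
    (hypothetical input format).
    """
    total = None      # running sum of features per dimension
    total_sq = None   # running sum of squared features per dimension
    count = 0         # total number of frames seen
    for chunk in feature_chunks:
        # Accumulate in float64 to limit rounding error over many frames.
        chunk = np.asarray(chunk, dtype=np.float64)
        if total is None:
            total = np.zeros(chunk.shape[1])
            total_sq = np.zeros(chunk.shape[1])
        total += chunk.sum(axis=0)
        total_sq += (chunk ** 2).sum(axis=0)
        count += chunk.shape[0]
    if count == 0:
        raise ValueError("no frames were provided")
    mean = total / count
    var = total_sq / count - mean ** 2
    # Guard against tiny negative values caused by rounding.
    var = np.maximum(var, 0.0)
    return mean, var
```

If statistics computed this way still differ sharply from the table above, the frame count or the sum of squares used in the original calculation would be the first things to check.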