Lingvo

License: Apache License 2.0

Python 90.98% Dockerfile 0.07% C++ 5.97% Shell 0.48% Jupyter Notebook 0.39% TeX 0.40% Starlark 1.66% C 0.06%
speech-recognition translation speech-to-text machine-translation mnist seq2seq language-model tts asr lm

What is it?

Lingvo is a framework for building neural networks in TensorFlow, particularly sequence models.

A list of publications using Lingvo can be found in PUBLICATIONS.md.

Releases

PyPI Version    Commit
0.12.4          --
0.11.0          6fae10077756f54beacd5c454959f20b33fd65e2
0.10.0          075fd1d88fa6f92681f58a2383264337d0e737ee
0.9.1           c1124c5aa7af13d2dd2b6d43293c8ca6d022b008
0.9.0           f826e99803d1b51dccbbbed1ef857ba48a2bbefe

Older releases

PyPI Version    Commit
0.8.2           93e123c6788e934e6b7b1fd85770371becf1e92e
0.7.2           b05642fe386ee79e0d88aa083565c9a93428519e

Details for older releases are unavailable.

Major breaking changes

NOTE: this is not a comprehensive list. Lingvo releases do not offer any guarantees regarding backwards compatibility.

HEAD

Nothing here.

0.12.0

  • General
    • TensorFlow 2.9 is now required.
    • Python 3.7 support has been removed.
    • Compatible with versions up to TensorFlow 2.10 and Python 3.10.
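
The 0.12.0 constraints above can be checked mechanically before installing. The following is a minimal sketch, not part of Lingvo: the helper names are hypothetical, and the lower Python bound of 3.8 is an assumption inferred from the removal of Python 3.7 support.

```python
# Supported ranges for Lingvo 0.12.0, taken from the release notes above.
MIN_TF = (2, 9)
MAX_TF = (2, 10)
MIN_PY = (3, 8)   # Assumption: 3.7 was removed, so 3.8 is taken as the minimum.
MAX_PY = (3, 10)


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '2.9.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_supported(tf_version, py_version):
    """Check a TensorFlow/Python version pair against the 0.12.0 constraints."""
    tf = parse_major_minor(tf_version)
    py = parse_major_minor(py_version)
    return MIN_TF <= tf <= MAX_TF and MIN_PY <= py <= MAX_PY


print(is_supported("2.9.1", "3.9.7"))   # True
print(is_supported("2.7.0", "3.9.7"))   # False
```

Comparing (major, minor) tuples avoids the pitfalls of comparing version strings lexically, where "2.10" would sort before "2.9".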

0.11.0

  • General
    • TensorFlow 2.7 is now the required version.
    • Python 3.6 support has been removed.

0.10.0

  • General
    • TensorFlow 2.6 is now the required version.
    • The theta_fn arg to CreateVariable() has been removed.

0.9.1

  • General
    • Python 3.9 is now supported.
    • ops.beam_search_step now takes and returns an additional arg beam_done.
    • The namedtuple beam_search_helper.BeamSearchDecodeOutput no longer has the field done_hyps.

0.9.0

  • General
    • TensorFlow 2.5 is now the required version.
    • Python 3.5 support has been removed.
    • py_utils.AddGlobalVN and py_utils.AddPerStepVN have been combined into py_utils.AddVN.
    • BaseSchedule().Value() no longer takes a step arg.
    • Classes deriving from BaseSchedule should implement Value() not FProp().
    • theta.global_step has been removed in favor of py_utils.GetGlobalStep().
    • py_utils.GenerateStepSeedPair() no longer takes a global_step arg.
    • PostTrainingStepUpdate() no longer takes a global_step arg.
    • The fatal_errors argument to custom input ops now takes error message substrings rather than integer error codes.

Older releases

0.8.2

  • General
    • NestedMap Flatten/Pack/Transform/Filter etc. now expand descendant dicts as well.
    • Subclasses of BaseLayer extending from abc.ABCMeta should now extend base_layer.ABCLayerMeta instead.
    • Trying to call self.CreateChild outside of __init__ now raises an error.
    • base_layer.initializer has been removed. Subclasses no longer need to decorate their __init__ function.
    • Trying to call self.CreateVariable outside of __init__ or _CreateLayerVariables now raises an error.
    • It is no longer possible to access self.vars or self.theta inside of __init__. Refactor by moving the variable creation and access to _CreateLayerVariables. The variable scope is set automatically according to the layer name in _CreateLayerVariables.

Details for older releases are unavailable.

Quick start

Installation

There are two ways to set up Lingvo: installing a fixed version through pip, or cloning the repository and building it with bazel. Docker configurations are provided for each case.

If you just want to use the framework as-is, it is easiest to install it through pip. This makes it possible to develop and train custom models using a frozen version of the Lingvo framework. However, it is difficult to modify the framework code or implement new custom ops this way.

If you would like to develop the framework further and potentially contribute pull requests, you should avoid using pip and clone the repository instead.

pip:

The Lingvo pip package can be installed with pip3 install lingvo.

See the codelab for how to get started with the pip package.

From sources:

The prerequisites are:

  • a TensorFlow 2.7 installation,
  • a C++ compiler (only g++ 7.3 is officially supported), and
  • the bazel build system.

Refer to docker/dev.Dockerfile for a set of working requirements.

git clone the repository, then use bazel to build and run targets directly. The python -m module commands in the codelab need to be mapped onto bazel run commands.
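
As an illustration of that mapping, here is a minimal sketch that rewrites a python3 -m command as a bazel run command. It is not a Lingvo tool: the function names are hypothetical, and the rule "module path becomes //package:target, with -c opt" is inferred only from the command pairs shown in this README.

```python
import shlex


def module_to_bazel_target(module):
    """Turn 'lingvo.tools.keras2ckpt' into '//lingvo/tools:keras2ckpt'."""
    *package, name = module.split(".")
    return "//" + "/".join(package) + ":" + name


def to_bazel_run(command):
    """Rewrite a 'python3 -m <module> <flags>' command as a 'bazel run' command."""
    parts = shlex.split(command)
    if len(parts) < 3 or parts[1] != "-m":
        raise ValueError("expected a command of the form 'python3 -m <module> ...'")
    target = module_to_bazel_target(parts[2])
    flags = parts[3:]
    result = ["bazel", "run", "-c", "opt", target]
    if flags:
        # Everything after '--' is passed to the program rather than to bazel.
        result += ["--"] + flags
    return " ".join(shlex.quote(p) for p in result)


print(to_bazel_run("python3 -m lingvo.tools.keras2ckpt --dataset=mnist"))
# bazel run -c opt //lingvo/tools:keras2ckpt -- --dataset=mnist
```

The sketch only translates the module name. Flag values can differ between the two setups: the MNIST commands in this README use --model=mnist.LeNet5 with pip but --model=image.mnist.LeNet5 with bazel, and that still has to be adjusted by hand.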

docker:

Docker configurations are available for both situations. Instructions can be found in the comments on the top of each file.

How to install docker.

Running the MNIST image model

Preparing the input data

pip:

mkdir -p /tmp/mnist
python3 -m lingvo.tools.keras2ckpt --dataset=mnist

bazel:

mkdir -p /tmp/mnist
bazel run -c opt //lingvo/tools:keras2ckpt -- --dataset=mnist

The following files will be created in /tmp/mnist:

  • mnist.data-00000-of-00001: 53MB.
  • mnist.index: 241 bytes.

Running the model

pip:

cd /tmp/mnist
curl -O https://raw.githubusercontent.com/tensorflow/lingvo/master/lingvo/tasks/image/params/mnist.py
python3 -m lingvo.trainer --run_locally=cpu --mode=sync --model=mnist.LeNet5 --logdir=/tmp/mnist/log

bazel:

(cpu) bazel build -c opt //lingvo:trainer
(gpu) bazel build -c opt --config=cuda //lingvo:trainer
bazel-bin/lingvo/trainer --run_locally=cpu --mode=sync --model=image.mnist.LeNet5 --logdir=/tmp/mnist/log --logtostderr

After about 20 seconds, the loss should drop below 0.3 and a checkpoint will be saved, like below. Kill the trainer with Ctrl+C.

trainer.py:518] step:   205, steps/sec: 11.64 ... loss:0.25747201 ...
checkpointer.py:115] Save checkpoint
checkpointer.py:117] Save checkpoint done: /tmp/mnist/log/train/ckpt-00000205
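
To check the loss threshold without watching the log by eye, the step and loss can be pulled out of a trainer log line. This is a minimal sketch that assumes the line layout shown above; the regular expression and function name are illustrative and not part of Lingvo.

```python
import re

# Matches lines such as:
#   trainer.py:518] step:   205, steps/sec: 11.64 ... loss:0.25747201 ...
STEP_LOSS = re.compile(
    r"step:\s*(\d+),.*?\bloss:\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"
)


def parse_step_and_loss(line):
    """Return (step, loss) from a trainer log line, or None if absent."""
    match = STEP_LOSS.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


line = "trainer.py:518] step:   205, steps/sec: 11.64 ... loss:0.25747201 ..."
step, loss = parse_step_and_loss(line)
print(step, loss, loss < 0.3)  # 205 0.25747201 True
```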

Some artifacts will be produced in /tmp/mnist/log/control:

  • params.txt: hyper-parameters.
  • model_analysis.txt: model sizes for each layer.
  • train.pbtxt: the training tf.GraphDef.
  • events.*: a tensorboard events file.

As well as in /tmp/mnist/log/train:

  • checkpoint: a text file containing information about the checkpoint files.
  • ckpt-*: the checkpoint files.
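
A quick way to see how far training has progressed is to read the step number out of the ckpt-* file names. The sketch below assumes the naming seen in the log above (ckpt-00000205 followed by an optional suffix); the helper is hypothetical and does not read the checkpoint text file.

```python
import re
from pathlib import Path

# 'ckpt-' followed by a zero-padded step number, e.g. ckpt-00000205.index
CKPT_NAME = re.compile(r"^ckpt-(\d+)")


def latest_checkpoint_step(train_dir):
    """Return the highest step among ckpt-* files in train_dir, or None."""
    directory = Path(train_dir)
    if not directory.is_dir():
        return None
    steps = []
    for path in directory.glob("ckpt-*"):
        match = CKPT_NAME.match(path.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# Prints None if the directory does not exist or holds no checkpoints yet.
print(latest_checkpoint_step("/tmp/mnist/log/train"))
```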

Now, let's evaluate the model on the "Test" dataset. In the normal training setup the trainer and evaler should be run at the same time as two separate processes.

pip:

python3 -m lingvo.trainer --job=evaler_test --run_locally=cpu --mode=sync --model=mnist.LeNet5 --logdir=/tmp/mnist/log

bazel:

bazel-bin/lingvo/trainer --job=evaler_test --run_locally=cpu --mode=sync --model=image.mnist.LeNet5 --logdir=/tmp/mnist/log --logtostderr

Kill the job with Ctrl+C when it starts waiting for a new checkpoint.

base_runner.py:177] No new check point is found: /tmp/mnist/log/train/ckpt-00000205

The evaluation accuracy can be found slightly earlier in the logs.

base_runner.py:111] eval_test: step:   205, acc5: 0.99775392, accuracy: 0.94150388, ..., loss: 0.20770954, ...

Running the machine translation model

To run a more elaborate model, you'll need a cluster with GPUs. Please refer to third_party/py/lingvo/tasks/mt/README.md for more information.

Running the GShard transformer based giant language model

To train a GShard language model with one trillion parameters on GCP using CloudTPUs v3-512 with 512-way model parallelism, please refer to third_party/py/lingvo/tasks/lm/README.md for more information.

Running the 3d object detection model

To run the StarNet model using CloudTPUs on GCP, please refer to third_party/py/lingvo/tasks/car/README.md.

Models

Automatic Speech Recognition

Car

Image

Language Modelling

Machine Translation

References

Please cite this paper when referencing Lingvo.

@misc{shen2019lingvo,
    title={Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling},
    author={Jonathan Shen and Patrick Nguyen and Yonghui Wu and Zhifeng Chen and others},
    year={2019},
    eprint={1902.08295},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

License

Apache License 2.0

lingvo's People

Contributors

aaroey, anmolgulati, bcaine, bignamehyp, chaiko, descrip, ds-hwang, freeblee, jngiam, jonathanasdf, laurentes, lingvo-bot, lnzsy, mrhuke, odashi, phoenix-meadowlark, protoget, rohan-anil, ronw, rprabhavalkar, rsuderman, tsainath, ukoxyz, weihan3, will-cromar, yqwangustc, yzhang87, zffchen78, zh794390558, zhangqiaorjc

lingvo's Issues

local build error

ERROR: error loading package 'lingvo': Encountered error while reading extension file 'subpar.bzl': no such package '@subpar//': Traceback (most recent call last):
        File "/home/luban/.cache/bazel/_bazel_luban/b5ef85f1c360696308ba7ab9000cfd03/external/bazel_tools/tools/build_defs/repo/git.bzl", line 166
                _clone_or_update(ctx)
        File "/home/luban/.cache/bazel/_bazel_luban/b5ef85f1c360696308ba7ab9000cfd03/external/bazel_tools/tools/build_defs/repo/git.bzl", line 72, in _clone_or_update
                fail(("error cloning %s:\n%s" % (ctx....)))
error cloning subpar:

Issue with lm.one_billion_wds.OneBWdsGPipeTransformer

Hi, I am trying to run the above-mentioned model in Docker. I hit an error when I ran the following command:
Command: bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4
I have a 4-GPU system, so I am using --worker_split_size=4. When I asked how to try out GPipe in issue #48, I was given this command and was also asked to modify the OneBWdsGPipeTransformer hparams. I haven't changed the hparams yet; is the following error caused by that? If I need to change something, could you tell me which hparams to change? I am also posting the error log below:

Error log:

err.txt

training punctuator in notebook crashed

bazel run -c opt //lingvo:trainer -- --logtostderr --model=punctuator.codelab.RNMTModel --mode=sync --logdir=/tmp/punctuator --saver_max_to_keep=2 --run_locally=cpu --enable_asserts=false

Running it from the command line or from the notebook gives the same problem, as follows:
...
...
I0316 18:10:57.939176 139757224118016 base_runner.py:115] step: 0
W0316 18:11:00.613759 139757240633088 meta_graph.py:447] Issue encountered when serializing __batch_norm_update_dict.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'dict' object has no attribute 'name'
W0316 18:11:00.614701 139757240633088 meta_graph.py:447] Issue encountered when serializing __model_split_id_stack.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'list' object has no attribute 'name'
I0316 18:11:01.184139 139757240633088 trainer.py:270] Save checkpoint done: /tmp/punctuator/train/ckpt-00000000
I0316 18:11:01.189898 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
I0316 18:11:11.201571 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
2019-03-16 18:11:18.382848: I ./lingvo/core/ops/input_common.h:63] Create RecordProcessor
2019-03-16 18:11:18.551229: I lingvo/core/ops/input_common.cc:28] Input source weights are empty, fall back to legacy behavior.
2019-03-16 18:11:18.551383: I lingvo/core/ops/record_yielder.cc:167] 0x7f1bb518f940 Record yielder start
2019-03-16 18:11:18.551424: I lingvo/core/ops/record_yielder.cc:169] Randomly seed RecordYielder.
2019-03-16 18:11:18.551447: I ./lingvo/core/ops/input_common.h:68] Create batcher
2019-03-16 18:11:18.551517: I lingvo/core/ops/record_yielder.cc:217] Epoch 1 /tmp/punctuator_data/train.txt
I0316 18:11:24.267724 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
I0316 18:11:31.178570 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
I0316 18:11:41.254617 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
Killed

how to use bazel run librispeech scripts?

I am not familiar with bazel, so I don't know how to run "librispeech.03.parameterize_train.sh", "librispeech.04.parameterize_devtest.sh", and the ASR task. I tried running "librispeech.03.parameterize_train.sh" with sh:

root@8f7d00c977c0:/tmp/lingvo# ls
CONTRIBUTING.md  README.md  bazel-genfiles  bazel-testlogs  docs            tf_env_collect.sh
LICENSE          WORKSPACE  bazel-lingvo    codelabs        experiments.md
PUBLICATIONS.md  bazel-bin  bazel-out       docker          lingvo
root@8f7d00c977c0:/tmp/lingvo# sh ./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh
=== First pass, collecting transcripts: train-clean-100
./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh: 31: ./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh: bazel-bin/lingvo/tools/create_asr_features: not found
root@8f7d00c977c0:/tmp/lingvo# ls
CONTRIBUTING.md  README.md  bazel-genfiles  bazel-testlogs  docs            tf_env_collect.sh
LICENSE          WORKSPACE  bazel-lingvo    codelabs        experiments.md
PUBLICATIONS.md  bazel-bin  bazel-out       docker          lingvo
root@8f7d00c977c0:/tmp/lingvo# exit

This error occurs:

=== First pass, collecting transcripts: train-clean-100
./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh: 31: ./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh: bazel-bin/lingvo/tools/create_asr_features: not found

So I think it must be run with bazel, but I don't know how. Could you please share step-by-step commands to run "librispeech.03.parameterize_train.sh", "librispeech.04.parameterize_devtest.sh", and the ASR task?
Thanks!

Did I do something wrong?

Hi. I am attempting to reproduce the ASR Librispeech task using Lingvo.

My hardware consists of 16 GPUs (a cluster of machines with four 1080 Ti GPUs each), and the storage is shared over NFS.

I changed the batch size from 96,48 to 32 (because of OOM).

I have been training the Librispeech 960 grapheme baseline for 5 days.
(Variational noise is currently turned off.)

I read in your report that training takes about 11 days, but that does not look achievable in my case.

image

image

image

After about 5 days it is still under 40k steps, and the WER stays at about 11%.

Is this a normal speed for my cluster, or do I have a problem with the network or something else?

Thanks for your insight.

How can I set params like max_steps?

When I ran the MNIST example, I found that the hyperparameters are set to default values, as follows:

task.train.max_steps : 4000000

Where can I set these params? Do I have to directly modify lingvo/tasks/image/params/mnist.py, or some other code?

Protocol Buffer Error

ERROR: lingvo/lingvo/core/ops/BUILD:288:1: C++ compilation of rule '//lingvo/core/ops:hyps_proto' failed (Exit 1)
In file included from bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.cc:4:0:
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
 #error This file was generated by a newer version of protoc which is
  ^~~~~
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
 #error incompatible with your Protocol Buffer headers.  Please update
  ^~~~~
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:14:2: error: #error your headers.
 #error your headers.
  ^~~~~
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 46.913s, Critical Path: 43.27s
INFO: 16 processes: 16 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
~/.cache/bazel/_bazel_fanlu/8a038f2e6f0570d13154f8149f8b0be3/external/protobuf_protoc/bin/protoc --version
libprotoc 3.6.1

Undefined reference to google::protobuf::FileDescriptor::DebugString()

I just installed Lingvo and it looks like I am facing an issue with protobuf linking. The nightly TensorFlow build seems to be up to date:

mironov@70e0b410070b:~/lingvo$ python -c "import tensorflow as tf;print(tf.__version__)"
1.14.1-dev20190305

The exact error from bazel build is

mironov@70e0b410070b:~/lingvo$ bazel build -c opt //lingvo:trainer
INFO: Analysed target //lingvo:trainer (22 packages loaded).
INFO: Found 1 target...
ERROR: /workspace/lingvo/lingvo/tools/BUILD:98:1: Linking of rule '//lingvo/tools:generate_proto_def' failed (Exit 1)
bazel-out/host/bin/lingvo/tools/_objs/generate_proto_def/generate_proto_def.o:generate_proto_def.cc:function (anonymous namespace)::WriteDotProto(google::protobuf::FileDescriptor const*, char const*): error: undefined reference to 'google::protobuf::FileDescriptor::DebugString() const'
collect2: error: ld returned 1 exit status
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 3.746s, Critical Path: 2.20s
INFO: 3 processes: 3 processwrapper-sandbox.
FAILED: Build did NOT complete successfully

Could you please check?

bazel build error

DEBUG: Rule 'subpar' modified arguments {"commit": "07ff5feb7c7b113eea593eb6ec50b51099cf0261", "shallow_since": "1524766240 -0700"} and dropped ["tag"]
ERROR: /home/sck/gitRepo/lingvo/lingvo/core/ops/BUILD:277:1: no such package '@tensorflow_includes//': Traceback (most recent call last):
File "/home/sck/gitRepo/lingvo/lingvo/repo.bzl", line 70
_find_tf_include_path(repo_ctx)
File "/home/sck/gitRepo/lingvo/lingvo/repo.bzl", line 16, in _find_tf_include_path
fail("Could not locate tensorflow ins...")
Could not locate tensorflow installation path. and referenced by '//lingvo/core/ops:tokenizer_ops_kernels'
ERROR: Analysis of target '//lingvo:trainer' failed; build aborted: no such package '@tensorflow_includes//': Traceback (most recent call last):
File "/home/sck/gitRepo/lingvo/lingvo/repo.bzl", line 70
_find_tf_include_path(repo_ctx)
File "/home/sck/gitRepo/lingvo/lingvo/repo.bzl", line 16, in _find_tf_include_path
fail("Could not locate tensorflow ins...")
Could not locate tensorflow installation path.

I am using a conda virtual env and I have installed TensorFlow in it. How can I make the build locate the TensorFlow installation path? Thanks.
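
One thing worth checking in a situation like this is whether the Python interpreter on the PATH is the one from the conda env, and whether it can see TensorFlow at all. The sketch below only reports what the current interpreter finds. It is an illustrative diagnostic with a hypothetical helper name, and it rests on the unconfirmed assumption that the build looks TensorFlow up through the active Python.

```python
import importlib.util
import os
import sys


def find_package_dir(name):
    """Return the install directory of an importable package, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# Run this with the same 'python' that is on the PATH when bazel is invoked.
print("interpreter:", sys.executable)
print("tensorflow :", find_package_dir("tensorflow"))
```

If the second line prints None, or the interpreter path is outside the conda env, activating the env in the shell that runs bazel is a reasonable first thing to try.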

Can the ASR task run correctly on GPU?

I use tf-nightly-gpu==1.13.0-dev20181116, but when I run the ASR task, I get the error below.

bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/librispeech/log --logtostderr
INFO:tensorflow:Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
	 [[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 401, in Start
    self._RunLoop('trainer', self._Loop)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 173, in _RunLoop
    loop_func(*args)
Traceback for above exception (most recent call last):
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 416, in _WaitTillInit
    global_step = sess.run(self._model.global_step)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
Waiting for 0.11 seconds before retrying.

Error when building docker

It failed while installing kiwisolver.

  Building wheel for kiwisolver (setup.py): started
  Building wheel for kiwisolver (setup.py): finished with status 'error'
  ERROR: Complete output from command /usr/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-install-AG6Mos/kiwisolver/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-OM6uPW --python-tag cp27:
  ERROR: running bdist_wheel
  running build
  running build_ext
  building 'kiwisolver' extension
  creating build
  creating build/temp.linux-x86_64-2.7
  creating build/temp.linux-x86_64-2.7/py
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/kiwisolver.cpp -o build/temp.linux-x86_64-2.7/py/kiwisolver.o
  cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
  In file included from ./kiwi/constraint.h:13:0,
                   from ./kiwi/kiwi.h:9,
                   from py/kiwisolver.cpp:9:
  ./kiwi/strength.h:30:14: warning: 'kiwi::strength::strong' defined but not used [-Wunused-variable]
   const double strong = create( 1.0, 0.0, 0.0 );
                ^
  ./kiwi/strength.h:32:14: warning: 'kiwi::strength::medium' defined but not used [-Wunused-variable]
   const double medium = create( 0.0, 1.0, 0.0 );
                ^
  ./kiwi/strength.h:34:14: warning: 'kiwi::strength::weak' defined but not used [-Wunused-variable]
   const double weak = create( 0.0, 0.0, 1.0 );
                ^
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/constraint.cpp -o build/temp.linux-x86_64-2.7/py/constraint.o
  cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/expression.cpp -o build/temp.linux-x86_64-2.7/py/expression.o
  cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/solver.cpp -o build/temp.linux-x86_64-2.7/py/solver.o
  cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/strength.cpp -o build/temp.linux-x86_64-2.7/py/strength.o
  cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
  py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
   };
   ^
  py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
  py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
  py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
  py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
  py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
  py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
  py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/term.cpp -o build/temp.linux-x86_64-2.7/py/term.o
  cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/variable.cpp -o build/temp.linux-x86_64-2.7/py/variable.o
  cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
  creating build/lib.linux-x86_64-2.7
  c++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wl,-Bsymbolic-functions -Wl,-z,relro -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/py/kiwisolver.o build/temp.linux-x86_64-2.7/py/constraint.o build/temp.linux-x86_64-2.7/py/expression.o build/temp.linux-x86_64-2.7/py/solver.o build/temp.linux-x86_64-2.7/py/strength.o build/temp.linux-x86_64-2.7/py/term.o build/temp.linux-x86_64-2.7/py/variable.o -o build/lib.linux-x86_64-2.7/kiwisolver.so
  c++: error: unrecognized command line option '-Wdate-time'
  c++: error: unrecognized command line option '-fstack-protector-strong'
  c++: error: unrecognized command line option '-Wdate-time'
  c++: error: unrecognized command line option '-fstack-protector-strong'
  error: command 'c++' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for kiwisolver

GPU utilization drops to 0% without any error messages

Hi, I've been training models for almost two days. Today, the GPU utilization suddenly dropped to 0%, but all GPU memory was still occupied by the experiment. In addition, the experiment log no longer shows any output, neither training progress nor error messages.

The upper-left part of the following figure is logs. The lower-left part of the figure shows nvidia-smi.

WXWorkCapture_15572357425815(1)

Does anyone know what's going on?

In addition, I tried to set up the environment without Docker. However, I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot!

Training problem

Hi @drpngx, fraction_of_correct_next_step_preds decreases to zero at step 20k. Is this behavior strange?
image
image
image

Bazel Test issue

root@e3a29cc3bd18:/tmp/lingvo# bazel test -c opt //lingvo:trainer_test //lingvo:models_test
Extracting Bazel installation...
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/root/.cache/bazel/_bazel_root/install/792a28b07894763eaa2bd870f8776b23/_embedded_binaries/A-server.jar) to field java.lang.String.value
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
ERROR: The 'test' command is only supported from within a workspace.
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
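
The "only supported from within a workspace" error means bazel did not find a workspace root at or above the current directory. A minimal sketch for checking that, assuming the workspace is marked by a WORKSPACE file as in the repository listings elsewhere on this page (the function name is hypothetical):

```python
from pathlib import Path


def find_workspace_root(start):
    """Walk upward from start; return the first directory holding a WORKSPACE file, or None."""
    current = Path(start).resolve()
    for directory in [current, *current.parents]:
        if (directory / "WORKSPACE").is_file():
            return directory
    return None


# Prints the workspace root for the current directory, or None if there is none.
print(find_workspace_root("."))
```

A result of None for /tmp/lingvo would suggest that the repository was not cloned or mounted at that path inside the container.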

trainer_test does not pass

test.log
test.log
test.log

== cat /etc/issue ===============================================
Linux ml-gpu-ser341.nmg01 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.5 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
Yes

== compiler =====================================================
c++ (Ubuntu 4.8.5-4ubuntu2) 4.8.5
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== bazel =====================================================
Build label: 0.17.2
Build time: Fri Sep 21 10:31:42 2018 (1537525902)
Build timestamp: 1537525902
Build timestamp as int: 1537525902

== uname -a =====================================================
Linux ml-gpu-ser341.nmg01 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy 1.16.2
protobuf 3.7.0

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.14.1-dev20190324
tf.GIT_VERSION = v1.12.0-10956-g044ff96ba3
tf.COMPILER_VERSION = v1.12.0-10956-g044ff96ba3
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /usr/local/nvidia/lib64/:
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
./tf_env_collect.sh: line 109: nvidia-smi: command not found

== cuda libs ===================================================
/usr/local/cuda-10.0/lib64/libcudart_static.a
/usr/local/cuda-10.0/lib64/libcudart.so.10.0.130
/usr/local/cuda-10.0/doc/man/man7/libcudart.7
/usr/local/cuda-10.0/doc/man/man7/libcudart.so.7

ScopedStepContainer build failure: which TensorFlow version or branch does Lingvo need?

:int64*, tensorflow::lingvo::TensorVec*)::__lambda8, const char [22])'
"GenericInputProcessor");
^
lingvo/core/ops/generic_input_op_kernels.cc:115:32: note: candidates are:
In file included from external/tensorflow_includes/tensorflow_includes/tensorflow/core/common_runtime/device.h:43:0,
from external/tensorflow_includes/tensorflow_includes/tensorflow/core/common_runtime/function.h:22,
from lingvo/core/ops/generic_input_op_kernels.cc:18:
external/tensorflow_includes/tensorflow_includes/tensorflow/core/framework/resource_mgr.h:92:3: note: tensorflow::ScopedStepContainer::ScopedStepContainer(tensorflow::int64, std::function<void(const std::basic_string&)>)
ScopedStepContainer(const int64 step_id,
^
external/tensorflow_includes/tensorflow_includes/tensorflow/core/framework/resource_mgr.h:92:3: note: candidate expects 2 arguments, 3 provided
external/tensorflow_includes/tensorflow_includes/tensorflow/core/framework/resource_mgr.h:87:7: note: tensorflow::ScopedStepContainer::ScopedStepContainer(const tensorflow::ScopedStepContainer&)
class ScopedStepContainer {
^
external/tensorflow_includes/tensorflow_includes/tensorflow/core/framework/resource_mgr.h:87:7: note: candidate expects 1 argument, 3 provided

Regarding GPipe

Hi, I want to test how GPipe works. When I searched the web, I found the Lingvo repository. How do I run it? I did not find any documentation, so I am a little confused.

An issue upon running the "sudo docker run" command during installation

Hello guys:

I ran into an issue after installing the newest version of Docker CE. I followed the instructions exactly. The first two lines went well:

LINGVO_DIR="/tmp/lingvo" # (change to the cloned lingvo directory, e.g. "$HOME/lingvo")
LINGVO_DEVICE="gpu" # (Leave empty to build and run CPU only docker)

Then I copied the file dev.dockerfile into the correct location (${LINGVO_DIR}/docker/dev.dockerfile) and ran the third line:

sudo docker build --tag tensorflow:lingvo $(test "$LINGVO_DEVICE" = "gpu" && echo "--build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04") - < ${LINGVO_DIR}/docker/dev.dockerfile

However, when I ran the fourth line:

charles@node28:~$ sudo docker run --rm $(test "$LINGVO_DEVICE" = "gpu" && echo "--runtime=nvidia") -it -v ${LINGVO_DIR}:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8888:8888 --name lingvo tensorflow:lingvo bash

I got a strange error that I could not resolve by searching online:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=147707 /var/lib/docker/overlay2/85c4fa2cb2bd50d86984c88450aa4a0003c657a8849a15d6b79124b3d62f6650/merged]\\\\nnvidia-container-cli: requirement error: invalid expression\\\\n\\\"\"": unknown.

Can anyone help me out with it or give me some hint?

bazel test with FAILED errors

After running docker, I tried:

bazel test -c opt //lingvo:trainer_test //lingvo:models_test

But some tests fail:

(base) dm@dm-System-Product-Name:/data/xiaoyubei/codes/lingvo$ sudo docker run --rm $(test "$LINGVO_DEVICE" = "gpu" && echo "--runtime=nvidia") -it -v ${LINGVO_DIR}:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8888:8888 --name lingvo tensorflow:lingvo bash
root@8f7d00c977c0:/tmp/lingvo# 
root@8f7d00c977c0:/tmp/lingvo# bazel test -c opt //lingvo:trainer_test //lingvo:models_test
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: Rule 'subpar' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "07ff5feb7c7b113eea593eb6ec50b51099cf0261", shallow_since = "1524766240 -0700" and dropping ["tag"]
INFO: Analysed 2 targets (41 packages loaded, 4890 targets configured).
INFO: Found 2 test targets...
FAIL: //lingvo:trainer_test (shard 4 of 5) (see /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_4_of_5/test.log)
FAIL: //lingvo:trainer_test (shard 3 of 5) (see /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_3_of_5/test.log)
FAIL: //lingvo:trainer_test (shard 5 of 5) (see /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_5_of_5/test.log)

FAILED: //lingvo:trainer_test (Summary)
      /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_4_of_5/test.log
      /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_3_of_5/test.log
      /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_5_of_5/test.log
INFO: Elapsed time: 187.356s, Critical Path: 70.16s
INFO: 29 processes: 29 processwrapper-sandbox.
INFO: Build completed, 1 test FAILED, 45 total actions
//lingvo:models_test                                                     PASSED in 9.8s
//lingvo:trainer_test                                                    FAILED in 3 out of 5 in 60.9s
  Stats over 5 runs: max = 60.9s, min = 3.7s, avg = 17.3s, dev = 22.2s
  /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_4_of_5/test.log
  /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_3_of_5/test.log
  /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_5_of_5/test.log

Executed 2 out of 2 tests: 1 test passes and 1 fails locally.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.
INFO: Build completed, 1 test FAILED, 45 total actions
root@8f7d00c977c0:/tmp/lingvo# 
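
The failed shards and their log paths can be pulled out of the Bazel output mechanically, which helps when many shards fail. The following is a minimal sketch, assuming the `FAIL:` line format shown in the output above; the function name `failed_shard_logs` is hypothetical and not part of Bazel or Lingvo, and the sample paths are shortened for illustration.

```python
import re

# Matches lines such as:
# FAIL: //lingvo:trainer_test (shard 4 of 5) (see /path/to/test.log)
FAIL_LINE = re.compile(
    r"^FAIL: (?P<target>\S+) \(shard (?P<shard>\d+) of (?P<total>\d+)\) "
    r"\(see (?P<log>\S+)\)$"
)


def failed_shard_logs(bazel_output):
    """Return sorted (target, shard, log_path) tuples, one per failed shard."""
    results = []
    for line in bazel_output.splitlines():
        match = FAIL_LINE.match(line.strip())
        if match:
            results.append(
                (match.group("target"), int(match.group("shard")), match.group("log"))
            )
    return sorted(results)


# Shortened, illustrative paths; the real ones appear in the output above.
sample = """\
INFO: Found 2 test targets...
FAIL: //lingvo:trainer_test (shard 4 of 5) (see /tmp/testlogs/shard_4_of_5/test.log)
FAIL: //lingvo:trainer_test (shard 3 of 5) (see /tmp/testlogs/shard_3_of_5/test.log)
//lingvo:models_test                                                     PASSED in 9.8s
"""

for target, shard, log in failed_shard_logs(sample):
    print(target, shard, log)
```

Each returned path is a test.log that records why that shard failed, so its contents are the useful thing to attach to a report like this one. Bazel's `--test_output=errors` option prints the logs of failed tests directly to the console instead.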

Training time

I have been training ASR tasks on 4 GPUs in both sync mode and async mode, but training is very slow. Here is the training log:

INFO:tensorflow:time:6.841992
INFO:tensorflow:2019.03.30-21:26:33 step:    24 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2489777 log_pplx/logits:9.2489777 loss:9.2489777 loss/logits:9.2489777 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.987753
INFO:tensorflow:2019.03.30-21:26:40 step:    25 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2211275 log_pplx/logits:9.2211275 loss:9.2211275 loss/logits:9.2211275 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.675498
INFO:tensorflow:2019.03.30-21:26:47 step:    26 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.1932364 log_pplx/logits:9.1932364 loss:9.1932364 loss/logits:9.1932364 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:7.539548
INFO:tensorflow:2019.03.30-21:26:54 step:    27 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2337065 log_pplx/logits:9.2337065 loss:9.2337065 loss/logits:9.2337065 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.667554
INFO:tensorflow:2019.03.30-21:27:01 step:    28 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2273502 log_pplx/logits:9.2273502 loss:9.2273502 loss/logits:9.2273502 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:7.178711
INFO:tensorflow:2019.03.30-21:27:08 step:    29 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2205391 log_pplx/logits:9.2205391 loss:9.2205391 loss/logits:9.2205391 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.959177
INFO:tensorflow:2019.03.30-21:27:15 step:    30 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2193136 log_pplx/logits:9.2193136 loss:9.2193136 loss/logits:9.2193136 num_samples_in_batch:128 lr:0.00025000001

I have looked at lingvo/lingvo/tasks/asr/params/librispeech.py, and your training appears to take about 1 s per step. Can you give me some advice?

system info:
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 40%   65C    P2    81W / 250W |  11841MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:03:00.0 Off |                  N/A |
| 42%   68C    P2    84W / 250W |  11841MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:83:00.0 Off |                  N/A |
| 45%   72C    P2    95W / 250W |  11841MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:84:00.0 Off |                  N/A |
| 45%   73C    P2    86W / 250W |  11841MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
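
To put a number on "slow", the per-step times in the training log above can be averaged and converted to throughput. This is a minimal sketch, assuming the `INFO:tensorflow:time:` line format and the `num_samples_in_batch:128` value from that log; the helper name `step_times` is hypothetical, and whether 128 is a per-replica or a global batch size is not stated in the report.

```python
import re
from statistics import mean

# Step durations are logged as "INFO:tensorflow:time:<seconds>".
TIME_LINE = re.compile(r"^INFO:tensorflow:time:(?P<seconds>\d+(?:\.\d+)?)$")


def step_times(log_text):
    """Return the step durations (in seconds) found in a trainer log."""
    times = []
    for line in log_text.splitlines():
        match = TIME_LINE.match(line.strip())
        if match:
            times.append(float(match.group("seconds")))
    return times


# The seven step times from the log above; other log lines are ignored.
log = """\
INFO:tensorflow:time:6.841992
INFO:tensorflow:2019.03.30-21:26:33 step:    24 loss:9.2489777
INFO:tensorflow:time:6.987753
INFO:tensorflow:time:6.675498
INFO:tensorflow:time:7.539548
INFO:tensorflow:time:6.667554
INFO:tensorflow:time:7.178711
INFO:tensorflow:time:6.959177
"""

BATCH = 128  # num_samples_in_batch as reported in the log

times = step_times(log)
avg = mean(times)
print(f"steps: {len(times)}, mean step time: {avg:.2f} s")
print(f"throughput: {BATCH / avg:.1f} samples/s")
```

For the seven steps shown, this comes to roughly 7 s per step, or about 18 samples per second. The nvidia-smi snapshot also reports 0% GPU-Util on all four GPUs while their memory is almost full; a single snapshot proves little, but it may be worth checking whether the GPUs are waiting on input or on the CPU.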

undeclared inclusion(s) in rule '//lingvo/core/ops:ascii_tokenizer'

When I run the command bazel build -c opt //lingvo:trainer, I get the following error. If I am doing anything wrong, please correct me. I am running the command from the home directory. This is a follow-up to issue #48: before running the main command specified there, I ran the command above, as described in the other README.

ERROR:
Starting local Bazel server and connecting to it...
INFO: Analysed target //lingvo:trainer (35 packages loaded, 4188 targets configured).
INFO: Found 1 target...
ERROR: /home/guest/lingvo/lingvo/core/ops/BUILD:67:1: undeclared inclusion(s) in rule '//lingvo/core/ops:ascii_tokenizer':
this rule is missing dependency declarations for the following files included by 'lingvo/core/ops/ascii_tokenizer.cc':
'/usr/include/x86_64-linux-gnu/gnu/stubs-64.h'
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 6.231s, Critical Path: 1.86s
INFO: 0 processes.
FAILED: Build did NOT complete successfully

Issue with training LM WordLevelOneBwdsSimpleSampledSoftmax Model

I am using tf-nightly 1.14.1-dev20190307.
I am trying to run the command:
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.WordLevelOneBwdsSimpleSampledSoftmax --logdir=/tmp/lm1b/log --logtostderr
Error:
Waiting for 12.19 seconds before retrying.
I0417 09:53:59.435111 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
I am running this command on a single machine. It exits after waiting a few seconds.

Full error-log:

I0417 09:53:42.698106 139839395571456 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 421, in Start
self._RunLoop('trainer', self._Loop)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
loop_func(*loop_args)
Traceback for above exception (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 454, in _WaitTillInit
global_step = sess.run(self._model.global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
raise type(e)(node_def, op, message)
Waiting for 3.47 seconds before retrying.
I0417 09:53:42.699461 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
I0417 09:53:46.173445 139839395571456 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 421, in Start
self._RunLoop('trainer', self._Loop)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
loop_func(*loop_args)
Traceback for above exception (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 454, in _WaitTillInit
global_step = sess.run(self._model.global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
raise type(e)(node_def, op, message)
Waiting for 5.24 seconds before retrying.
I0417 09:53:46.174993 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
2019-04-17 09:53:49.916693: W tensorflow/core/framework/op_kernel.cc:1408] OP_REQUIRES failed at constant_op.cc:76 : Invalid argument: Cannot parse tensor from tensor_proto.
2019-04-17 09:53:49.916769: E tensorflow/core/common_runtime/executor.cc:636] Executor failed to create kernel. Invalid argument: Cannot parse tensor from tensor_proto.
[[{{node 1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const}}]]
2019-04-17 09:53:50.831555: W tensorflow/core/framework/op_kernel.cc:1408] OP_REQUIRES failed at constant_op.cc:76 : Invalid argument: Cannot parse tensor from proto: dtype: DT_FLOAT
tensor_shape {
dim {
size: 99184
}
dim {
size: 512
}
}
float_val: 1

2019-04-17 09:53:50.831626: E tensorflow/core/common_runtime/executor.cc:636] Executor failed to create kernel. Invalid argument: Cannot parse tensor from proto: dtype: DT_FLOAT
tensor_shape {
dim {
size: 99184
}
dim {
size: 512
}
}
float_val: 1

     [[{{node 1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const}}]]

I0417 09:53:51.035936 139839403964160 base_runner.py:236] controller done (fatal error).
I0417 09:53:51.038496 139839403964160 base_runner.py:115] controller exception: Cannot parse tensor from proto: dtype: DT_FLOAT
tensor_shape {
dim {
size: 99184
}
dim {
size: 512
}
}
float_val: 1

     [[node 1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/optimizer.py:60) ]]

Original stack trace for u'1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const':
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1554, in
tf.app.run(main)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1550, in main
RunnerManager(FLAGS.model).Start()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1543, in Start
self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1311, in CreateRunners
trial)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1265, in _CreateRunner
return self.Controller(cfg, *common_args)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 196, in init
self._model.ConstructFPropBPropGraph()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 1229, in ConstructFPropBPropGraph
self._task.BProp()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 500, in BProp
self._BPropForVariables(vs)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 691, in _BPropForVariables
var_update_op = self.optimizer.Apply(lr, self._var_grads)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py", line 63, in Apply
var_update_op = _Apply()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py", line 60, in _Apply
[(g, v) for (v, g) in var_grad.Flatten()], name='meta_backprop')
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 577, in apply_gradients
self._create_slots(var_list)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/adagrad.py", line 80, in _create_slots
"accumulator", self._name)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 1114, in _get_or_make_slot_with_initializer
var, initializer, shape, dtype, op_name)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 164, in create_slot_with_initializer
dtype)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 74, in _create_slot_var
validate_shape=validate_shape)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1502, in get_variable
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1243, in get_variable
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 567, in get_variable
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 519, in _true_getter
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 934, in _get_single_variable
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 212, in call
return cls._variable_v1_call(*args, **kwargs)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 175, in _variable_v1_call
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 154, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 2519, in default_variable_creator
expected_shape=expected_shape, import_scope=import_scope)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 216, in call
return super(VariableMetaclass, cls).call(*args, **kwargs)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1443, in init
constraint=constraint)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1551, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 906, in
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 247, in call
self.value, dtype=dtype, shape=shape, verify_shape=verify_shape)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 179, in constant_v1
allow_broadcast=False)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 289, in _constant_impl
name=name).outputs[0]
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3479, in create_op
op_def=op_def)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1961, in init
self._traceback = tf_stack.extract_stack()

E0417 09:53:51.039324 139839403964160 base_runner.py:243] Traceback (most recent call last):
E0417 09:53:51.039395 139839403964160 base_runner.py:243] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
E0417 09:53:51.039463 139839403964160 base_runner.py:243] loop_func(*loop_args)
E0417 09:53:51.039511 139839403964160 base_runner.py:243] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 252, in _Loop
E0417 09:53:51.039556 139839403964160 base_runner.py:243] self._RestoreIfNeeded(sess)
E0417 09:53:51.039599 139839403964160 base_runner.py:243] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 314, in _RestoreIfNeeded
E0417 09:53:51.039643 139839403964160 base_runner.py:243] sess.run([self._initialize_all])
E0417 09:53:51.039685 139839403964160 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
E0417 09:53:51.039726 139839403964160 base_runner.py:243] run_metadata_ptr)
E0417 09:53:51.039766 139839403964160 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
E0417 09:53:51.039814 139839403964160 base_runner.py:243] feed_dict_tensor, options, run_metadata)
E0417 09:53:51.039856 139839403964160 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
E0417 09:53:51.039897 139839403964160 base_runner.py:243] run_metadata)
E0417 09:53:51.039937 139839403964160 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
E0417 09:53:51.039978 139839403964160 base_runner.py:243] raise type(e)(node_def, op, message)
E0417 09:53:51.040018 139839403964160 base_runner.py:243] InvalidArgumentError: Cannot parse tensor from proto: dtype: DT_FLOAT
E0417 09:53:51.040071 139839403964160 base_runner.py:243] tensor_shape {
E0417 09:53:51.040110 139839403964160 base_runner.py:243] dim {
E0417 09:53:51.040149 139839403964160 base_runner.py:243] size: 99184
E0417 09:53:51.040189 139839403964160 base_runner.py:243] }
E0417 09:53:51.040227 139839403964160 base_runner.py:243] dim {
E0417 09:53:51.040266 139839403964160 base_runner.py:243] size: 512
E0417 09:53:51.040306 139839403964160 base_runner.py:243] }
E0417 09:53:51.040344 139839403964160 base_runner.py:243] }
E0417 09:53:51.040385 139839403964160 base_runner.py:243] float_val: 1
E0417 09:53:51.040424 139839403964160 base_runner.py:243]
E0417 09:53:51.040462 139839403964160 base_runner.py:243] [[node 1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py:60) ]]
E0417 09:53:51.040510 139839403964160 base_runner.py:243]
E0417 09:53:51.040550 139839403964160 base_runner.py:243] Original stack trace for u'1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const':
E0417 09:53:51.040591 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1554, in
E0417 09:53:51.040630 139839403964160 base_runner.py:243] tf.app.run(main)
E0417 09:53:51.040668 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
E0417 09:53:51.040708 139839403964160 base_runner.py:243] _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
E0417 09:53:51.040747 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
E0417 09:53:51.040806 139839403964160 base_runner.py:243] _run_main(main, args)
E0417 09:53:51.040848 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
E0417 09:53:51.040889 139839403964160 base_runner.py:243] sys.exit(main(argv))
E0417 09:53:51.040927 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1550, in main
E0417 09:53:51.041194 139839403964160 base_runner.py:243] RunnerManager(FLAGS.model).Start()
E0417 09:53:51.041244 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1543, in Start
E0417 09:53:51.041289 139839403964160 base_runner.py:243] self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
E0417 09:53:51.041330 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1311, in CreateRunners
E0417 09:53:51.041371 139839403964160 base_runner.py:243] trial)
E0417 09:53:51.041426 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1265, in _CreateRunner
E0417 09:53:51.041467 139839403964160 base_runner.py:243] return self.Controller(cfg, *common_args)
E0417 09:53:51.041507 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 196, in init
E0417 09:53:51.041548 139839403964160 base_runner.py:243] self._model.ConstructFPropBPropGraph()
E0417 09:53:51.041589 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 1229, in ConstructFPropBPropGraph
E0417 09:53:51.041630 139839403964160 base_runner.py:243] self._task.BProp()
E0417 09:53:51.041670 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 500, in BProp
E0417 09:53:51.041711 139839403964160 base_runner.py:243] self._BPropForVariables(vs)
E0417 09:53:51.041764 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 691, in _BPropForVariables
E0417 09:53:51.041842 139839403964160 base_runner.py:243] var_update_op = self.optimizer.Apply(lr, self._var_grads)
E0417 09:53:51.041887 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py", line 63, in Apply
E0417 09:53:51.041929 139839403964160 base_runner.py:243] var_update_op = _Apply()
E0417 09:53:51.041970 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py", line 60, in _Apply
E0417 09:53:51.042010 139839403964160 base_runner.py:243] [(g, v) for (v, g) in var_grad.Flatten()], name='meta_backprop')
E0417 09:53:51.042052 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 577, in apply_gradients
E0417 09:53:51.042092 139839403964160 base_runner.py:243] self._create_slots(var_list)
E0417 09:53:51.042146 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/adagrad.py", line 80, in _create_slots
E0417 09:53:51.042186 139839403964160 base_runner.py:243] "accumulator", self._name)
E0417 09:53:51.042226 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 1114, in _get_or_make_slot_with_initializer
E0417 09:53:51.042272 139839403964160 base_runner.py:243] var, initializer, shape, dtype, op_name)
E0417 09:53:51.042314 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 164, in create_slot_with_initializer
E0417 09:53:51.042354 139839403964160 base_runner.py:243] dtype)
E0417 09:53:51.042392 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 74, in _create_slot_var
E0417 09:53:51.042433 139839403964160 base_runner.py:243] validate_shape=validate_shape)
E0417 09:53:51.042473 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1502, in get_variable
E0417 09:53:51.042511 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042551 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1243, in get_variable
E0417 09:53:51.042591 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042630 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 567, in get_variable
E0417 09:53:51.042670 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042709 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 519, in _true_getter
E0417 09:53:51.042747 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042823 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 934, in _get_single_variable
E0417 09:53:51.042869 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042910 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 212, in call
E0417 09:53:51.042949 139839403964160 base_runner.py:243] return cls._variable_v1_call(*args, **kwargs)
E0417 09:53:51.042993 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 175, in _variable_v1_call
E0417 09:53:51.043035 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.043088 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 154, in
E0417 09:53:51.043128 139839403964160 base_runner.py:243] previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
E0417 09:53:51.043167 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 2519, in default_variable_creator
E0417 09:53:51.043206 139839403964160 base_runner.py:243] expected_shape=expected_shape, import_scope=import_scope)
E0417 09:53:51.043246 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 216, in call
E0417 09:53:51.043284 139839403964160 base_runner.py:243] return super(VariableMetaclass, cls).call(*args, **kwargs)
E0417 09:53:51.043324 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1443, in init
E0417 09:53:51.043364 139839403964160 base_runner.py:243] constraint=constraint)
E0417 09:53:51.043402 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1551, in _init_from_args
E0417 09:53:51.043442 139839403964160 base_runner.py:243] initial_value(), name="initial_value", dtype=dtype)
E0417 09:53:51.043482 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 906, in
E0417 09:53:51.043520 139839403964160 base_runner.py:243] shape.as_list(), dtype=dtype, partition_info=partition_info)
E0417 09:53:51.043560 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 247, in call
E0417 09:53:51.043600 139839403964160 base_runner.py:243] self.value, dtype=dtype, shape=shape, verify_shape=verify_shape)
E0417 09:53:51.043638 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 179, in constant_v1
E0417 09:53:51.043678 139839403964160 base_runner.py:243] allow_broadcast=False)
E0417 09:53:51.043718 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 289, in _constant_impl
E0417 09:53:51.043756 139839403964160 base_runner.py:243] name=name).outputs[0]
E0417 09:53:51.043818 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
E0417 09:53:51.043859 139839403964160 base_runner.py:243] return func(*args, **kwargs)
E0417 09:53:51.043900 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3479, in create_op
E0417 09:53:51.043941 139839403964160 base_runner.py:243] op_def=op_def)
E0417 09:53:51.043981 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1961, in init
E0417 09:53:51.044020 139839403964160 base_runner.py:243] self._traceback = tf_stack.extract_stack()
E0417 09:53:51.044060 139839403964160 base_runner.py:243]
E0417 09:53:51.044100 139839403964160 base_runner.py:243]
I0417 09:53:51.420242 139839395571456 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 421, in Start
self._RunLoop('trainer', self._Loop)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
loop_func(*loop_args)
Traceback for above exception (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 454, in _WaitTillInit
global_step = sess.run(self._model.global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
raise type(e)(node_def, op, message)
Waiting for 7.94 seconds before retrying.
I0417 09:53:51.421612 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
I0417 09:53:59.368693 139839395571456 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 421, in Start
self._RunLoop('trainer', self._Loop)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
loop_func(*loop_args)
Traceback for above exception (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 454, in _WaitTillInit
global_step = sess.run(self._model.global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
raise type(e)(node_def, op, message)
Waiting for 12.19 seconds before retrying.
I0417 09:53:59.435111 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]

Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

I have successfully built the trainer:

root@d4bff1951ef0:/tmp/lingvo# bazel build -c opt //lingvo:trainer
Starting local Bazel server and connecting to it...
INFO: Analysed target //lingvo:trainer (37 packages loaded, 4708 targets configured).
INFO: Found 1 target...
Target //lingvo:trainer up-to-date:
  bazel-bin/lingvo/trainer
INFO: Elapsed time: 4.297s, Critical Path: 0.19s
INFO: 1 process: 1 processwrapper-sandbox.
INFO: Build completed successfully, 5 total actions

Next, I want to train on a single GPU.

My GPU:
GeForce RTX 2070 8GB

But I got the following errors:

2019-04-26 05:34:50.326859: I lingvo/core/ops/record_yielder.cc:341] Epoch 1 /tmp/librispeech/train/train.tfrecords-*
2019-04-26 05:36:05.261884: I lingvo/core/ops/record_batcher.cc:344] 75 total seconds passed. Total records yielded: 1. Total records skipped: 0
2019-04-26 05:36:15.802940: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libcudnn.so.7
2019-04-26 05:36:26.895189: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.121682: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.306301: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.355048: W ./tensorflow/stream_executor/stream.h:1988] attempting to perform DNN operation using StreamExecutor without DNN support
2019-04-26 05:36:43.990367: I lingvo/core/ops/record_yielder.cc:313] 0x7f486bfdcf60Basic record yielder exit
......
I0426 05:36:57.030641 139959305754368 trainer.py:270] Save checkpoint done: /tmp/librispeech/Wpm/log/train/ckpt-00000000
2019-04-26 05:36:57.065397: I tensorflow/stream_executor/stream.cc:1852] [stream=0x6fa0670,impl=0x6fa0710] did not wait for [stream=0x6f9ffd0,impl=0x6fa0070]
2019-04-26 05:36:57.070022: I tensorflow/stream_executor/stream.cc:1852] [stream=0x6fa0670,impl=0x6fa0710] did not wait for [stream=0x6f9ffd0,impl=0x6fa0070]
2019-04-26 05:36:57.070450: I tensorflow/stream_executor/stream.cc:4800] [stream=0x6fa0670,impl=0x6fa0710] did not memcpy host-to-device; source: 0x7f4afbba5740
2019-04-26 05:36:57.070564: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed
Aborted (core dumped)

I found the issue tensorflow/tensorflow#24496 and, following it, set allow_growth to True in lingvo/core/py_utils.py, line 394:

session_config.gpu_options.allow_growth = True

(I am not sure I modified the right place, because it is hard for me to find where the corresponding parameters are set.) This did not solve my problem.
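As a side note, there is a way to request on-demand GPU memory allocation without editing py_utils.py. TensorFlow documents a `TF_FORCE_GPU_ALLOW_GROWTH` environment variable that has the same effect as `gpu_options.allow_growth = True`. The sketch below only prepares such an environment. Whether the TensorFlow build inside this container honors the variable is an assumption to verify, and the helper name `gpu_env` is hypothetical.

```python
import os


def gpu_env(allow_growth=True, base_env=None):
    """Return a copy of the environment with the allow-growth switch set.

    TF_FORCE_GPU_ALLOW_GROWTH is read by TensorFlow when it initializes the
    GPU; this helper (a hypothetical name) only prepares the variables.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true" if allow_growth else "false"
    return env


if __name__ == "__main__":
    env = gpu_env()
    print(env["TF_FORCE_GPU_ALLOW_GROWTH"])
    # The dict could then be passed as env=... to subprocess.run when
    # launching bazel-bin/lingvo/trainer.
```

If the cuDNN error persists with allow-growth enabled, the cause is probably something other than TensorFlow pre-allocating all GPU memory.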

I think the error may be caused by insufficient GPU memory, so I want to reduce the batch size to lower GPU memory usage. However, I can't find where the batch size is set in the code. Could you please help me? Thanks a lot.

Here are the details:
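For context, Lingvo's sequence input generators batch by length bucket, so there is no single batch-size flag. The per-bucket limits come from the `bucket_batch_limit` list in the input params. The sketch below shows one way to scale such a list down. It assumes the Librispeech input params expose `bucket_batch_limit` (worth checking in the ASR params file), the helper name `scale_batch_limits` is hypothetical, and the numbers are illustrative rather than the real defaults.

```python
def scale_batch_limits(bucket_batch_limit, factor):
    """Scale each per-bucket batch limit, keeping at least one example."""
    if not 0 < factor <= 1:
        raise ValueError("factor must be in (0, 1]")
    return [max(1, int(limit * factor)) for limit in bucket_batch_limit]


# Illustrative values only, not the actual Librispeech defaults.
limits = [96, 48, 48, 48]
print(scale_batch_limits(limits, 0.5))  # [48, 24, 24, 24]
```

The scaled list would then be assigned back to the input params (for example `p.bucket_batch_limit = ...`) in the model's params definition before training.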

I0426 05:34:40.309401 139959226857216 base_runner.py:115] step:     0
I0426 05:34:40.994631 139959305754368 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
I0426 05:34:41.019921 139959305754368 trainer.py:268] Save checkpoint
2019-04-26 05:34:42.855256: W tensorflow/core/framework/allocator.cc:122] Allocation of 200638464 exceeds 10% of system memory.
2019-04-26 05:34:44.004477: W tensorflow/core/framework/allocator.cc:122] Allocation of 200638464 exceeds 10% of system memory.
2019-04-26 05:34:49.756821: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libcublas.so.10.0
2019-04-26 05:34:50.238197: I ./lingvo/core/ops/input_common.h:68] Create RecordProcessor
2019-04-26 05:34:50.325285: I lingvo/core/ops/input_common.cc:30] Input source weights are empty, fall back to legacy behavior.
2019-04-26 05:34:50.325678: I lingvo/core/ops/record_yielder.cc:288] 0x7f486bfdcf60 Record yielder start
2019-04-26 05:34:50.325694: I lingvo/core/ops/record_yielder.cc:290] Randomly seed RecordYielder.
2019-04-26 05:34:50.326849: I ./lingvo/core/ops/input_common.h:73] Create batcher
2019-04-26 05:34:50.326859: I lingvo/core/ops/record_yielder.cc:341] Epoch 1 /tmp/librispeech/train/train.tfrecords-*
2019-04-26 05:36:05.261884: I lingvo/core/ops/record_batcher.cc:344] 75 total seconds passed. Total records yielded: 1. Total records skipped: 0
2019-04-26 05:36:15.802940: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libcudnn.so.7
2019-04-26 05:36:26.895189: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.121682: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.306301: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.355048: W ./tensorflow/stream_executor/stream.h:1988] attempting to perform DNN operation using StreamExecutor without DNN support
2019-04-26 05:36:43.990367: I lingvo/core/ops/record_yielder.cc:313] 0x7f486bfdcf60Basic record yielder exit
I0426 05:36:45.892488 139959226857216 base_runner.py:236] trainer done (fatal error).
I0426 05:36:45.903496 139959226857216 base_runner.py:115] trainer exception: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node fprop/librispeech/tower_0_0/enc/conv_L0/convolution (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py:576) ]]
	 [[gradients/fprop/librispeech/tower_0_0/dec/embedding_lookup_grad/GatherV2_3_G308]]

Errors may have originated from an input operation.
Input Source operations connected to node fprop/librispeech/tower_0_0/enc/conv_L0/convolution:
 fprop/librispeech/tower_0_0/enc/mul (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py:2481)	
 fprop/librispeech/Identity_23 (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py:1313)

Original stack trace for u'fprop/librispeech/tower_0_0/enc/conv_L0/convolution':
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1557, in <module>
    tf.app.run(main)
  File "usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1553, in main
    RunnerManager(FLAGS.model).Start()
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1546, in Start
    self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1314, in CreateRunners
    trial)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1277, in _CreateRunner
    return self.Trainer(cfg, *common_args)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 386, in __init__
    self._model.ConstructFPropBPropGraph()
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 1235, in ConstructFPropBPropGraph
    self._task.FPropDefaultTheta()
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 477, in FPropDefaultTheta
    return self.FProp(self.theta, input_batch)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 394, in FProp
    metrics, per_example = self._FPropSplitInputBatch(theta, input_batch)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 440, in _FPropSplitInputBatch
    metrics, per_example = self.FPropTower(theta_local, batch)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 363, in FPropTower
    predicted = self.ComputePredictions(theta, input_batch)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/model.py", line 124, in ComputePredictions
    encoder_outputs = self._FrontendAndEncoderFProp(theta, input_batch_src)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/model.py", line 156, in _FrontendAndEncoderFProp
    return self.encoder.FProp(theta.encoder, input_batch_src)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/encoder.py", line 333, in FProp
    out_padding)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 485, in FProp
    out = self._Compute(theta, inputs, paddings, conv_padding)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 515, in _Compute
    out = self._ApplyConv(theta, inputs, bn_padding_expanded)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 416, in _ApplyConv
    out = ComputeRawConvolution(filter_w)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 407, in ComputeRawConvolution
    padding_algorithm=padding_algorithm)
  File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 576, in _EvaluateConvKernel
    padding=padding_algorithm)
  File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 894, in convolution
    name=name)
  File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 971, in convolution_internal
    name=name)
  File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3598, in create_op
    op_def=op_def)
  File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1994, in __init__
    self._traceback = tf_stack.extract_stack()


E0426 05:36:45.934931 139959226857216 base_runner.py:243] Traceback (most recent call last):
E0426 05:36:45.935090 139959226857216 base_runner.py:243]   File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 196, in _RunLoop
E0426 05:36:45.935159 139959226857216 base_runner.py:243]     loop_func(*loop_args)
E0426 05:36:45.935215 139959226857216 base_runner.py:243]   File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 508, in _Loop
E0426 05:36:45.935441 139959226857216 base_runner.py:243]     model_task.per_example_tensors,
E0426 05:36:45.935498 139959226857216 base_runner.py:243]   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 948, in run
E0426 05:36:45.935554 139959226857216 base_runner.py:243]     run_metadata_ptr)
E0426 05:36:45.935607 139959226857216 base_runner.py:243]   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1171, in _run
E0426 05:36:45.935661 139959226857216 base_runner.py:243]     feed_dict_tensor, options, run_metadata)
E0426 05:36:45.935713 139959226857216 base_runner.py:243]   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_run
E0426 05:36:45.935761 139959226857216 base_runner.py:243]     run_metadata)
E0426 05:36:45.935817 139959226857216 base_runner.py:243]   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_call
E0426 05:36:45.935863 139959226857216 base_runner.py:243]     raise type(e)(node_def, op, message)
E0426 05:36:45.935920 139959226857216 base_runner.py:243] UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
E0426 05:36:45.935975 139959226857216 base_runner.py:243] 	 [[node fprop/librispeech/tower_0_0/enc/conv_L0/convolution (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py:576) ]]
E0426 05:36:45.936033 139959226857216 base_runner.py:243] 	 [[gradients/fprop/librispeech/tower_0_0/dec/embedding_lookup_grad/GatherV2_3_G308]]
E0426 05:36:45.936084 139959226857216 base_runner.py:243] 
E0426 05:36:45.936136 139959226857216 base_runner.py:243] Errors may have originated from an input operation.
E0426 05:36:45.936187 139959226857216 base_runner.py:243] Input Source operations connected to node fprop/librispeech/tower_0_0/enc/conv_L0/convolution:
E0426 05:36:45.936237 139959226857216 base_runner.py:243]  fprop/librispeech/tower_0_0/enc/mul (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py:2481)	
E0426 05:36:45.936291 139959226857216 base_runner.py:243]  fprop/librispeech/Identity_23 (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py:1313)
E0426 05:36:45.936342 139959226857216 base_runner.py:243] 
E0426 05:36:45.936393 139959226857216 base_runner.py:243] Original stack trace for u'fprop/librispeech/tower_0_0/enc/conv_L0/convolution':
E0426 05:36:45.936444 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1557, in <module>
E0426 05:36:45.936492 139959226857216 base_runner.py:243]     tf.app.run(main)
E0426 05:36:45.936542 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
E0426 05:36:45.936590 139959226857216 base_runner.py:243]     _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
E0426 05:36:45.936642 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
E0426 05:36:45.936691 139959226857216 base_runner.py:243]     _run_main(main, args)
E0426 05:36:45.936743 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
E0426 05:36:45.936791 139959226857216 base_runner.py:243]     sys.exit(main(argv))
E0426 05:36:45.936841 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1553, in main
E0426 05:36:45.936891 139959226857216 base_runner.py:243]     RunnerManager(FLAGS.model).Start()
E0426 05:36:45.936939 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1546, in Start
E0426 05:36:45.936990 139959226857216 base_runner.py:243]     self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
E0426 05:36:45.937038 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1314, in CreateRunners
E0426 05:36:45.937088 139959226857216 base_runner.py:243]     trial)
E0426 05:36:45.937138 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1277, in _CreateRunner
E0426 05:36:45.937187 139959226857216 base_runner.py:243]     return self.Trainer(cfg, *common_args)
E0426 05:36:45.937237 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 386, in __init__
E0426 05:36:45.937285 139959226857216 base_runner.py:243]     self._model.ConstructFPropBPropGraph()
E0426 05:36:45.937335 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 1235, in ConstructFPropBPropGraph
E0426 05:36:45.937383 139959226857216 base_runner.py:243]     self._task.FPropDefaultTheta()
E0426 05:36:45.937434 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 477, in FPropDefaultTheta
E0426 05:36:45.937490 139959226857216 base_runner.py:243]     return self.FProp(self.theta, input_batch)
E0426 05:36:45.937540 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 394, in FProp
E0426 05:36:45.937588 139959226857216 base_runner.py:243]     metrics, per_example = self._FPropSplitInputBatch(theta, input_batch)
E0426 05:36:45.937638 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 440, in _FPropSplitInputBatch
E0426 05:36:45.937689 139959226857216 base_runner.py:243]     metrics, per_example = self.FPropTower(theta_local, batch)
E0426 05:36:45.937736 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 363, in FPropTower
E0426 05:36:45.937787 139959226857216 base_runner.py:243]     predicted = self.ComputePredictions(theta, input_batch)
E0426 05:36:45.937835 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/model.py", line 124, in ComputePredictions
E0426 05:36:45.937886 139959226857216 base_runner.py:243]     encoder_outputs = self._FrontendAndEncoderFProp(theta, input_batch_src)
E0426 05:36:45.937937 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/model.py", line 156, in _FrontendAndEncoderFProp
E0426 05:36:45.937984 139959226857216 base_runner.py:243]     return self.encoder.FProp(theta.encoder, input_batch_src)
E0426 05:36:45.938035 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/encoder.py", line 333, in FProp
E0426 05:36:45.938085 139959226857216 base_runner.py:243]     out_padding)
E0426 05:36:45.938134 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 485, in FProp
E0426 05:36:45.938184 139959226857216 base_runner.py:243]     out = self._Compute(theta, inputs, paddings, conv_padding)
E0426 05:36:45.938231 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 515, in _Compute
E0426 05:36:45.938282 139959226857216 base_runner.py:243]     out = self._ApplyConv(theta, inputs, bn_padding_expanded)
E0426 05:36:45.938330 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 416, in _ApplyConv
E0426 05:36:45.938374 139959226857216 base_runner.py:243]     out = ComputeRawConvolution(filter_w)
E0426 05:36:45.938416 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 407, in ComputeRawConvolution
E0426 05:36:45.938456 139959226857216 base_runner.py:243]     padding_algorithm=padding_algorithm)
E0426 05:36:45.938496 139959226857216 base_runner.py:243]   File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 576, in _EvaluateConvKernel
E0426 05:36:45.938535 139959226857216 base_runner.py:243]     padding=padding_algorithm)
E0426 05:36:45.938587 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 894, in convolution
E0426 05:36:45.938632 139959226857216 base_runner.py:243]     name=name)
E0426 05:36:45.938674 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 971, in convolution_internal
E0426 05:36:45.938719 139959226857216 base_runner.py:243]     name=name)
E0426 05:36:45.938760 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
E0426 05:36:45.938802 139959226857216 base_runner.py:243]     data_format=data_format, dilations=dilations, name=name)
E0426 05:36:45.938844 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
E0426 05:36:45.938886 139959226857216 base_runner.py:243]     op_def=op_def)
E0426 05:36:45.938930 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
E0426 05:36:45.938975 139959226857216 base_runner.py:243]     return func(*args, **kwargs)
E0426 05:36:45.939023 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3598, in create_op
E0426 05:36:45.939069 139959226857216 base_runner.py:243]     op_def=op_def)
E0426 05:36:45.939116 139959226857216 base_runner.py:243]   File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1994, in __init__
E0426 05:36:45.939160 139959226857216 base_runner.py:243]     self._traceback = tf_stack.extract_stack()
E0426 05:36:45.939203 139959226857216 base_runner.py:243] 
E0426 05:36:45.939245 139959226857216 base_runner.py:243] 
W0426 05:36:56.804146 139959305754368 meta_graph.py:447] Issue encountered when serializing __batch_norm_update_dict.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'dict' object has no attribute 'name'
W0426 05:36:56.804761 139959305754368 meta_graph.py:447] Issue encountered when serializing __model_split_id_stack.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'list' object has no attribute 'name'
I0426 05:36:57.030641 139959305754368 trainer.py:270] Save checkpoint done: /tmp/librispeech/Wpm/log/train/ckpt-00000000
2019-04-26 05:36:57.065397: I tensorflow/stream_executor/stream.cc:1852] [stream=0x6fa0670,impl=0x6fa0710] did not wait for [stream=0x6f9ffd0,impl=0x6fa0070]
2019-04-26 05:36:57.070022: I tensorflow/stream_executor/stream.cc:1852] [stream=0x6fa0670,impl=0x6fa0710] did not wait for [stream=0x6f9ffd0,impl=0x6fa0070]
2019-04-26 05:36:57.070450: I tensorflow/stream_executor/stream.cc:4800] [stream=0x6fa0670,impl=0x6fa0710] did not memcpy host-to-device; source: 0x7f4afbba5740
2019-04-26 05:36:57.070564: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed
Aborted (core dumped)

Error when running docker

I built the Docker image successfully beforehand:

sudo docker build --tag tensorflow:lingvo $(test "$LINGVO_DEVICE" = "gpu" && echo "--build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04") - < ${LINGVO_DIR}/docker/dev.dockerfile
Sending build context to Docker daemon 5.12kB
Step 1/19 : ARG cpu_base_image="ubuntu:16.04"
Step 2/19 : ARG base_image=$cpu_base_image
Step 3/19 : FROM $base_image
10.0-cudnn7-runtime-ubuntu16.04: Pulling from nvidia/cuda
34667c7e4631: Pull complete
d18d76a881a4: Pull complete
119c7358fbfc: Pull complete
2aaf13f3eff0: Pull complete
643564d518c8: Pull complete
1fea03e629a4: Pull complete
45402f4cf61d: Pull complete
86f75b2a221d: Downloading
9e547bd511ba: Download complete
EOF

But when I run the container:
sudo docker run --rm $(test "$LINGVO_DEVICE" = "gpu" && echo "--runtime=nvidia") -it -v ${LINGVO_DIR}:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8888:8888 --name lingvo tensorflow:lingvo bash

I got the following error:
docker: Error response from daemon: Unknown runtime specified nvidia

Has anyone else run into the same problem? Thanks!
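
The error message says the Docker daemon has no runtime registered under the name `nvidia`. A minimal sketch for checking that, assuming `docker info --format '{{json .Runtimes}}'` is available on the host; the helper names and the sample JSON are illustrative, not taken from the report:

```python
import json
import subprocess


def has_runtime(runtimes_json, name="nvidia"):
    """Return True if `name` is a key in the JSON object of Docker runtimes."""
    return name in json.loads(runtimes_json)


def docker_runtimes_json():
    """Ask the local Docker daemon for its registered runtimes as JSON.

    Not called at import time, because it needs a running Docker daemon.
    """
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        check=True, capture_output=True, text=True)
    return result.stdout


# Illustrative sample of a daemon that only knows the default runtime.
sample = '{"runc": {"path": "runc"}}'
print(has_runtime(sample))  # False
```

On the affected machine, `has_runtime(docker_runtimes_json())` returning False would match the error above and point at the NVIDIA container runtime not being installed or registered.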

What is controller_gpus/worker_gpus/ps_gpus/evaler_gpus/decoder_gpus?

Hi, I use this command to run the ASR task:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Wpm --logdir=/tmp/librispeech/log --logtostderr --enable_asserts=false

I have 8 GPUs for this experiment. However, it only uses one GPU.

Screenshot from 2019-04-28 11-35-17

So I think some parameters need to change. I then found controller_gpus/worker_gpus/ps_gpus/evaler_gpus/decoder_gpus in lingvo/trainer.py.

Could you tell me what controller_gpus/worker_gpus/ps_gpus/evaler_gpus/decoder_gpus mean? Which one should I change to run this task on multiple GPUs?

I tried setting worker_gpus to 8. The memory of all GPUs is then indeed occupied, but utilization is still not very high, so I am not sure whether I set it correctly.
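
For reference, the attempt described above can be sketched as a small helper that derives `--worker_gpus` from `CUDA_VISIBLE_DEVICES`. The helper name is hypothetical; every flag comes from commands quoted on this page. Whether this is the right setting for good utilization is exactly the open question, so treat it as a sketch of what was tried, not a recommendation:

```python
import os


def build_trainer_argv(model, logdir, visible_devices=None):
    """Build a trainer command line with --worker_gpus matching the visible GPUs."""
    if visible_devices is None:
        visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    gpu_ids = [d for d in visible_devices.split(",") if d.strip()]
    # Falls back to 1 when no device IDs are listed (an arbitrary choice here).
    num_gpus = max(len(gpu_ids), 1)
    return [
        "bazel-bin/lingvo/trainer",
        "--run_locally=gpu",
        "--mode=sync",
        "--model=" + model,
        "--logdir=" + logdir,
        "--logtostderr",
        "--enable_asserts=false",
        "--worker_gpus=%d" % num_gpus,
    ]


argv = build_trainer_argv(
    "asr.librispeech.Librispeech960Wpm", "/tmp/librispeech/log",
    visible_devices="0,1,2,3,4,5,6,7")
print(" ".join(argv))
```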

model training time

What is the exact training time of the model, i.e. for how many steps/epochs will it train?

Why should you use Lingvo rather than T2T?

Hi Lingvo Devs,

TBH, I get confused when a new library comes along offering similar features. After skimming the papers, I found it quite similar to T2T. My only question is: why do we need Lingvo instead of using the existing T2T to develop new ideas? Thanks!

Error while trying to run using docker

Hi, when I tried to build directly without Docker, I ran into many problems and didn't find a solution. Now I have tried using Docker, and I get the following error when I run this command:
Command: bazel build -c opt //lingvo:trainer
Error:
INFO: Analysed target //lingvo:trainer (0 packages loaded).
INFO: Found 1 target...
ERROR: missing input file '@tensorflow_solib//:tensorflow_solib/libtensorflow_framework.so'
ERROR: /tmp/lingvo/lingvo/BUILD:150:1: Creating runfiles tree bazel-out/k8-opt/bin/lingvo/trainer.runfiles failed: Process terminated by signal 15: Process terminated by signal 15
ERROR: /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/external/tensorflow_solib/BUILD:2:1: @tensorflow_solib//:framework_lib: missing input file '@tensorflow_solib//:tensorflow_solib/libtensorflow_framework.so'
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/external/tensorflow_solib/BUILD:2:1 1 input file(s) do not exist
INFO: Elapsed time: 0.195s, Critical Path: 0.01s
INFO: 0 processes.
FAILED: Build did NOT complete successfully

Could not locate tensorflow installation path.

Hi, as you said in #49, the supported version is Ubuntu 16. So I also tried to run on another system that has Ubuntu 16, but I get "Could not locate tensorflow installation path" every time, even though I do have TensorFlow installed on that system.

Command : bazel build -c opt //lingvo:trainer
output:
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: Rule 'subpar' modified arguments {"commit": "07ff5feb7c7b113eea593eb6ec50b51099cf0261", "shallow_since": "1524766240 -0700"} and dropped ["tag"]
ERROR: /home/guest/lingvo/lingvo/core/ops/BUILD:24:1: no such package '@tensorflow_includes//': Traceback (most recent call last):
File "/home/guest/lingvo/lingvo/repo.bzl", line 70
_find_tf_include_path(repo_ctx)
File "/home/guest/lingvo/lingvo/repo.bzl", line 16, in _find_tf_include_path
fail("Could not locate tensorflow ins...")
Could not locate tensorflow installation path. and referenced by '//lingvo/core/ops:x_ops'
ERROR: Analysis of target '//lingvo:trainer' failed; build aborted: no such package '@tensorflow_includes//': Traceback (most recent call last):
File "/home/guest/lingvo/lingvo/repo.bzl", line 70
_find_tf_include_path(repo_ctx)
File "/home/guest/lingvo/lingvo/repo.bzl", line 16, in _find_tf_include_path
fail("Could not locate tensorflow ins...")
Could not locate tensorflow installation path.
INFO: Elapsed time: 1.193s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 24 targets configured)
Fetching @protobuf_protoc; fetching
Fetching @tensorflow_solib; fetching
Fetching @tensorflow_includes; fetching
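
The log does not show how `repo.bzl` searches for TensorFlow, so the following is only a generic check: it reports where a given Python interpreter finds a package, if at all. The helper name is hypothetical. Running it with the same interpreter that the build environment uses may reveal that TensorFlow is installed for a different Python than the one being picked up.

```python
import importlib.util
import os


def find_package_dir(name):
    """Return the install directory of an importable package, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# Prints a directory if this interpreter can see TensorFlow, otherwise None.
print(find_package_dir("tensorflow"))
```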

Getting segmentation fault while trying to run Gpipe example code

Hi, in #48 you mentioned changing the OneBWdsGPipeTransformer hparams to run on 8 GPUs and gave the command to run. I did not understand what those parameters are. Could I get help choosing parameters that fit my system? I am using a machine with 4 GPUs. Whatever parameters I change, I get a segmentation fault (core dumped). I am also attaching my system info (GPU).

command : bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4
segmentation fault.txt

system info:
GPU:
sys_info.txt

Unstable eval_dev outputs?

When running with the following command line
[image: command line]
the eval_dev plot is extremely bumpy (it seems evaluation is run multiple times at each step)
[image: eval_dev plot]
Any guess what the problem could be?

What's the difference between 'sync' mode and 'async' mode?

It seems that we can train on multiple GPUs regardless of whether the mode is set to 'sync' or 'async'. So, could you tell me what the difference between 'sync' mode and 'async' mode is? What are their respective advantages?
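
As background for the question, here is a toy illustration of what "synchronous" and "asynchronous" usually mean for data-parallel SGD in general. It is not Lingvo's trainer code, and how the `--mode` flag maps onto this is an assumption. In the synchronous variant all workers compute gradients at the same parameter value and one averaged update is applied; in the asynchronous variant each worker applies its own gradient as soon as it is ready, possibly computed from a stale parameter value.

```python
def grad(w, target):
    """Gradient of 0.5 * (w - target)**2 with respect to w."""
    return w - target


def sync_sgd(targets, steps, lr=0.1, w=0.0):
    """All workers read the same w; their gradients are averaged into one update."""
    for _ in range(steps):
        g = sum(grad(w, t) for t in targets) / len(targets)
        w -= lr * g
    return w


def async_sgd(targets, steps, lr=0.1, w=0.0):
    """Each worker applies its own gradient, computed from a possibly stale w."""
    stale = [w] * len(targets)      # the copy of w each worker last read
    for step in range(steps):
        k = step % len(targets)     # workers finish in round-robin order
        w -= lr * grad(stale[k], targets[k])
        stale[k] = w                # worker k re-reads w after its update
    return w


# Two workers whose local optima are 1.0 and 3.0; the joint optimum is 2.0.
print(sync_sgd([1.0, 3.0], steps=200))
print(async_sgd([1.0, 3.0], steps=200))
```

In this toy setup the synchronous run settles on the joint optimum, while the asynchronous run keeps hovering near it because every update uses a single worker's slightly outdated view. The usual trade-off is that asynchronous workers never wait for each other.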

Error while trying to run Gpipe example

Hi, since I was getting the error in #49, I have now tried another environment, and I hit the problem below when I run the command: bazel build -c opt //lingvo:trainer. I run this command in the home directory before running the main command that was given to me in #48. Please tell me if I am doing any step wrong.

**ERROR:**
Gpipe_error.txt

What should I feed in ASR inference?

I modified the model inference part of "/codelabs/introduction.ipynb" to run inference for the ASR task.

I have tried feeding many different values, but none of them work.

When I feed the waveform, I get:

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input-13-533e43df44a2> in <module>()
     24 inference_graph = inference_graph_exporter.InferenceGraphExporter.Export(params)
     25 pred = predictor.Predictor(inference_graph, checkpoint=checkpoint, device_type='cpu')
---> 26 hyps, src_frames, en_frames, scores = pred.Run(['hypotheses', 'src_frames', 'encoder_frames', 'scores'], wav=wav_file)
     27 print(hyps)
     28 print(src_frames)

/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in Run(self, fetch_keys, **kwargs)
    206         report_tensor_allocations_upon_oom=False)
    207     return self._RunWithValidSession(
--> 208         tf.Session.run, fetches, feed_dict=feeds, options=run_options)
    209 
    210 

/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/retry.py in wrapper(*args, **kwargs)
     48       for retries in itertools.count(0):
     49         try:
---> 50           return func(*args, **kwargs)
     51         except retry_value as e:
     52           if retries >= max_retries:

/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in _RunWithValidSession(self, fn, *args, **kwargs)
    158     sess_id = self._cur_sess_id
    159     try:
--> 160       return fn(self._sess, *args, **kwargs)
    161     except py_utils.transient_tf_errors:
    162       # self._sess is invalid, most likely due to the worker being preempted.

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    946     try:
    947       result = self._run(None, fetches, feed_dict, options_ptr,
--> 948                          run_metadata_ptr)
    949       if run_metadata:
    950         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1169     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1170       results = self._do_run(handle, final_targets, final_fetches,
-> 1171                              feed_dict_tensor, options, run_metadata)
   1172     else:
   1173       results = []

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1346     if handle is None:
   1347       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1348                            run_metadata)
   1349     else:
   1350       return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
   1366           pass
   1367       message = error_interpolation.interpolate(message, self._graph)
-> 1368       raise type(e)(node_def, op, message)
   1369 
   1370   def _extend_graph(self):

InternalError: Unable to get element as bytes.

When I feed a tensor of the waveform, I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-1a9314d06d8c> in <module>()
     24 inference_graph = inference_graph_exporter.InferenceGraphExporter.Export(params)
     25 pred = predictor.Predictor(inference_graph, checkpoint=checkpoint, device_type='cpu')
---> 26 hyps, src_frames, en_frames, scores = pred.Run(['hypotheses', 'src_frames', 'encoder_frames', 'scores'], wav=wav_tensor)
     27 print(hyps)
     28 print(src_frames)

/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in Run(self, fetch_keys, **kwargs)
    206         report_tensor_allocations_upon_oom=False)
    207     return self._RunWithValidSession(
--> 208         tf.Session.run, fetches, feed_dict=feeds, options=run_options)
    209 
    210 

/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/retry.py in wrapper(*args, **kwargs)
     48       for retries in itertools.count(0):
     49         try:
---> 50           return func(*args, **kwargs)
     51         except retry_value as e:
     52           if retries >= max_retries:

/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in _RunWithValidSession(self, fn, *args, **kwargs)
    158     sess_id = self._cur_sess_id
    159     try:
--> 160       return fn(self._sess, *args, **kwargs)
    161     except py_utils.transient_tf_errors:
    162       # self._sess is invalid, most likely due to the worker being preempted.

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    946     try:
    947       result = self._run(None, fetches, feed_dict, options_ptr,
--> 948                          run_metadata_ptr)
    949       if run_metadata:
    950         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1120                             'For reference, the tensor object was ' +
   1121                             str(feed_val) + ' which was passed to the '
-> 1122                             'feed with key ' + str(feed) + '.')
   1123 
   1124           subfeed_dtype = subfeed_t.dtype.as_numpy_dtype

TypeError: The value of a feed cannot be a tf.Tensor object. Acceptable feed values include Python scalars, strings, lists, numpy ndarrays, or TensorHandles. For reference, the tensor object was Tensor("Const_2:0", shape=(60080,), dtype=int16) which was passed to the feed with key inference/default/wav:0.

When I feed the filename of the wav, I get:

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-15-17fbc70adb3e> in <module>()
     24 inference_graph = inference_graph_exporter.InferenceGraphExporter.Export(params)
     25 pred = predictor.Predictor(inference_graph, checkpoint=checkpoint, device_type='cpu')
---> 26 hyps, src_frames, en_frames, scores = pred.Run(['hypotheses', 'src_frames', 'encoder_frames', 'scores'], wav='/tmp/librispeech/arctic_a0002.wav')
     27 print(hyps)
     28 print(src_frames)

/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in Run(self, fetch_keys, **kwargs)
    206         report_tensor_allocations_upon_oom=False)
    207     return self._RunWithValidSession(
--> 208         tf.Session.run, fetches, feed_dict=feeds, options=run_options)
    209 
    210 

/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/retry.py in wrapper(*args, **kwargs)
     48       for retries in itertools.count(0):
     49         try:
---> 50           return func(*args, **kwargs)
     51         except retry_value as e:
     52           if retries >= max_retries:

/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in _RunWithValidSession(self, fn, *args, **kwargs)
    158     sess_id = self._cur_sess_id
    159     try:
--> 160       return fn(self._sess, *args, **kwargs)
    161     except py_utils.transient_tf_errors:
    162       # self._sess is invalid, most likely due to the worker being preempted.

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    946     try:
    947       result = self._run(None, fetches, feed_dict, options_ptr,
--> 948                          run_metadata_ptr)
    949       if run_metadata:
    950         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1169     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1170       results = self._do_run(handle, final_targets, final_fetches,
-> 1171                              feed_dict_tensor, options, run_metadata)
   1172     else:
   1173       results = []

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1346     if handle is None:
   1347       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1348                            run_metadata)
   1349     else:
   1350       return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
   1366           pass
   1367       message = error_interpolation.interpolate(message, self._graph)
-> 1368       raise type(e)(node_def, op, message)
   1369 
   1370   def _extend_graph(self):

InvalidArgumentError: Header mismatch: Expected RIFF but found /tmp
	 [[node inference/default/DecodeWav (defined at lingvo/core/predictor.py:93) ]]

Original stack trace for u'inference/default/DecodeWav':
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 1073, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 456, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 486, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 438, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2714, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2818, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2878, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-17fbc70adb3e>", line 25, in <module>
    pred = predictor.Predictor(inference_graph, checkpoint=checkpoint, device_type='cpu')
  File "lingvo/core/predictor.py", line 93, in __init__
    tf.import_graph_def(inference_graph.graph_def, name="")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 443, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 236, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3733, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3623, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1994, in __init__
    self._traceback = tf_stack.extract_stack()

I also tried feeding a list of each of them. However, none of them work. I looked at the inference function in /tasks/asr/model.py and the DecodeWav function in /tools/audio_lib.py, which is why I tried the approaches above.

Could you please tell me what I should feed? I would really appreciate it if you could provide inference code for the ASR task. Thank you so much!!
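
One reading of the third error ("Expected RIFF but found /tmp") is that the `wav` feed is handed to DecodeWav as the file's contents, so the path string itself was parsed as a WAV header. Under that assumption, the value to feed would be the raw bytes of the file, not the path, a decoded array, or a tensor. A minimal sketch of producing such bytes; the helper names are hypothetical and the `pred.Run` call in the final comment is untested:

```python
import io
import wave


def make_wav_bytes(num_samples=1600, sample_rate=16000):
    """Build a short silent 16-bit mono WAV file in memory and return its bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)           # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * num_samples)
    return buf.getvalue()


def read_wav_bytes(path):
    """Read a WAV file from disk as raw bytes, header included."""
    with open(path, "rb") as f:
        return f.read()


wav_bytes = make_wav_bytes()
print(wav_bytes[:4])  # b'RIFF'

# Assumption, not verified against Lingvo:
# hyps = pred.Run(['hypotheses'], wav=read_wav_bytes('/tmp/librispeech/arctic_a0002.wav'))
```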

Lingvo Docker Sharing Now!!!

I have built a Lingvo Docker image on Docker Hub. Nvidia CUDA 10 needs docker-v2, which is not supported by my work environment. So I use the following settings:

tensorflow: gpu, v1.12.0
lingvo: master
cuda: 9.0
docker: nvidia-docker (not "-v2")


dlopen(lingvo/core/ops/x_ops.so, 6): image not found

When I execute the following code,

# Running this cell is equivalent to running the following command:
# (cpu) bazel run -c opt //lingvo:trainer -- --logtostderr --model=punctuator.codelab.RNMTModel --mode=sync --logdir=/tmp/punctuator --saver_max_to_keep=2 --run_locally=cpu
# (gpu) bazel run -c opt --config=cuda //lingvo:trainer -- --logtostderr --model=punctuator.codelab.RNMTModel --mode=sync --logdir=/tmp/punctuator --saver_max_to_keep=2 --run_locally=gpu

# Reset the kernel to make sure changes to the model params are re-registered.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=True)

# Start tensorboard (access at http://localhost:6006)
import os
os.system('lsof -t -i:6006 || tensorboard --logdir=/tmp/nqg &')

# Start the trainer
import tensorflow as tf
from lingvo import trainer
argv = [
  "trainer.py",
  "--model=nqg.train.RNMTModel",
  "--mode=sync",
  "--logdir=/tmp/nqg",
  "--saver_max_to_keep=2",
  "--run_locally=gpu",  # or cpu.
]
tf.app.run(trainer.main, argv=argv)

the error is

---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
<ipython-input-1-4b9a3f22cd93> in <module>
      9 # Start the trainer
     10 import tensorflow as tf
---> 11 from lingvo import trainer
     12 argv = [
     13   "trainer.py",

5 frames
/code/lingvo/lingvo/core/ops/py_x_ops.py in <module>
     24 
     25 gen_x_ops = tf.load_op_library(
---> 26     tf.resource_loader.get_path_to_datafile('x_ops.so'))
     27 
     28 if 'assert_shape_match' not in dir(gen_x_ops):

~/Library/Python/3.6/lib/python/site-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename)
     54     RuntimeError: when unable to load the library or get the python wrappers.
     55   """
---> 56   lib_handle = py_tf.TF_LoadLibrary(library_filename)
     57 
     58   op_list_str = py_tf.TF_GetOpList(lib_handle)

NotFoundError: dlopen(/code/lingvo/lingvo/core/ops/x_ops.so, 6): no suitable image found.  Did find:
	/code/lingvo/lingvo/core/ops/x_ops.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00
	/code/lingvo/lingvo/core/ops/x_ops.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00

Does anyone know how to fix it?
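
The eight bytes in the error start with 0x7F 0x45 0x4C 0x46, which is the ELF magic number used by Linux shared objects, while the `~/Library/Python/3.6` path in the traceback suggests macOS, whose loader expects Mach-O files. So `x_ops.so` was likely built for Linux and is being loaded on a Mac. A small sketch for classifying a binary by its first bytes; the function name is illustrative:

```python
def binary_format(header):
    """Classify an executable or shared library by its leading magic bytes."""
    magic = bytes(header[:4])
    if magic == b"\x7fELF":
        return "ELF"
    # 32-bit and 64-bit Mach-O magic numbers, in either byte order.
    if magic in (b"\xce\xfa\xed\xfe", b"\xcf\xfa\xed\xfe",
                 b"\xfe\xed\xfa\xce", b"\xfe\xed\xfa\xcf"):
        return "Mach-O"
    # Universal (fat) Mach-O; note that Java class files share this magic.
    if magic == b"\xca\xfe\xba\xbe":
        return "Mach-O universal"
    return "unknown"


# The bytes reported in the NotFoundError above.
reported = bytes([0x7F, 0x45, 0x4C, 0x46, 0x02, 0x01, 0x01, 0x00])
print(binary_format(reported))  # ELF
```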

Is there a training and decoding recipe for the ASR task?

My command is:

bazel-bin/lingvo/trainer --enable_asserts=false --run_locally=cpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/librispeech/log --logtostderr

The output log is:

I0308 16:13:45.249138 140304451606272 trainer.py:521] step:   905 fraction_of_correct_next_step_preds:0.0082426639 fraction_of_correct_next_step_preds/logits:0.0082426639 log_pplx:4.96068 log_pplx/logits:4.96068 loss:4.96068 loss/logits:4.96068 num_samples_in_batch:48
I0308 16:13:49.943696 140304032200448 trainer.py:371] Steps/second: 0.014225, Examples/second: 0.731300
I0308 16:13:59.952753 140304032200448 trainer.py:371] Steps/second: 0.014223, Examples/second: 0.731181
I0308 16:14:09.964931 140304032200448 trainer.py:371] Steps/second: 0.014221, Examples/second: 0.731061
I0308 16:14:19.976296 140304032200448 trainer.py:371] Steps/second: 0.014218, Examples/second: 0.730942
I0308 16:14:29.987987 140304032200448 trainer.py:371] Steps/second: 0.014216, Examples/second: 0.730823
I0308 16:14:39.996059 140304032200448 trainer.py:371] Steps/second: 0.014214, Examples/second: 0.730704
I0308 16:14:50.002638 140304032200448 trainer.py:371] Steps/second: 0.014211, Examples/second: 0.730585
I0308 16:15:00.011704 140304032200448 trainer.py:371] Steps/second: 0.014209, Examples/second: 0.730466
I0308 16:15:03.377721 140304451606272 trainer.py:521] step:   906 fraction_of_correct_next_step_preds:0.003683995 fraction_of_correct_next_step_preds/logits:0.003683995 log_pplx:4.9685979 log_pplx/logits:4.9685979 loss:4.9685979 loss/logits:4.9685979 num_samples_in_batch:48
I0308 16:15:10.020848 140304032200448 trainer.py:371] Steps/second: 0.014223, Examples/second: 0.731128
I0308 16:15:20.030227 140304032200448 trainer.py:371] Steps/second: 0.014221, Examples/second: 0.731009
I0308 16:15:30.039532 140304032200448 trainer.py:371] Steps/second: 0.014218, Examples/second: 0.730890
I0308 16:15:40.050379 140304032200448 trainer.py:371] Steps/second: 0.014216, Examples/second: 0.730771
I0308 16:15:50.057887 140304032200448 trainer.py:371] Steps/second: 0.014214, Examples/second: 0.730652
I0308 16:16:00.066382 140304032200448 trainer.py:371] Steps/second: 0.014211, Examples/second: 0.730533
I0308 16:16:10.077025 140304032200448 trainer.py:371] Steps/second: 0.014209, Examples/second: 0.730414
I0308 16:16:18.414843 140304451606272 trainer.py:521] step:   907 fraction_of_correct_next_step_preds:0.0043800189 fraction_of_correct_next_step_preds/logits:0.0043800189 log_pplx:4.9653683 log_pplx/logits:4.9653683 loss:4.9653683 loss/logits:4.9653683 num_samples_in_batch:48

1. My task runs as shown above. But after some steps, the loss does not look good?
2. Do you have a recipe for ASR decoding?
Thanks
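
To put the throughput in the log above into perspective, the "Steps/second" lines can be converted into seconds per step. A minimal sketch; the regular expression assumes the exact line format shown in this log:

```python
import re

# Matches the throughput lines printed by trainer.py in the log above.
_RATE_RE = re.compile(r"Steps/second: ([0-9.]+), Examples/second: ([0-9.]+)")


def parse_rates(log_text):
    """Return a list of (steps_per_second, examples_per_second) tuples."""
    return [(float(s), float(e)) for s, e in _RATE_RE.findall(log_text)]


def seconds_per_step(steps_per_second):
    """Convert a steps/second rate into the time taken by one step."""
    return 1.0 / steps_per_second


line = ("I0308 16:13:49.943696 140304032200448 trainer.py:371] "
        "Steps/second: 0.014225, Examples/second: 0.731300")
steps_per_sec, examples_per_sec = parse_rates(line)[0]
print(round(seconds_per_step(steps_per_sec), 1))  # 70.3
```

At roughly 70 seconds per step on CPU, step 907 corresponds to many hours of training, so the model has seen comparatively little data at this point in the log.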

list index out of range when running model

When I run tf.app.run(trainer.main, argv=argv) in the model training part, I get a "list index out of range" error.
Please tell me how to get rid of this...

I can't tell which element it is trying to access.

Training ASR models with GPUs raises exception "Resource exhausted: OOM"

Screenshot 2019-05-08 9:49:30 AM

These are my GPUs. I tried to use them to train both ASR models with the following command:

bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --worker_gpus=2 --logdir=/tmp/librispeech/log --model=asr.librispeech.Librispeech960Wpm --logtostderr --enable_asserts=false >& /tmp/librispeech/log/train.log

The trainer.py process hit the exception before training started. But it can run on CPU.

So, how can I train these models with my GPUs?
