tensorflow / lingvo
License: Apache License 2.0
When I run tf.app.run(trainer.main, argv=argv) in the model training part, I get a "list index out of range" error.
Please tell me how to get rid of it.
I cannot tell which element it is trying to access.
Hi, I use this command to run the ASR task:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Wpm --logdir=/tmp/librispeech/log --logtostderr --enable_asserts=false
I have 8 GPUs available for this experiment. However, the trainer only uses one of them.
So I think some parameters need to be changed. I found controller_gpus/worker_gpus/ps_gpus/evaler_gpus/decoder_gpus in lingvo/trainer.py.
Could you tell me what controller_gpus/worker_gpus/ps_gpus/evaler_gpus/decoder_gpus mean? Which of them should I change to run this task on multiple GPUs?
I tried setting worker_gpus to 8. The memory of all GPUs is then indeed occupied, but the utilization is still not very high, so I am not sure whether I set it correctly.
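For reference, the invocation described in this post can be assembled programmatically. This is a minimal sketch, not an official recipe: the `--worker_gpus` flag name is taken from the post above, `build_trainer_command` is a hypothetical helper, and it does not address the low-utilization question.

```python
import os
import shlex


def build_trainer_command(num_gpus, model, logdir):
    """Return (env, argv) for a local multi-GPU trainer run.

    Hypothetical helper: it only mirrors the command quoted in the post
    and appends the worker_gpus setting the poster tried.
    """
    env = dict(os.environ)
    # Expose GPUs 0..num_gpus-1 to the process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(num_gpus))
    argv = [
        "bazel-bin/lingvo/trainer",
        "--run_locally=gpu",
        "--mode=sync",
        "--model=" + model,
        "--logdir=" + logdir,
        "--logtostderr",
        "--enable_asserts=false",
        "--worker_gpus=%d" % num_gpus,  # flag name as mentioned in the post
    ]
    return env, argv


env, argv = build_trainer_command(
    8, "asr.librispeech.Librispeech960Wpm", "/tmp/librispeech/log")
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"],
      " ".join(shlex.quote(arg) for arg in argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env)` from the repository root once the trainer has been built.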
How can I run lingvo/gpipe on a TPU instead of a GPU?
ERROR: lingvo/lingvo/core/ops/BUILD:288:1: C++ compilation of rule '//lingvo/core/ops:hyps_proto' failed (Exit 1)
In file included from bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.cc:4:0:
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
#error This file was generated by a newer version of protoc which is
^~~~~
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
#error incompatible with your Protocol Buffer headers. Please update
^~~~~
bazel-out/k8-opt/genfiles/lingvo/core/ops/hyps.pb.h:14:2: error: #error your headers.
#error your headers.
^~~~~
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 46.913s, Critical Path: 43.27s
INFO: 16 processes: 16 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
~/.cache/bazel/_bazel_fanlu/8a038f2e6f0570d13154f8149f8b0be3/external/protobuf_protoc/bin/protoc --version
libprotoc 3.6.1
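The compiler message says the generated hyps.pb.h comes from a protoc that is newer than the Protocol Buffer headers on the include path. A quick way to compare the two is to parse the `protoc --version` output and check it against the header version. The sketch below assumes the `libprotoc X.Y.Z` output format shown above; both function names and the header version are hypothetical, used only for illustration.

```python
import re


def parse_protoc_version(output):
    """Parse the text printed by `protoc --version` into a tuple of ints."""
    match = re.search(r"libprotoc\s+(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("unrecognized protoc output: %r" % output)
    return tuple(int(part) for part in match.groups())


def headers_too_old(protoc_version, header_version):
    """True when protoc is newer than the protobuf headers it compiles against."""
    return protoc_version > header_version


protoc = parse_protoc_version("libprotoc 3.6.1")
# HYPOTHETICAL header version; the real one is not shown in the post.
assumed_headers = (3, 6, 0)
print(protoc, headers_too_old(protoc, assumed_headers))
```

If the check reports a mismatch, the protoc binary and the headers shipped with the installed TensorFlow need to be brought to the same version.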
Hi, as you mentioned in #49, the supported version is Ubuntu 16, so I tried building on another system that runs Ubuntu 16. However, I get "Could not locate tensorflow installation path" every time, even though TensorFlow is installed on that system.
Command : bazel build -c opt //lingvo:trainer
output:
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: Rule 'subpar' modified arguments {"commit": "07ff5feb7c7b113eea593eb6ec50b51099cf0261", "shallow_since": "1524766240 -0700"} and dropped ["tag"]
ERROR: /home/guest/lingvo/lingvo/core/ops/BUILD:24:1: no such package '@tensorflow_includes//': Traceback (most recent call last):
File "/home/guest/lingvo/lingvo/repo.bzl", line 70
_find_tf_include_path(repo_ctx)
File "/home/guest/lingvo/lingvo/repo.bzl", line 16, in _find_tf_include_path
fail("Could not locate tensorflow ins...")
Could not locate tensorflow installation path. and referenced by '//lingvo/core/ops:x_ops'
ERROR: Analysis of target '//lingvo:trainer' failed; build aborted: no such package '@tensorflow_includes//': Traceback (most recent call last):
File "/home/guest/lingvo/lingvo/repo.bzl", line 70
_find_tf_include_path(repo_ctx)
File "/home/guest/lingvo/lingvo/repo.bzl", line 16, in _find_tf_include_path
fail("Could not locate tensorflow ins...")
Could not locate tensorflow installation path.
INFO: Elapsed time: 1.193s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded, 24 targets configured)
Fetching @protobuf_protoc; fetching
Fetching @tensorflow_solib; fetching
Fetching @tensorflow_includes; fetching
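One possible cause, stated here as a guess rather than a diagnosis, is that the Python interpreter Bazel picks up is not the one TensorFlow was installed into. The sketch below prints the interpreter path and where a package would be imported from. `locate_package` is a hypothetical helper, and the code assumes Python 3.

```python
import importlib.util
import sys


def locate_package(name):
    """Return the install location of a top-level package, or None if absent."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    # Packages expose their directory; plain modules only have an origin file.
    if spec.submodule_search_locations:
        return list(spec.submodule_search_locations)[0]
    return spec.origin


print("interpreter:", sys.executable)
print("tensorflow:", locate_package("tensorflow"))
```

Running this with the same `python` that is first on the PATH seen by Bazel shows whether TensorFlow is visible to that interpreter.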
Hi, I am trying to run the model mentioned above in Docker. I get an error when I run the following command:
**Command:** bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4
I have a 4-GPU system, so I am using worker_split_size=4. When I asked how to try out GPipe in issue #48, I was given this command and was also asked to modify the OneBWdsGPipeTransformer hparams. I have not changed the hparams yet; is the following error caused by that? If I need to change something, could you tell me which hparams to change? I am posting the error log below:
**Error log:**
I get an error when executing the following code:
# Running this cell is equivalent to running the following command:
# (cpu) bazel run -c opt //lingvo:trainer -- --logtostderr --model=punctuator.codelab.RNMTModel --mode=sync --logdir=/tmp/punctuator --saver_max_to_keep=2 --run_locally=cpu
# (gpu) bazel run -c opt --config=cuda //lingvo:trainer -- --logtostderr --model=punctuator.codelab.RNMTModel --mode=sync --logdir=/tmp/punctuator --saver_max_to_keep=2 --run_locally=gpu
# Reset the kernel to make sure changes to the model params are re-registered.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=True)
# Start tensorboard (access at http://localhost:6006)
import os
os.system('lsof -t -i:6006 || tensorboard --logdir=/tmp/nqg &')
# Start the trainer
import tensorflow as tf
from lingvo import trainer
argv = [
"trainer.py",
"--model=nqg.train.RNMTModel",
"--mode=sync",
"--logdir=/tmp/nqg",
"--saver_max_to_keep=2",
"--run_locally=gpu", # or cpu.
]
tf.app.run(trainer.main, argv=argv)
The error is:
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
<ipython-input-1-4b9a3f22cd93> in <module>
9 # Start the trainer
10 import tensorflow as tf
---> 11 from lingvo import trainer
12 argv = [
13 "trainer.py",
5 frames
/code/lingvo/lingvo/core/ops/py_x_ops.py in <module>
24
25 gen_x_ops = tf.load_op_library(
---> 26 tf.resource_loader.get_path_to_datafile('x_ops.so'))
27
28 if 'assert_shape_match' not in dir(gen_x_ops):
~/Library/Python/3.6/lib/python/site-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename)
54 RuntimeError: when unable to load the library or get the python wrappers.
55 """
---> 56 lib_handle = py_tf.TF_LoadLibrary(library_filename)
57
58 op_list_str = py_tf.TF_GetOpList(lib_handle)
NotFoundError: dlopen(/code/lingvo/lingvo/core/ops/x_ops.so, 6): no suitable image found. Did find:
/code/lingvo/lingvo/core/ops/x_ops.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00
/code/lingvo/lingvo/core/ops/x_ops.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00
Does anyone know how to fix it?
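The eight bytes in the message start with 0x7F 0x45 0x4C 0x46, which is the ELF magic number ("\x7fELF"), while the `dlopen` call and the `~/Library/Python` path suggest the import is running on macOS, which loads Mach-O libraries. That would be consistent with an x_ops.so built on Linux (for example inside Docker) being loaded by a macOS Python, though this is an inference from the log. A minimal sketch for checking a file header, with `identify_binary_format` as a hypothetical helper:

```python
def identify_binary_format(header):
    """Guess a shared-library format from the first bytes of the file."""
    magic = header[:4]
    if magic == b"\x7fELF":
        return "ELF"
    # Thin Mach-O magics, 32-bit and 64-bit, in both byte orders.
    if magic in (b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe",
                 b"\xfe\xed\xfa\xcf", b"\xfe\xed\xfa\xce"):
        return "Mach-O"
    return "unknown"


# The first eight bytes reported in the NotFoundError above.
reported = bytes([0x7F, 0x45, 0x4C, 0x46, 0x02, 0x01, 0x01, 0x00])
print(identify_binary_format(reported))
```

To check a file on disk, read its first bytes with `open(path, "rb").read(8)` and pass them to the function.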
Line 72 in bda36f1
It seems that we can train on multiple GPUs regardless of whether the mode is set to synchronous or asynchronous. Could you tell me the difference between 'sync' mode and 'async' mode? What are their respective advantages?
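As general background on data-parallel training (not a description of how Lingvo implements either mode), the toy sketch below contrasts the two ideas. A synchronous step averages the gradients from all workers and applies one update. An asynchronous scheme applies each worker's gradient as it arrives, so later gradients may have been computed from stale weights. The function names and the quadratic loss are assumptions made for illustration.

```python
def sync_update(weight, worker_grads, lr):
    """One synchronous step: average all workers' gradients, update once."""
    mean_grad = sum(worker_grads) / len(worker_grads)
    return weight - lr * mean_grad


def async_update(weight, worker_grads, lr):
    """Asynchronous steps: apply each worker's gradient as it arrives."""
    for grad in worker_grads:
        # The gradient was computed from an older weight (it is stale).
        weight = weight - lr * grad
    return weight


# Toy loss 0.5 * w**2, whose gradient is w. Both workers read w = 1.0.
w0, lr = 1.0, 0.5
grads = [w0, w0]
print("sync :", sync_update(w0, grads, lr))
print("async:", async_update(w0, grads, lr))
```

In this toy setting the synchronous variant makes one averaged update per round, while the asynchronous variant makes one update per worker without waiting for the others.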
I just installed Lingvo and ran into what looks like a protobuf linking issue. The tf-nightly build seems to be up to date:
mironov@70e0b410070b:~/lingvo$ python -c "import tensorflow as tf;print(tf.__version__)"
1.14.1-dev20190305
The exact error from bazel build is:
mironov@70e0b410070b:~/lingvo$ bazel build -c opt //lingvo:trainer
INFO: Analysed target //lingvo:trainer (22 packages loaded).
INFO: Found 1 target...
ERROR: /workspace/lingvo/lingvo/tools/BUILD:98:1: Linking of rule '//lingvo/tools:generate_proto_def' failed (Exit 1)
bazel-out/host/bin/lingvo/tools/_objs/generate_proto_def/generate_proto_def.o:generate_proto_def.cc:function (anonymous namespace)::WriteDotProto(google::protobuf::FileDescriptor const*, char const*): error: undefined reference to 'google::protobuf::FileDescriptor::DebugString() const'
collect2: error: ld returned 1 exit status
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 3.746s, Critical Path: 2.20s
INFO: 3 processes: 3 processwrapper-sandbox.
FAILED: Build did NOT complete successfully
Could you please check?
I am using tf-nightly 1.14.1-dev20190307.
I am trying to run this command:
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.WordLevelOneBwdsSimpleSampledSoftmax --logdir=/tmp/lm1b/log --logtostderr
Error:
Waiting for 12.19 seconds before retrying.
I0417 09:53:59.435111 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
I am running this command on a single machine. It exits after waiting a few seconds.
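In the log that follows, the repeated "Probably the expected race on global_step" messages are retries, while the run actually stops at the line reporting a controller exception. A small filter can make that easier to see in long logs. This is a sketch that assumes the two message strings appear exactly as in this log; `first_fatal_line` is a hypothetical helper.

```python
EXPECTED_RACE = "Probably the expected race on global_step"
FATAL_MARKER = "controller exception:"


def first_fatal_line(log_lines):
    """Return the first controller exception message, skipping retry noise."""
    for line in log_lines:
        if EXPECTED_RACE in line:
            continue  # retried automatically by the trainer thread
        if FATAL_MARKER in line:
            return line.split(FATAL_MARKER, 1)[1].strip()
    return None


sample = [
    "I0417 09:53:42.699461 139839395571456 trainer.py:456] Probably the "
    "expected race on global_step: Attempting to use uninitialized value "
    "global_step",
    "I0417 09:53:51.038496 139839403964160 base_runner.py:115] controller "
    "exception: Cannot parse tensor from proto: dtype: DT_FLOAT",
]
print(first_fatal_line(sample))
```

Applied to the full log, this points at the "Cannot parse tensor from proto" error on the Adagrad initializer rather than at the global_step messages.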
Full error-log:
I0417 09:53:42.698106 139839395571456 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 421, in Start
self._RunLoop('trainer', self._Loop)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
loop_func(*loop_args)
Traceback for above exception (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 454, in _WaitTillInit
global_step = sess.run(self._model.global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
raise type(e)(node_def, op, message)
Waiting for 3.47 seconds before retrying.
I0417 09:53:42.699461 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
I0417 09:53:46.173445 139839395571456 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 421, in Start
self._RunLoop('trainer', self._Loop)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
loop_func(*loop_args)
Traceback for above exception (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 454, in _WaitTillInit
global_step = sess.run(self._model.global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
raise type(e)(node_def, op, message)
Waiting for 5.24 seconds before retrying.
I0417 09:53:46.174993 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
2019-04-17 09:53:49.916693: W tensorflow/core/framework/op_kernel.cc:1408] OP_REQUIRES failed at constant_op.cc:76 : Invalid argument: Cannot parse tensor from tensor_proto.
2019-04-17 09:53:49.916769: E tensorflow/core/common_runtime/executor.cc:636] Executor failed to create kernel. Invalid argument: Cannot parse tensor from tensor_proto.
[[{{node 1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const}}]]
2019-04-17 09:53:50.831555: W tensorflow/core/framework/op_kernel.cc:1408] OP_REQUIRES failed at constant_op.cc:76 : Invalid argument: Cannot parse tensor from proto: dtype: DT_FLOAT
tensor_shape {
dim {
size: 99184
}
dim {
size: 512
}
}
float_val: 1
2019-04-17 09:53:50.831626: E tensorflow/core/common_runtime/executor.cc:636] Executor failed to create kernel. Invalid argument: Cannot parse tensor from proto: dtype: DT_FLOAT
tensor_shape {
dim {
size: 99184
}
dim {
size: 512
}
}
float_val: 1
[[{{node 1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const}}]]
I0417 09:53:51.035936 139839403964160 base_runner.py:236] controller done (fatal error).
I0417 09:53:51.038496 139839403964160 base_runner.py:115] controller exception: Cannot parse tensor from proto: dtype: DT_FLOAT
tensor_shape {
dim {
size: 99184
}
dim {
size: 512
}
}
float_val: 1
[[node 1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/optimizer.py:60) ]]
Original stack trace for u'1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const':
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1554, in
tf.app.run(main)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1550, in main
RunnerManager(FLAGS.model).Start()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1543, in Start
self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1311, in CreateRunners
trial)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1265, in _CreateRunner
return self.Controller(cfg, *common_args)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 196, in init
self._model.ConstructFPropBPropGraph()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 1229, in ConstructFPropBPropGraph
self._task.BProp()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 500, in BProp
self._BPropForVariables(vs)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 691, in _BPropForVariables
var_update_op = self.optimizer.Apply(lr, self._var_grads)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py", line 63, in Apply
var_update_op = _Apply()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py", line 60, in _Apply
[(g, v) for (v, g) in var_grad.Flatten()], name='meta_backprop')
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 577, in apply_gradients
self._create_slots(var_list)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/adagrad.py", line 80, in _create_slots
"accumulator", self._name)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 1114, in _get_or_make_slot_with_initializer
var, initializer, shape, dtype, op_name)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 164, in create_slot_with_initializer
dtype)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 74, in _create_slot_var
validate_shape=validate_shape)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1502, in get_variable
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1243, in get_variable
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 567, in get_variable
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 519, in _true_getter
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 934, in _get_single_variable
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 212, in call
return cls._variable_v1_call(*args, **kwargs)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 175, in _variable_v1_call
aggregation=aggregation)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 154, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 2519, in default_variable_creator
expected_shape=expected_shape, import_scope=import_scope)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 216, in call
return super(VariableMetaclass, cls).call(*args, **kwargs)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1443, in init
constraint=constraint)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1551, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 906, in
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 247, in call
self.value, dtype=dtype, shape=shape, verify_shape=verify_shape)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 179, in constant_v1
allow_broadcast=False)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 289, in _constant_impl
name=name).outputs[0]
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3479, in create_op
op_def=op_def)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1961, in init
self._traceback = tf_stack.extract_stack()
E0417 09:53:51.039324 139839403964160 base_runner.py:243] Traceback (most recent call last):
E0417 09:53:51.039395 139839403964160 base_runner.py:243] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
E0417 09:53:51.039463 139839403964160 base_runner.py:243] loop_func(*loop_args)
E0417 09:53:51.039511 139839403964160 base_runner.py:243] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 252, in _Loop
E0417 09:53:51.039556 139839403964160 base_runner.py:243] self._RestoreIfNeeded(sess)
E0417 09:53:51.039599 139839403964160 base_runner.py:243] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 314, in _RestoreIfNeeded
E0417 09:53:51.039643 139839403964160 base_runner.py:243] sess.run([self._initialize_all])
E0417 09:53:51.039685 139839403964160 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
E0417 09:53:51.039726 139839403964160 base_runner.py:243] run_metadata_ptr)
E0417 09:53:51.039766 139839403964160 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
E0417 09:53:51.039814 139839403964160 base_runner.py:243] feed_dict_tensor, options, run_metadata)
E0417 09:53:51.039856 139839403964160 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
E0417 09:53:51.039897 139839403964160 base_runner.py:243] run_metadata)
E0417 09:53:51.039937 139839403964160 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
E0417 09:53:51.039978 139839403964160 base_runner.py:243] raise type(e)(node_def, op, message)
E0417 09:53:51.040018 139839403964160 base_runner.py:243] InvalidArgumentError: Cannot parse tensor from proto: dtype: DT_FLOAT
E0417 09:53:51.040071 139839403964160 base_runner.py:243] tensor_shape {
E0417 09:53:51.040110 139839403964160 base_runner.py:243] dim {
E0417 09:53:51.040149 139839403964160 base_runner.py:243] size: 99184
E0417 09:53:51.040189 139839403964160 base_runner.py:243] }
E0417 09:53:51.040227 139839403964160 base_runner.py:243] dim {
E0417 09:53:51.040266 139839403964160 base_runner.py:243] size: 512
E0417 09:53:51.040306 139839403964160 base_runner.py:243] }
E0417 09:53:51.040344 139839403964160 base_runner.py:243] }
E0417 09:53:51.040385 139839403964160 base_runner.py:243] float_val: 1
E0417 09:53:51.040424 139839403964160 base_runner.py:243]
E0417 09:53:51.040462 139839403964160 base_runner.py:243] [[node 1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py:60) ]]
E0417 09:53:51.040510 139839403964160 base_runner.py:243]
E0417 09:53:51.040550 139839403964160 base_runner.py:243] Original stack trace for u'1bwds_word_level_lm/lm/softmax/weight_0/var/Adagrad/Initializer/Const':
E0417 09:53:51.040591 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1554, in
E0417 09:53:51.040630 139839403964160 base_runner.py:243] tf.app.run(main)
E0417 09:53:51.040668 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
E0417 09:53:51.040708 139839403964160 base_runner.py:243] _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
E0417 09:53:51.040747 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
E0417 09:53:51.040806 139839403964160 base_runner.py:243] _run_main(main, args)
E0417 09:53:51.040848 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
E0417 09:53:51.040889 139839403964160 base_runner.py:243] sys.exit(main(argv))
E0417 09:53:51.040927 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1550, in main
E0417 09:53:51.041194 139839403964160 base_runner.py:243] RunnerManager(FLAGS.model).Start()
E0417 09:53:51.041244 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1543, in Start
E0417 09:53:51.041289 139839403964160 base_runner.py:243] self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
E0417 09:53:51.041330 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1311, in CreateRunners
E0417 09:53:51.041371 139839403964160 base_runner.py:243] trial)
E0417 09:53:51.041426 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1265, in _CreateRunner
E0417 09:53:51.041467 139839403964160 base_runner.py:243] return self.Controller(cfg, *common_args)
E0417 09:53:51.041507 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 196, in init
E0417 09:53:51.041548 139839403964160 base_runner.py:243] self._model.ConstructFPropBPropGraph()
E0417 09:53:51.041589 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 1229, in ConstructFPropBPropGraph
E0417 09:53:51.041630 139839403964160 base_runner.py:243] self._task.BProp()
E0417 09:53:51.041670 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 500, in BProp
E0417 09:53:51.041711 139839403964160 base_runner.py:243] self._BPropForVariables(vs)
E0417 09:53:51.041764 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/base_model.py", line 691, in _BPropForVariables
E0417 09:53:51.041842 139839403964160 base_runner.py:243] var_update_op = self.optimizer.Apply(lr, self._var_grads)
E0417 09:53:51.041887 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py", line 63, in Apply
E0417 09:53:51.041929 139839403964160 base_runner.py:243] var_update_op = _Apply()
E0417 09:53:51.041970 139839403964160 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/optimizer.py", line 60, in _Apply
E0417 09:53:51.042010 139839403964160 base_runner.py:243] [(g, v) for (v, g) in var_grad.Flatten()], name='meta_backprop')
E0417 09:53:51.042052 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 577, in apply_gradients
E0417 09:53:51.042092 139839403964160 base_runner.py:243] self._create_slots(var_list)
E0417 09:53:51.042146 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/adagrad.py", line 80, in _create_slots
E0417 09:53:51.042186 139839403964160 base_runner.py:243] "accumulator", self._name)
E0417 09:53:51.042226 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 1114, in _get_or_make_slot_with_initializer
E0417 09:53:51.042272 139839403964160 base_runner.py:243] var, initializer, shape, dtype, op_name)
E0417 09:53:51.042314 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 164, in create_slot_with_initializer
E0417 09:53:51.042354 139839403964160 base_runner.py:243] dtype)
E0417 09:53:51.042392 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/training/slot_creator.py", line 74, in _create_slot_var
E0417 09:53:51.042433 139839403964160 base_runner.py:243] validate_shape=validate_shape)
E0417 09:53:51.042473 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1502, in get_variable
E0417 09:53:51.042511 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042551 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1243, in get_variable
E0417 09:53:51.042591 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042630 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 567, in get_variable
E0417 09:53:51.042670 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042709 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 519, in _true_getter
E0417 09:53:51.042747 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042823 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 934, in _get_single_variable
E0417 09:53:51.042869 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.042910 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 212, in call
E0417 09:53:51.042949 139839403964160 base_runner.py:243] return cls._variable_v1_call(*args, **kwargs)
E0417 09:53:51.042993 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 175, in _variable_v1_call
E0417 09:53:51.043035 139839403964160 base_runner.py:243] aggregation=aggregation)
E0417 09:53:51.043088 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 154, in
E0417 09:53:51.043128 139839403964160 base_runner.py:243] previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
E0417 09:53:51.043167 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 2519, in default_variable_creator
E0417 09:53:51.043206 139839403964160 base_runner.py:243] expected_shape=expected_shape, import_scope=import_scope)
E0417 09:53:51.043246 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 216, in call
E0417 09:53:51.043284 139839403964160 base_runner.py:243] return super(VariableMetaclass, cls).call(*args, **kwargs)
E0417 09:53:51.043324 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1443, in init
E0417 09:53:51.043364 139839403964160 base_runner.py:243] constraint=constraint)
E0417 09:53:51.043402 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1551, in _init_from_args
E0417 09:53:51.043442 139839403964160 base_runner.py:243] initial_value(), name="initial_value", dtype=dtype)
E0417 09:53:51.043482 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 906, in
E0417 09:53:51.043520 139839403964160 base_runner.py:243] shape.as_list(), dtype=dtype, partition_info=partition_info)
E0417 09:53:51.043560 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 247, in call
E0417 09:53:51.043600 139839403964160 base_runner.py:243] self.value, dtype=dtype, shape=shape, verify_shape=verify_shape)
E0417 09:53:51.043638 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 179, in constant_v1
E0417 09:53:51.043678 139839403964160 base_runner.py:243] allow_broadcast=False)
E0417 09:53:51.043718 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 289, in _constant_impl
E0417 09:53:51.043756 139839403964160 base_runner.py:243] name=name).outputs[0]
E0417 09:53:51.043818 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
E0417 09:53:51.043859 139839403964160 base_runner.py:243] return func(*args, **kwargs)
E0417 09:53:51.043900 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3479, in create_op
E0417 09:53:51.043941 139839403964160 base_runner.py:243] op_def=op_def)
E0417 09:53:51.043981 139839403964160 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1961, in __init__
E0417 09:53:51.044020 139839403964160 base_runner.py:243] self._traceback = tf_stack.extract_stack()
E0417 09:53:51.044060 139839403964160 base_runner.py:243]
E0417 09:53:51.044100 139839403964160 base_runner.py:243]
I0417 09:53:51.420242 139839395571456 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 421, in Start
self._RunLoop('trainer', self._Loop)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 196, in _RunLoop
loop_func(*loop_args)
Traceback for above exception (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 454, in _WaitTillInit
global_step = sess.run(self._model.global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
raise type(e)(node_def, op, message)
Waiting for 7.94 seconds before retrying.
I0417 09:53:51.421612 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
I0417 09:53:59.368693 139839395571456 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 421, in Start
self._RunLoop('trainer', self._Loop)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 196, in _RunLoop
loop_func(*loop_args)
Traceback for above exception (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 454, in _WaitTillInit
global_step = sess.run(self._model.global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 930, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
raise type(e)(node_def, op, message)
Waiting for 12.19 seconds before retrying.
I0417 09:53:59.435111 139839395571456 trainer.py:456] Probably the expected race on global_step: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
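The "Retry: caught exception" entries above come from a retry wrapper (retry.py in the trace), and the line "Probably the expected race on global_step" suggests these particular retries are expected while variables are still being initialized. The following is a minimal sketch of retry with a growing wait, not Lingvo's actual implementation; the decorator name, its parameters, and the growth factor are assumptions made for illustration.

```python
import functools
import itertools
import time


def retry(exceptions, initial_delay=1.0, factor=1.5, max_retries=5, sleep=time.sleep):
    """Retries the wrapped function on `exceptions`, growing the wait each time.

    Hypothetical helper: the parameter names and defaults are illustrative only.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in itertools.count(0):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt >= max_retries:
                        raise  # Give up and surface the last error.
                    sleep(delay)
                    delay *= factor

        return wrapper

    return decorator


# Usage: the call fails twice (as if global_step were not initialized yet),
# then succeeds. Waits are recorded in a list instead of actually sleeping.
waits = []
state = {"calls": 0}


@retry(RuntimeError, initial_delay=1.0, factor=2.0, sleep=waits.append)
def read_global_step():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("Attempting to use uninitialized value global_step")
    return 0


result = read_global_step()
print(result, waits)
```

If the retries never stop, the initialization itself is probably failing, and the "E" lines earlier in the log are the place to look.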
Hi, I first tried building directly without Docker and ran into many problems that I could not solve. I then switched to Docker, and I get the following error when I run this command:
Command: bazel build -c opt //lingvo:trainer
Error:
INFO: Analysed target //lingvo:trainer (0 packages loaded).
INFO: Found 1 target...
ERROR: missing input file '@tensorflow_solib//:tensorflow_solib/libtensorflow_framework.so'
ERROR: /tmp/lingvo/lingvo/BUILD:150:1: Creating runfiles tree bazel-out/k8-opt/bin/lingvo/trainer.runfiles failed: Process terminated by signal 15: Process terminated by signal 15
ERROR: /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/external/tensorflow_solib/BUILD:2:1: @tensorflow_solib//:framework_lib: missing input file '@tensorflow_solib//:tensorflow_solib/libtensorflow_framework.so'
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/external/tensorflow_solib/BUILD:2:1 1 input file(s) do not exist
INFO: Elapsed time: 0.195s, Critical Path: 0.01s
INFO: 0 processes.
FAILED: Build did NOT complete successfully
:int64*, tensorflow::lingvo::TensorVec*)::__lambda8, const char [22])'
"GenericInputProcessor");
^
lingvo/core/ops/generic_input_op_kernels.cc:115:32: note: candidates are:
In file included from external/tensorflow_includes/tensorflow_includes/tensorflow/core/common_runtime/device.h:43:0,
from external/tensorflow_includes/tensorflow_includes/tensorflow/core/common_runtime/function.h:22,
from lingvo/core/ops/generic_input_op_kernels.cc:18:
external/tensorflow_includes/tensorflow_includes/tensorflow/core/framework/resource_mgr.h:92:3: note: tensorflow::ScopedStepContainer::ScopedStepContainer(tensorflow::int64, std::function<void(const std::basic_string&)>)
ScopedStepContainer(const int64 step_id,
^
external/tensorflow_includes/tensorflow_includes/tensorflow/core/framework/resource_mgr.h:92:3: note: candidate expects 2 arguments, 3 provided
external/tensorflow_includes/tensorflow_includes/tensorflow/core/framework/resource_mgr.h:87:7: note: tensorflow::ScopedStepContainer::ScopedStepContainer(const tensorflow::ScopedStepContainer&)
class ScopedStepContainer {
^
external/tensorflow_includes/tensorflow_includes/tensorflow/core/framework/resource_mgr.h:87:7: note: candidate expects 1 argument, 3 provided
Line 59 in 31f6008
I cannot find models.py in this repo.
I modified the model inference part of "/codelabs/introduction.ipynb" to run inference for the ASR task.
I have tried feeding several different values, but none of them works.
When I feed the waveform, this error occurs:
---------------------------------------------------------------------------
InternalError Traceback (most recent call last)
<ipython-input-13-533e43df44a2> in <module>()
24 inference_graph = inference_graph_exporter.InferenceGraphExporter.Export(params)
25 pred = predictor.Predictor(inference_graph, checkpoint=checkpoint, device_type='cpu')
---> 26 hyps, src_frames, en_frames, scores = pred.Run(['hypotheses', 'src_frames', 'encoder_frames', 'scores'], wav=wav_file)
27 print(hyps)
28 print(src_frames)
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in Run(self, fetch_keys, **kwargs)
206 report_tensor_allocations_upon_oom=False)
207 return self._RunWithValidSession(
--> 208 tf.Session.run, fetches, feed_dict=feeds, options=run_options)
209
210
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/retry.py in wrapper(*args, **kwargs)
48 for retries in itertools.count(0):
49 try:
---> 50 return func(*args, **kwargs)
51 except retry_value as e:
52 if retries >= max_retries:
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in _RunWithValidSession(self, fn, *args, **kwargs)
158 sess_id = self._cur_sess_id
159 try:
--> 160 return fn(self._sess, *args, **kwargs)
161 except py_utils.transient_tf_errors:
162 # self._sess is invalid, most likely due to the worker being preempted.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
946 try:
947 result = self._run(None, fetches, feed_dict, options_ptr,
--> 948 run_metadata_ptr)
949 if run_metadata:
950 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
1169 if final_fetches or final_targets or (handle and feed_dict_tensor):
1170 results = self._do_run(handle, final_targets, final_fetches,
-> 1171 feed_dict_tensor, options, run_metadata)
1172 else:
1173 results = []
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1346 if handle is None:
1347 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1348 run_metadata)
1349 else:
1350 return self._do_call(_prun_fn, handle, feeds, fetches)
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
1366 pass
1367 message = error_interpolation.interpolate(message, self._graph)
-> 1368 raise type(e)(node_def, op, message)
1369
1370 def _extend_graph(self):
InternalError: Unable to get element as bytes.
When I feed a tensor of the waveform, this error occurs:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-1a9314d06d8c> in <module>()
24 inference_graph = inference_graph_exporter.InferenceGraphExporter.Export(params)
25 pred = predictor.Predictor(inference_graph, checkpoint=checkpoint, device_type='cpu')
---> 26 hyps, src_frames, en_frames, scores = pred.Run(['hypotheses', 'src_frames', 'encoder_frames', 'scores'], wav=wav_tensor)
27 print(hyps)
28 print(src_frames)
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in Run(self, fetch_keys, **kwargs)
206 report_tensor_allocations_upon_oom=False)
207 return self._RunWithValidSession(
--> 208 tf.Session.run, fetches, feed_dict=feeds, options=run_options)
209
210
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/retry.py in wrapper(*args, **kwargs)
48 for retries in itertools.count(0):
49 try:
---> 50 return func(*args, **kwargs)
51 except retry_value as e:
52 if retries >= max_retries:
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in _RunWithValidSession(self, fn, *args, **kwargs)
158 sess_id = self._cur_sess_id
159 try:
--> 160 return fn(self._sess, *args, **kwargs)
161 except py_utils.transient_tf_errors:
162 # self._sess is invalid, most likely due to the worker being preempted.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
946 try:
947 result = self._run(None, fetches, feed_dict, options_ptr,
--> 948 run_metadata_ptr)
949 if run_metadata:
950 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
1120 'For reference, the tensor object was ' +
1121 str(feed_val) + ' which was passed to the '
-> 1122 'feed with key ' + str(feed) + '.')
1123
1124 subfeed_dtype = subfeed_t.dtype.as_numpy_dtype
TypeError: The value of a feed cannot be a tf.Tensor object. Acceptable feed values include Python scalars, strings, lists, numpy ndarrays, or TensorHandles. For reference, the tensor object was Tensor("Const_2:0", shape=(60080,), dtype=int16) which was passed to the feed with key inference/default/wav:0.
When I feed the filename of the wav file, this error occurs:
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-15-17fbc70adb3e> in <module>()
24 inference_graph = inference_graph_exporter.InferenceGraphExporter.Export(params)
25 pred = predictor.Predictor(inference_graph, checkpoint=checkpoint, device_type='cpu')
---> 26 hyps, src_frames, en_frames, scores = pred.Run(['hypotheses', 'src_frames', 'encoder_frames', 'scores'], wav='/tmp/librispeech/arctic_a0002.wav')
27 print(hyps)
28 print(src_frames)
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in Run(self, fetch_keys, **kwargs)
206 report_tensor_allocations_upon_oom=False)
207 return self._RunWithValidSession(
--> 208 tf.Session.run, fetches, feed_dict=feeds, options=run_options)
209
210
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/retry.py in wrapper(*args, **kwargs)
48 for retries in itertools.count(0):
49 try:
---> 50 return func(*args, **kwargs)
51 except retry_value as e:
52 if retries >= max_retries:
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/bin/lingvo/ipython_kernel.runfiles/__main__/lingvo/core/predictor.py in _RunWithValidSession(self, fn, *args, **kwargs)
158 sess_id = self._cur_sess_id
159 try:
--> 160 return fn(self._sess, *args, **kwargs)
161 except py_utils.transient_tf_errors:
162 # self._sess is invalid, most likely due to the worker being preempted.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
946 try:
947 result = self._run(None, fetches, feed_dict, options_ptr,
--> 948 run_metadata_ptr)
949 if run_metadata:
950 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
1169 if final_fetches or final_targets or (handle and feed_dict_tensor):
1170 results = self._do_run(handle, final_targets, final_fetches,
-> 1171 feed_dict_tensor, options, run_metadata)
1172 else:
1173 results = []
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1346 if handle is None:
1347 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1348 run_metadata)
1349 else:
1350 return self._do_call(_prun_fn, handle, feeds, fetches)
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
1366 pass
1367 message = error_interpolation.interpolate(message, self._graph)
-> 1368 raise type(e)(node_def, op, message)
1369
1370 def _extend_graph(self):
InvalidArgumentError: Header mismatch: Expected RIFF but found /tmp
[[node inference/default/DecodeWav (defined at lingvo/core/predictor.py:93) ]]
Original stack trace for u'inference/default/DecodeWav':
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelapp.py", line 499, in start
self.io_loop.start()
File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 1073, in start
handler_func(fd_obj, events)
File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 456, in _handle_events
self._handle_recv()
File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 486, in _handle_recv
self._run_callback(callback, msg)
File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 438, in _run_callback
callback(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
handler(stream, idents, msg)
File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/usr/local/lib/python2.7/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/usr/local/lib/python2.7/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2714, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2818, in run_ast_nodes
if self.run_code(code, result):
File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2878, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-15-17fbc70adb3e>", line 25, in <module>
pred = predictor.Predictor(inference_graph, checkpoint=checkpoint, device_type='cpu')
File "lingvo/core/predictor.py", line 93, in __init__
tf.import_graph_def(inference_graph.graph_def, name="")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 443, in import_graph_def
_ProcessNewOps(graph)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 236, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3733, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3623, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1994, in __init__
self._traceback = tf_stack.extract_stack()
I also tried passing each of them as a list, but that did not work either. I tried the approaches above after looking at the inference function in /tasks/asr/model.py and the DecodeWav function in /tools/audio_lib.py.
Could you please tell me what I should feed? It would be very helpful if you could provide inference code for the ASR task. Thank you so much!
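The third error ("Expected RIFF but found /tmp") suggests that the wav feed goes straight into a DecodeWav op, which parses the raw contents of a WAV file and therefore expects bytes beginning with a RIFF header, not a path, a tensor, or decoded samples. A minimal sketch under that assumption; the helper names are hypothetical, and the commented-out feed at the end mirrors the call in the traceback but is untested against Lingvo.

```python
import io
import wave


def make_wav_bytes(num_samples=1600, sample_rate=16000):
    """Builds a short silent 16-bit mono WAV in memory (stand-in for a real file)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = int16
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * num_samples)
    return buf.getvalue()


def looks_like_wav(data):
    """True if `data` is a bytes object that starts with a RIFF/WAVE header."""
    return isinstance(data, bytes) and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


wav_bytes = make_wav_bytes()
path_as_bytes = b"/tmp/librispeech/arctic_a0002.wav"

print(looks_like_wav(wav_bytes))      # file contents
print(looks_like_wav(path_as_bytes))  # a path: its first four bytes are "/tmp"

# With a real file, read its contents and feed those (assumption, untested):
# with open('/tmp/librispeech/arctic_a0002.wav', 'rb') as f:
#     wav_bytes = f.read()
# pred.Run([...], wav=wav_bytes)
```

The path fails the check because its first four bytes are "/tmp", which matches the text of the error message.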
Hi, I was getting the error in #49, so I tried again in another environment. Now I am facing the problem below when I run the command: bazel build -c opt //lingvo:trainer. I run this command in the home directory before running the main command that was given to me in #48. Please tell me if I am doing any step wrong.
ERROR:
Gpipe_error.txt
Hi Lingvo Devs,
To be honest, I get confused when a new library offers similar features. After skimming the papers, I found Lingvo quite similar to T2T. My only question is: why do we need Lingvo instead of using the existing T2T to develop new ideas? Thanks!
ERROR: error loading package 'lingvo': Encountered error while reading extension file 'subpar.bzl': no such package '@subpar//': Traceback (most recent call last):
File "/home/luban/.cache/bazel/_bazel_luban/b5ef85f1c360696308ba7ab9000cfd03/external/bazel_tools/tools/build_defs/repo/git.bzl", line 166
_clone_or_update(ctx)
File "/home/luban/.cache/bazel/_bazel_luban/b5ef85f1c360696308ba7ab9000cfd03/external/bazel_tools/tools/build_defs/repo/git.bzl", line 72, in _clone_or_update
fail(("error cloning %s:\n%s" % (ctx....)))
error cloning subpar:
I have built a Lingvo Docker image on Docker Hub. Nvidia-cuda-10 needs docker-v2, which my work environment does not support, so I use the following settings:
tensorflow: gpu, v1.12.0
lingvo: master
cuda: 9.0
docker: nvidia-docker (not "-v2")
When I ran the mnist example, I found that the hyperparams are set to default values such as the following:
task.train.max_steps : 4000000
Where can I set these params? Do I have to modify lingvo/tasks/image/params/mnist.py directly, or some other code?
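One common way to change a single hyperparameter without editing the original params file is to subclass the params class and override only the value of interest. The sketch below uses plain-Python stand-ins, not Lingvo's classes or registry; the class names, the task() method, and the value 10000 are all hypothetical, and only the dotted name task.train.max_steps and its default come from the log above.

```python
from types import SimpleNamespace


class MnistBase:
    """Hypothetical stand-in for a model-params class (not Lingvo's API)."""

    def task(self):
        # Mirrors the dotted name printed in the log: task.train.max_steps
        return SimpleNamespace(train=SimpleNamespace(max_steps=4000000))


class MnistShort(MnistBase):
    """Variant that overrides one hyperparameter and inherits the rest."""

    def task(self):
        p = super().task()
        p.train.max_steps = 10000  # illustrative value
        return p


default_steps = MnistBase().task().train.max_steps
short_steps = MnistShort().task().train.max_steps
print(default_steps, short_steps)
```

The base definition stays untouched, so the default experiment and the variant can both be run and compared.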
What is the exact training time of the model, i.e. for how many steps/epochs will it train?
def async(): is a syntax error in Python >= 3.7 so perhaps asynchronous or non_blocking should be used instead.
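A quick way to confirm this on whatever interpreter is in use is to ask the keyword module and to try compiling the definition. This is a minimal sketch; the helper name is_valid_function_name is made up for illustration.

```python
import keyword
import sys


def is_valid_function_name(name):
    """True if `name` can be used as a function name on the running interpreter."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ("async", "asynchronous", "non_blocking"):
    print(candidate, is_valid_function_name(candidate))

# Compiling the definition shows the failure directly on Python >= 3.7.
try:
    compile("def async(): pass", "<example>", "exec")
    compiled = True
except SyntaxError:
    compiled = False

print(sys.version_info[:2], "def async() compiles:", compiled)
```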
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --worker_gpus=2 --logdir=/tmp/librispeech/log --model=asr.librispeech.Librispeech960Wpm --logtostderr --enable_asserts=false >& /tmp/librispeech/log/train.log
The trainer.py process hit an exception before training started, although it runs fine on CPU.
So, how can I train these models on my GPUs?
I have built the docker image successfully before.
`sudo docker build --tag tensorflow:lingvo $(test "$LINGVO_DEVICE" = "gpu" && echo "--build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04") - < ${LINGVO_DIR}/docker/dev.dockerfile
Sending build context to Docker daemon 5.12kB
Step 1/19 : ARG cpu_base_image="ubuntu:16.04"
Step 2/19 : ARG base_image=$cpu_base_image
Step 3/19 : FROM $base_image
10.0-cudnn7-runtime-ubuntu16.04: Pulling from nvidia/cuda
34667c7e4631: Pull complete
d18d76a881a4: Pull complete
119c7358fbfc: Pull complete
2aaf13f3eff0: Pull complete
643564d518c8: Pull complete
1fea03e629a4: Pull complete
45402f4cf61d: Pull complete
86f75b2a221d: Downloading
9e547bd511ba: Download complete
EOF`
But when I run docker:
sudo docker run --rm $(test "$LINGVO_DEVICE" = "gpu" && echo "--runtime=nvidia") -it -v ${LINGVO_DIR}:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8888:8888 --name lingvo tensorflow:lingvo bash
I got the following error:
docker: Error response from daemon: Unknown runtime specified nvidia
Has anyone run into the same problem? Thanks!
https://github.com/tensorflow/lingvo/blob/master/lingvo/core/py_utils.py#L1381
The docstring of the function def MaskGradients(var_grad, grad_mask, grad_onehot) is wrong, and I do not know what this method does.
DEBUG: Rule 'subpar' modified arguments {"commit": "07ff5feb7c7b113eea593eb6ec50b51099cf0261", "shallow_since": "1524766240 -0700"} and dropped ["tag"]
ERROR: /home/sck/gitRepo/lingvo/lingvo/core/ops/BUILD:277:1: no such package '@tensorflow_includes//': Traceback (most recent call last):
File "/home/sck/gitRepo/lingvo/lingvo/repo.bzl", line 70
_find_tf_include_path(repo_ctx)
File "/home/sck/gitRepo/lingvo/lingvo/repo.bzl", line 16, in _find_tf_include_path
fail("Could not locate tensorflow ins...")
Could not locate tensorflow installation path. and referenced by '//lingvo/core/ops:tokenizer_ops_kernels'
ERROR: Analysis of target '//lingvo:trainer' failed; build aborted: no such package '@tensorflow_includes//': Traceback (most recent call last):
File "/home/sck/gitRepo/lingvo/lingvo/repo.bzl", line 70
_find_tf_include_path(repo_ctx)
File "/home/sck/gitRepo/lingvo/lingvo/repo.bzl", line 16, in _find_tf_include_path
fail("Could not locate tensorflow ins...")
Could not locate tensorflow installation path.
I am using a conda virtual env and I have installed tensorflow in it. How can I make the build locate the tensorflow installation path? Thanks
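The failure is raised by _find_tf_include_path in repo.bzl. One possible cause, assuming the build calls a different Python interpreter than the one inside the conda env, is that tensorflow is simply not visible to that interpreter. A minimal check, to be run with the interpreter the build is expected to use; the helper name find_package_dir is made up for illustration.

```python
import importlib.util
import os
import sys


def find_package_dir(name):
    """Returns the directory a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# Shows which interpreter is running and whether it can see tensorflow.
print("interpreter:", sys.executable)
print("tensorflow :", find_package_dir("tensorflow"))
```

If the second line prints None, that interpreter cannot import tensorflow, and activating the conda env in the shell that runs bazel may be worth trying.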
Hi, I've been training models for almost two days. Today, GPU utilization suddenly dropped to 0%, but all GPU memory is still occupied by the experiment. In addition, the experiment log no longer shows any output, neither training progress nor error messages.
The upper-left part of the following figure shows the logs. The lower-left part shows nvidia-smi.
Does anyone know what's going on?
In addition, I tried to set up the environment without docker. However, I ran into #32, where I pasted my error at the end. Could you please help me? Thanks a lot!
I have successfully built the trainer:
root@d4bff1951ef0:/tmp/lingvo# bazel build -c opt //lingvo:trainer
Starting local Bazel server and connecting to it...
INFO: Analysed target //lingvo:trainer (37 packages loaded, 4708 targets configured).
INFO: Found 1 target...
Target //lingvo:trainer up-to-date:
bazel-bin/lingvo/trainer
INFO: Elapsed time: 4.297s, Critical Path: 0.19s
INFO: 1 process: 1 processwrapper-sandbox.
INFO: Build completed successfully, 5 total actions
Then I wanted to train with one GPU.
My GPU info:
GeForce RTX 2070 8GB
But I got the following errors:
2019-04-26 05:34:50.326859: I lingvo/core/ops/record_yielder.cc:341] Epoch 1 /tmp/librispeech/train/train.tfrecords-*
2019-04-26 05:36:05.261884: I lingvo/core/ops/record_batcher.cc:344] 75 total seconds passed. Total records yielded: 1. Total records skipped: 0
2019-04-26 05:36:15.802940: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libcudnn.so.7
2019-04-26 05:36:26.895189: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.121682: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.306301: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.355048: W ./tensorflow/stream_executor/stream.h:1988] attempting to perform DNN operation using StreamExecutor without DNN support
2019-04-26 05:36:43.990367: I lingvo/core/ops/record_yielder.cc:313] 0x7f486bfdcf60Basic record yielder exit
......
I0426 05:36:57.030641 139959305754368 trainer.py:270] Save checkpoint done: /tmp/librispeech/Wpm/log/train/ckpt-00000000
2019-04-26 05:36:57.065397: I tensorflow/stream_executor/stream.cc:1852] [stream=0x6fa0670,impl=0x6fa0710] did not wait for [stream=0x6f9ffd0,impl=0x6fa0070]
2019-04-26 05:36:57.070022: I tensorflow/stream_executor/stream.cc:1852] [stream=0x6fa0670,impl=0x6fa0710] did not wait for [stream=0x6f9ffd0,impl=0x6fa0070]
2019-04-26 05:36:57.070450: I tensorflow/stream_executor/stream.cc:4800] [stream=0x6fa0670,impl=0x6fa0710] did not memcpy host-to-device; source: 0x7f4afbba5740
2019-04-26 05:36:57.070564: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed
Aborted (core dumped)
I found the issue tensorflow/tensorflow#24496 and set allow_growth to True in lingvo/core/py_utils.py, line 394:
session_config.gpu_options.allow_growth = True
(I am not sure I modified the right place, because it is hard for me to find where the corresponding parameters are.) But it still does not solve my problem.
I think it may be caused by my limited GPU memory, so I want to reduce the batch size to use less GPU memory. But I cannot find where to change the batch size in the code. Could you please help me? Thanks a lot.
Here are the details:
I0426 05:34:40.309401 139959226857216 base_runner.py:115] step: 0
I0426 05:34:40.994631 139959305754368 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
I0426 05:34:41.019921 139959305754368 trainer.py:268] Save checkpoint
2019-04-26 05:34:42.855256: W tensorflow/core/framework/allocator.cc:122] Allocation of 200638464 exceeds 10% of system memory.
2019-04-26 05:34:44.004477: W tensorflow/core/framework/allocator.cc:122] Allocation of 200638464 exceeds 10% of system memory.
2019-04-26 05:34:49.756821: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libcublas.so.10.0
2019-04-26 05:34:50.238197: I ./lingvo/core/ops/input_common.h:68] Create RecordProcessor
2019-04-26 05:34:50.325285: I lingvo/core/ops/input_common.cc:30] Input source weights are empty, fall back to legacy behavior.
2019-04-26 05:34:50.325678: I lingvo/core/ops/record_yielder.cc:288] 0x7f486bfdcf60 Record yielder start
2019-04-26 05:34:50.325694: I lingvo/core/ops/record_yielder.cc:290] Randomly seed RecordYielder.
2019-04-26 05:34:50.326849: I ./lingvo/core/ops/input_common.h:73] Create batcher
2019-04-26 05:34:50.326859: I lingvo/core/ops/record_yielder.cc:341] Epoch 1 /tmp/librispeech/train/train.tfrecords-*
2019-04-26 05:36:05.261884: I lingvo/core/ops/record_batcher.cc:344] 75 total seconds passed. Total records yielded: 1. Total records skipped: 0
2019-04-26 05:36:15.802940: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libcudnn.so.7
2019-04-26 05:36:26.895189: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.121682: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.306301: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-26 05:36:27.355048: W ./tensorflow/stream_executor/stream.h:1988] attempting to perform DNN operation using StreamExecutor without DNN support
2019-04-26 05:36:43.990367: I lingvo/core/ops/record_yielder.cc:313] 0x7f486bfdcf60Basic record yielder exit
I0426 05:36:45.892488 139959226857216 base_runner.py:236] trainer done (fatal error).
I0426 05:36:45.903496 139959226857216 base_runner.py:115] trainer exception: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node fprop/librispeech/tower_0_0/enc/conv_L0/convolution (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py:576) ]]
[[gradients/fprop/librispeech/tower_0_0/dec/embedding_lookup_grad/GatherV2_3_G308]]
Errors may have originated from an input operation.
Input Source operations connected to node fprop/librispeech/tower_0_0/enc/conv_L0/convolution:
fprop/librispeech/tower_0_0/enc/mul (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py:2481)
fprop/librispeech/Identity_23 (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py:1313)
Original stack trace for u'fprop/librispeech/tower_0_0/enc/conv_L0/convolution':
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1557, in <module>
tf.app.run(main)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1553, in main
RunnerManager(FLAGS.model).Start()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1546, in Start
self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1314, in CreateRunners
trial)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1277, in _CreateRunner
return self.Trainer(cfg, *common_args)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 386, in __init__
self._model.ConstructFPropBPropGraph()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 1235, in ConstructFPropBPropGraph
self._task.FPropDefaultTheta()
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 477, in FPropDefaultTheta
return self.FProp(self.theta, input_batch)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 394, in FProp
metrics, per_example = self._FPropSplitInputBatch(theta, input_batch)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 440, in _FPropSplitInputBatch
metrics, per_example = self.FPropTower(theta_local, batch)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 363, in FPropTower
predicted = self.ComputePredictions(theta, input_batch)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/model.py", line 124, in ComputePredictions
encoder_outputs = self._FrontendAndEncoderFProp(theta, input_batch_src)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/model.py", line 156, in _FrontendAndEncoderFProp
return self.encoder.FProp(theta.encoder, input_batch_src)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/encoder.py", line 333, in FProp
out_padding)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 485, in FProp
out = self._Compute(theta, inputs, paddings, conv_padding)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 515, in _Compute
out = self._ApplyConv(theta, inputs, bn_padding_expanded)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 416, in _ApplyConv
out = ComputeRawConvolution(filter_w)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 407, in ComputeRawConvolution
padding_algorithm=padding_algorithm)
File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 576, in _EvaluateConvKernel
padding=padding_algorithm)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 894, in convolution
name=name)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 971, in convolution_internal
name=name)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3598, in create_op
op_def=op_def)
File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1994, in __init__
self._traceback = tf_stack.extract_stack()
E0426 05:36:45.934931 139959226857216 base_runner.py:243] Traceback (most recent call last):
E0426 05:36:45.935090 139959226857216 base_runner.py:243] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 196, in _RunLoop
E0426 05:36:45.935159 139959226857216 base_runner.py:243] loop_func(*loop_args)
E0426 05:36:45.935215 139959226857216 base_runner.py:243] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 508, in _Loop
E0426 05:36:45.935441 139959226857216 base_runner.py:243] model_task.per_example_tensors,
E0426 05:36:45.935498 139959226857216 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 948, in run
E0426 05:36:45.935554 139959226857216 base_runner.py:243] run_metadata_ptr)
E0426 05:36:45.935607 139959226857216 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1171, in _run
E0426 05:36:45.935661 139959226857216 base_runner.py:243] feed_dict_tensor, options, run_metadata)
E0426 05:36:45.935713 139959226857216 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_run
E0426 05:36:45.935761 139959226857216 base_runner.py:243] run_metadata)
E0426 05:36:45.935817 139959226857216 base_runner.py:243] File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_call
E0426 05:36:45.935863 139959226857216 base_runner.py:243] raise type(e)(node_def, op, message)
E0426 05:36:45.935920 139959226857216 base_runner.py:243] UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
E0426 05:36:45.935975 139959226857216 base_runner.py:243] [[node fprop/librispeech/tower_0_0/enc/conv_L0/convolution (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py:576) ]]
E0426 05:36:45.936033 139959226857216 base_runner.py:243] [[gradients/fprop/librispeech/tower_0_0/dec/embedding_lookup_grad/GatherV2_3_G308]]
E0426 05:36:45.936084 139959226857216 base_runner.py:243]
E0426 05:36:45.936136 139959226857216 base_runner.py:243] Errors may have originated from an input operation.
E0426 05:36:45.936187 139959226857216 base_runner.py:243] Input Source operations connected to node fprop/librispeech/tower_0_0/enc/conv_L0/convolution:
E0426 05:36:45.936237 139959226857216 base_runner.py:243] fprop/librispeech/tower_0_0/enc/mul (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py:2481)
E0426 05:36:45.936291 139959226857216 base_runner.py:243] fprop/librispeech/Identity_23 (defined at tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py:1313)
E0426 05:36:45.936342 139959226857216 base_runner.py:243]
E0426 05:36:45.936393 139959226857216 base_runner.py:243] Original stack trace for u'fprop/librispeech/tower_0_0/enc/conv_L0/convolution':
E0426 05:36:45.936444 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1557, in <module>
E0426 05:36:45.936492 139959226857216 base_runner.py:243] tf.app.run(main)
E0426 05:36:45.936542 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
E0426 05:36:45.936590 139959226857216 base_runner.py:243] _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
E0426 05:36:45.936642 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
E0426 05:36:45.936691 139959226857216 base_runner.py:243] _run_main(main, args)
E0426 05:36:45.936743 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
E0426 05:36:45.936791 139959226857216 base_runner.py:243] sys.exit(main(argv))
E0426 05:36:45.936841 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1553, in main
E0426 05:36:45.936891 139959226857216 base_runner.py:243] RunnerManager(FLAGS.model).Start()
E0426 05:36:45.936939 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1546, in Start
E0426 05:36:45.936990 139959226857216 base_runner.py:243] self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
E0426 05:36:45.937038 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1314, in CreateRunners
E0426 05:36:45.937088 139959226857216 base_runner.py:243] trial)
E0426 05:36:45.937138 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1277, in _CreateRunner
E0426 05:36:45.937187 139959226857216 base_runner.py:243] return self.Trainer(cfg, *common_args)
E0426 05:36:45.937237 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 386, in __init__
E0426 05:36:45.937285 139959226857216 base_runner.py:243] self._model.ConstructFPropBPropGraph()
E0426 05:36:45.937335 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 1235, in ConstructFPropBPropGraph
E0426 05:36:45.937383 139959226857216 base_runner.py:243] self._task.FPropDefaultTheta()
E0426 05:36:45.937434 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 477, in FPropDefaultTheta
E0426 05:36:45.937490 139959226857216 base_runner.py:243] return self.FProp(self.theta, input_batch)
E0426 05:36:45.937540 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 394, in FProp
E0426 05:36:45.937588 139959226857216 base_runner.py:243] metrics, per_example = self._FPropSplitInputBatch(theta, input_batch)
E0426 05:36:45.937638 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 440, in _FPropSplitInputBatch
E0426 05:36:45.937689 139959226857216 base_runner.py:243] metrics, per_example = self.FPropTower(theta_local, batch)
E0426 05:36:45.937736 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 363, in FPropTower
E0426 05:36:45.937787 139959226857216 base_runner.py:243] predicted = self.ComputePredictions(theta, input_batch)
E0426 05:36:45.937835 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/model.py", line 124, in ComputePredictions
E0426 05:36:45.937886 139959226857216 base_runner.py:243] encoder_outputs = self._FrontendAndEncoderFProp(theta, input_batch_src)
E0426 05:36:45.937937 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/model.py", line 156, in _FrontendAndEncoderFProp
E0426 05:36:45.937984 139959226857216 base_runner.py:243] return self.encoder.FProp(theta.encoder, input_batch_src)
E0426 05:36:45.938035 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/asr/encoder.py", line 333, in FProp
E0426 05:36:45.938085 139959226857216 base_runner.py:243] out_padding)
E0426 05:36:45.938134 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 485, in FProp
E0426 05:36:45.938184 139959226857216 base_runner.py:243] out = self._Compute(theta, inputs, paddings, conv_padding)
E0426 05:36:45.938231 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 515, in _Compute
E0426 05:36:45.938282 139959226857216 base_runner.py:243] out = self._ApplyConv(theta, inputs, bn_padding_expanded)
E0426 05:36:45.938330 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 416, in _ApplyConv
E0426 05:36:45.938374 139959226857216 base_runner.py:243] out = ComputeRawConvolution(filter_w)
E0426 05:36:45.938416 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 407, in ComputeRawConvolution
E0426 05:36:45.938456 139959226857216 base_runner.py:243] padding_algorithm=padding_algorithm)
E0426 05:36:45.938496 139959226857216 base_runner.py:243] File "tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers.py", line 576, in _EvaluateConvKernel
E0426 05:36:45.938535 139959226857216 base_runner.py:243] padding=padding_algorithm)
E0426 05:36:45.938587 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 894, in convolution
E0426 05:36:45.938632 139959226857216 base_runner.py:243] name=name)
E0426 05:36:45.938674 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 971, in convolution_internal
E0426 05:36:45.938719 139959226857216 base_runner.py:243] name=name)
E0426 05:36:45.938760 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
E0426 05:36:45.938802 139959226857216 base_runner.py:243] data_format=data_format, dilations=dilations, name=name)
E0426 05:36:45.938844 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
E0426 05:36:45.938886 139959226857216 base_runner.py:243] op_def=op_def)
E0426 05:36:45.938930 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
E0426 05:36:45.938975 139959226857216 base_runner.py:243] return func(*args, **kwargs)
E0426 05:36:45.939023 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3598, in create_op
E0426 05:36:45.939069 139959226857216 base_runner.py:243] op_def=op_def)
E0426 05:36:45.939116 139959226857216 base_runner.py:243] File "usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1994, in __init__
E0426 05:36:45.939160 139959226857216 base_runner.py:243] self._traceback = tf_stack.extract_stack()
E0426 05:36:45.939203 139959226857216 base_runner.py:243]
E0426 05:36:45.939245 139959226857216 base_runner.py:243]
W0426 05:36:56.804146 139959305754368 meta_graph.py:447] Issue encountered when serializing __batch_norm_update_dict.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'dict' object has no attribute 'name'
W0426 05:36:56.804761 139959305754368 meta_graph.py:447] Issue encountered when serializing __model_split_id_stack.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'list' object has no attribute 'name'
I0426 05:36:57.030641 139959305754368 trainer.py:270] Save checkpoint done: /tmp/librispeech/Wpm/log/train/ckpt-00000000
2019-04-26 05:36:57.065397: I tensorflow/stream_executor/stream.cc:1852] [stream=0x6fa0670,impl=0x6fa0710] did not wait for [stream=0x6f9ffd0,impl=0x6fa0070]
2019-04-26 05:36:57.070022: I tensorflow/stream_executor/stream.cc:1852] [stream=0x6fa0670,impl=0x6fa0710] did not wait for [stream=0x6f9ffd0,impl=0x6fa0070]
2019-04-26 05:36:57.070450: I tensorflow/stream_executor/stream.cc:4800] [stream=0x6fa0670,impl=0x6fa0710] did not memcpy host-to-device; source: 0x7f4afbba5740
2019-04-26 05:36:57.070564: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed
Aborted (core dumped)
bazel run -c opt //lingvo:trainer -- --logtostderr --model=punctuator.codelab.RNMTModel --mode=sync --logdir=/tmp/punctuator --saver_max_to_keep=2 --run_locally=cpu --enable_asserts=false
Running this from the command line or from the notebook gives the same problem, shown below:
...
...
I0316 18:10:57.939176 139757224118016 base_runner.py:115] step: 0
W0316 18:11:00.613759 139757240633088 meta_graph.py:447] Issue encountered when serializing __batch_norm_update_dict.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'dict' object has no attribute 'name'
W0316 18:11:00.614701 139757240633088 meta_graph.py:447] Issue encountered when serializing __model_split_id_stack.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'list' object has no attribute 'name'
I0316 18:11:01.184139 139757240633088 trainer.py:270] Save checkpoint done: /tmp/punctuator/train/ckpt-00000000
I0316 18:11:01.189898 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
I0316 18:11:11.201571 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
2019-03-16 18:11:18.382848: I ./lingvo/core/ops/input_common.h:63] Create RecordProcessor
2019-03-16 18:11:18.551229: I lingvo/core/ops/input_common.cc:28] Input source weights are empty, fall back to legacy behavior.
2019-03-16 18:11:18.551383: I lingvo/core/ops/record_yielder.cc:167] 0x7f1bb518f940 Record yielder start
2019-03-16 18:11:18.551424: I lingvo/core/ops/record_yielder.cc:169] Randomly seed RecordYielder.
2019-03-16 18:11:18.551447: I ./lingvo/core/ops/input_common.h:68] Create batcher
2019-03-16 18:11:18.551517: I lingvo/core/ops/record_yielder.cc:217] Epoch 1 /tmp/punctuator_data/train.txt
I0316 18:11:24.267724 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
I0316 18:11:31.178570 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
I0316 18:11:41.254617 139757240633088 trainer.py:371] Steps/second: 0.000000, Examples/second: 0.000000
Killed
Hi, in #48 you mentioned changing the OneBWdsGPipeTransformer hparams and gave a command for running on 8 GPUs. I do not understand what those parameters mean. Could you help me choose values that fit my system? My machine has 4 GPUs. Whatever parameters I change, I get a segmentation fault (core dumped). I am also attaching my system info (GPU).
command : bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4
segmentation fault.txt
system info:
GPU:
sys_info.txt
I use tf-nightly-gpu==1.13.0-dev20181116. When I run the ASR task, I get the error below.
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/librispeech/log --logtostderr
INFO:tensorflow:Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 401, in Start
self._RunLoop('trainer', self._Loop)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 173, in _RunLoop
loop_func(*args)
Traceback for above exception (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/retry.py", line 50, in wrapper
return func(*args, **kwargs)
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 416, in _WaitTillInit
global_step = sess.run(self._model.global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
Waiting for 0.11 seconds before retrying.
After starting the Docker container, I ran:
bazel test -c opt //lingvo:trainer_test //lingvo:models_test
But some tests fail:
(base) dm@dm-System-Product-Name:/data/xiaoyubei/codes/lingvo$ sudo docker run --rm $(test "$LINGVO_DEVICE" = "gpu" && echo "--runtime=nvidia") -it -v ${LINGVO_DIR}:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8888:8888 --name lingvo tensorflow:lingvo bash
root@8f7d00c977c0:/tmp/lingvo#
root@8f7d00c977c0:/tmp/lingvo# bazel test -c opt //lingvo:trainer_test //lingvo:models_test
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
DEBUG: Rule 'subpar' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "07ff5feb7c7b113eea593eb6ec50b51099cf0261", shallow_since = "1524766240 -0700" and dropping ["tag"]
INFO: Analysed 2 targets (41 packages loaded, 4890 targets configured).
INFO: Found 2 test targets...
FAIL: //lingvo:trainer_test (shard 4 of 5) (see /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_4_of_5/test.log)
FAIL: //lingvo:trainer_test (shard 3 of 5) (see /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_3_of_5/test.log)
FAIL: //lingvo:trainer_test (shard 5 of 5) (see /root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_5_of_5/test.log)
FAILED: //lingvo:trainer_test (Summary)
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_4_of_5/test.log
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_3_of_5/test.log
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_5_of_5/test.log
INFO: Elapsed time: 187.356s, Critical Path: 70.16s
INFO: 29 processes: 29 processwrapper-sandbox.
INFO: Build completed, 1 test FAILED, 45 total actions
//lingvo:models_test PASSED in 9.8s
//lingvo:trainer_test FAILED in 3 out of 5 in 60.9s
Stats over 5 runs: max = 60.9s, min = 3.7s, avg = 17.3s, dev = 22.2s
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_4_of_5/test.log
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_3_of_5/test.log
/root/.cache/bazel/_bazel_root/17eb95f0bc03547f4f1319e61997e114/execroot/__main__/bazel-out/k8-opt/testlogs/lingvo/trainer_test/shard_5_of_5/test.log
Executed 2 out of 2 tests: 1 test passes and 1 fails locally.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command linINFO: Build completed, 1 test FAILED, 45 total actions
root@8f7d00c977c0:/tmp/lingvo#
Hi, I want to test how GPipe works. While searching the web I found the lingvo repository. How do I run it? I could not find any documentation, so I am a little confused.
== cat /etc/issue ===============================================
Linux ml-gpu-ser341.nmg01 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.5 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
== are we in docker =============================================
Yes
== compiler =====================================================
c++ (Ubuntu 4.8.5-4ubuntu2) 4.8.5
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== bazel =====================================================
Build label: 0.17.2
Build time: Fri Sep 21 10:31:42 2018 (1537525902)
Build timestamp: 1537525902
Build timestamp as int: 1537525902
== uname -a =====================================================
Linux ml-gpu-ser341.nmg01 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy 1.16.2
protobuf 3.7.0
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.14.1-dev20190324
tf.GIT_VERSION = v1.12.0-10956-g044ff96ba3
tf.COMPILER_VERSION = v1.12.0-10956-g044ff96ba3
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /usr/local/nvidia/lib64/:
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
./tf_env_collect.sh: line 109: nvidia-smi: command not found
== cuda libs ===================================================
/usr/local/cuda-10.0/lib64/libcudart_static.a
/usr/local/cuda-10.0/lib64/libcudart.so.10.0.130
/usr/local/cuda-10.0/doc/man/man7/libcudart.7
/usr/local/cuda-10.0/doc/man/man7/libcudart.so.7
I am not familiar with bazel, so I don't know how to run "librispeech.03.parameterize_train.sh", "librispeech.04.parameterize_devtest.sh", and the ASR task. I tried running "librispeech.03.parameterize_train.sh" directly from the shell:
root@8f7d00c977c0:/tmp/lingvo# ls
CONTRIBUTING.md README.md bazel-genfiles bazel-testlogs docs tf_env_collect.sh
LICENSE WORKSPACE bazel-lingvo codelabs experiments.md
PUBLICATIONS.md bazel-bin bazel-out docker lingvo
root@8f7d00c977c0:/tmp/lingvo# sh ./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh
=== First pass, collecting transcripts: train-clean-100
./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh: 31: ./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh: bazel-bin/lingvo/tools/create_asr_features: not found
root@8f7d00c977c0:/tmp/lingvo# ls
CONTRIBUTING.md README.md bazel-genfiles bazel-testlogs docs tf_env_collect.sh
LICENSE WORKSPACE bazel-lingvo codelabs experiments.md
PUBLICATIONS.md bazel-bin bazel-out docker lingvo
root@8f7d00c977c0:/tmp/lingvo# exit
This error occurs:
=== First pass, collecting transcripts: train-clean-100
./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh: 31: ./lingvo/tasks/asr/tools/librispeech.03.parameterize_train.sh: bazel-bin/lingvo/tools/create_asr_features: not found
So I think it has to be run with bazel, but I don't know how. Could you please share step-by-step shell commands for running "librispeech.03.parameterize_train.sh", "librispeech.04.parameterize_devtest.sh", and the ASR task?
Thanks!!!
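The "not found" error above names a file under bazel-bin/, and that directory only holds targets that have already been built. Here is a minimal sketch of a pre-flight check, not an official lingvo tool. It assumes the usual Bazel layout, where bazel-bin/<package>/<name> comes from the label //<package>:<name>. Whether that target exists in your checkout is an assumption. The `-c opt` flag just mirrors the other commands in this thread, and the helper names are made up for illustration.

```python
import os

# Path copied from the "not found" error above, relative to the repo root.
REQUIRED_BINARIES = ["bazel-bin/lingvo/tools/create_asr_features"]


def bazel_label(bin_path):
    """Map bazel-bin/<package>/<name> to the label //<package>:<name>.

    This follows the standard Bazel output layout; it does not check that
    the target is actually defined in a BUILD file.
    """
    rel = bin_path.split("bazel-bin/", 1)[1]
    package, name = rel.rsplit("/", 1)
    return "//{}:{}".format(package, name)


def missing_build_commands(repo_root, binaries=REQUIRED_BINARIES):
    """Return one `bazel build` command per binary that is not built yet."""
    commands = []
    for bin_path in binaries:
        if not os.path.exists(os.path.join(repo_root, bin_path)):
            commands.append("bazel build -c opt " + bazel_label(bin_path))
    return commands


if __name__ == "__main__":
    # Run from the repo root (/tmp/lingvo inside the container).
    for cmd in missing_build_commands("."):
        print(cmd)
```

If this prints a build command, run it from the repo root before starting the script again.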
Hello guys:
I ran into an issue after installing the newest version of Docker CE, and I am confused because I followed the instructions exactly. The first two lines went well:
LINGVO_DIR="/tmp/lingvo" # (change to the cloned lingvo directory, e.g. "$HOME/lingvo")
LINGVO_DEVICE="gpu" # (Leave empty to build and run CPU only docker)
Then, I copied the file dev.dockerfile into the correct location (${LINGVO_DIR}/docker/dev.dockerfile)
and ran the third line:
sudo docker build --tag tensorflow:lingvo $(test "$LINGVO_DEVICE" = "gpu" && echo "--build-arg base_image=nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04") - < ${LINGVO_DIR}/docker/dev.dockerfile
However, when I ran the fourth line:
charles@node28:~$ sudo docker run --rm $(test "$LINGVO_DEVICE" = "gpu" && echo "--runtime=nvidia") -it -v ${LINGVO_DIR}:/tmp/lingvo -v ${HOME}/.gitconfig:/home/${USER}/.gitconfig:ro -p 6006:6006 -p 8888:8888 --name lingvo tensorflow:lingvo bash
I got a strange error that I could not resolve by searching online:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=147707 /var/lib/docker/overlay2/85c4fa2cb2bd50d86984c88450aa4a0003c657a8849a15d6b79124b3d62f6650/merged]\\\\nnvidia-container-cli: requirement error: invalid expression\\\\n\\\"\"": unknown.
Can anyone help me out with it or give me some hint?
I trained the ASR task on 4 GPUs in both sync mode and async mode, but training was very slow. Here is the training log:
INFO:tensorflow:time:6.841992
INFO:tensorflow:2019.03.30-21:26:33 step: 24 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2489777 log_pplx/logits:9.2489777 loss:9.2489777 loss/logits:9.2489777 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.987753
INFO:tensorflow:2019.03.30-21:26:40 step: 25 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2211275 log_pplx/logits:9.2211275 loss:9.2211275 loss/logits:9.2211275 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.675498
INFO:tensorflow:2019.03.30-21:26:47 step: 26 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.1932364 log_pplx/logits:9.1932364 loss:9.1932364 loss/logits:9.1932364 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:7.539548
INFO:tensorflow:2019.03.30-21:26:54 step: 27 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2337065 log_pplx/logits:9.2337065 loss:9.2337065 loss/logits:9.2337065 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.667554
INFO:tensorflow:2019.03.30-21:27:01 step: 28 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2273502 log_pplx/logits:9.2273502 loss:9.2273502 loss/logits:9.2273502 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:7.178711
INFO:tensorflow:2019.03.30-21:27:08 step: 29 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2205391 log_pplx/logits:9.2205391 loss:9.2205391 loss/logits:9.2205391 num_samples_in_batch:128 lr:0.00025000001
INFO:tensorflow:time:6.959177
INFO:tensorflow:2019.03.30-21:27:15 step: 30 fraction_of_correct_next_step_preds:0 fraction_of_correct_next_step_preds/logits:0 log_pplx:9.2193136 log_pplx/logits:9.2193136 loss:9.2193136 loss/logits:9.2193136 num_samples_in_batch:128 lr:0.00025000001
I have looked at lingvo/lingvo/tasks/asr/params/librispeech.py, and it suggests that one training step takes about 1 s in your setup. Can you give me some advice?
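To put a number on "slow", here is a small sketch that averages the `time:` values from the log above and converts them to examples per second. It assumes `time:` is the wall-clock seconds for one step. That matches the roughly 7 s spacing of the step timestamps, but it is not confirmed anywhere in this thread. The helper name `step_times` is made up for illustration.

```python
import re

# The "time:" lines copied from the training log above.
LOG_LINES = [
    "INFO:tensorflow:time:6.841992",
    "INFO:tensorflow:time:6.987753",
    "INFO:tensorflow:time:6.675498",
    "INFO:tensorflow:time:7.539548",
    "INFO:tensorflow:time:6.667554",
    "INFO:tensorflow:time:7.178711",
    "INFO:tensorflow:time:6.959177",
]
BATCH_SIZE = 128  # num_samples_in_batch reported in the same log

TIME_RE = re.compile(r"time:([0-9.]+)")


def step_times(lines):
    """Extract the per-step times (assumed to be seconds) from log lines."""
    times = []
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


times = step_times(LOG_LINES)
mean_step_seconds = sum(times) / len(times)
examples_per_second = BATCH_SIZE / mean_step_seconds

print("mean seconds per step: {:.2f}".format(mean_step_seconds))
print("examples per second:   {:.1f}".format(examples_per_second))
```

On these seven steps that works out to about 7 s per step, or roughly 18 examples per second at batch size 128.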
system info:
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 40% 65C P2 81W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:03:00.0 Off | N/A |
| 42% 68C P2 84W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 45% 72C P2 95W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:84:00.0 Off | N/A |
| 45% 73C P2 86W / 250W | 11841MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
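As a rough check on the numbers in this report, the time values in the log above can be averaged to get the observed step time. This is only a minimal sketch: the helper name mean_step_time is made up for illustration, and the log excerpt is copied from the lines quoted above.

```python
import re

# Excerpt of the "time:" lines from the training log quoted above.
LOG = """\
INFO:tensorflow:time:7.539548
INFO:tensorflow:time:6.667554
INFO:tensorflow:time:7.178711
INFO:tensorflow:time:6.959177
"""


def mean_step_time(log_text):
    """Average the per-step wall times reported as 'time:<seconds>'."""
    times = [float(m) for m in re.findall(r"time:([0-9.]+)", log_text)]
    if not times:
        raise ValueError("no 'time:' entries found")
    return sum(times) / len(times)


print("mean seconds per step: %.2f" % mean_step_time(LOG))
```

For these four entries the mean is a little over 7 seconds per step, which matches the gap the reporter describes relative to the roughly 1 s per step they expected.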
My command is:
bazel-bin/lingvo/trainer --enable_asserts=false --run_locally=cpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/librispeech/log --logtostderr
The output log is:
I0308 16:13:45.249138 140304451606272 trainer.py:521] step: 905 fraction_of_correct_next_step_preds:0.0082426639 fraction_of_correct_next_step_preds/logits:0.0082426639 log_pplx:4.96068 log_pplx/logits:4.96068 loss:4.96068 loss/logits:4.96068 num_samples_in_batch:48
I0308 16:13:49.943696 140304032200448 trainer.py:371] Steps/second: 0.014225, Examples/second: 0.731300
I0308 16:13:59.952753 140304032200448 trainer.py:371] Steps/second: 0.014223, Examples/second: 0.731181
I0308 16:14:09.964931 140304032200448 trainer.py:371] Steps/second: 0.014221, Examples/second: 0.731061
I0308 16:14:19.976296 140304032200448 trainer.py:371] Steps/second: 0.014218, Examples/second: 0.730942
I0308 16:14:29.987987 140304032200448 trainer.py:371] Steps/second: 0.014216, Examples/second: 0.730823
I0308 16:14:39.996059 140304032200448 trainer.py:371] Steps/second: 0.014214, Examples/second: 0.730704
I0308 16:14:50.002638 140304032200448 trainer.py:371] Steps/second: 0.014211, Examples/second: 0.730585
I0308 16:15:00.011704 140304032200448 trainer.py:371] Steps/second: 0.014209, Examples/second: 0.730466
I0308 16:15:03.377721 140304451606272 trainer.py:521] step: 906 fraction_of_correct_next_step_preds:0.003683995 fraction_of_correct_next_step_preds/logits:0.003683995 log_pplx:4.9685979 log_pplx/logits:4.9685979 loss:4.9685979 loss/logits:4.9685979 num_samples_in_batch:48
I0308 16:15:10.020848 140304032200448 trainer.py:371] Steps/second: 0.014223, Examples/second: 0.731128
I0308 16:15:20.030227 140304032200448 trainer.py:371] Steps/second: 0.014221, Examples/second: 0.731009
I0308 16:15:30.039532 140304032200448 trainer.py:371] Steps/second: 0.014218, Examples/second: 0.730890
I0308 16:15:40.050379 140304032200448 trainer.py:371] Steps/second: 0.014216, Examples/second: 0.730771
I0308 16:15:50.057887 140304032200448 trainer.py:371] Steps/second: 0.014214, Examples/second: 0.730652
I0308 16:16:00.066382 140304032200448 trainer.py:371] Steps/second: 0.014211, Examples/second: 0.730533
I0308 16:16:10.077025 140304032200448 trainer.py:371] Steps/second: 0.014209, Examples/second: 0.730414
I0308 16:16:18.414843 140304451606272 trainer.py:521] step: 907 fraction_of_correct_next_step_preds:0.0043800189 fraction_of_correct_next_step_preds/logits:0.0043800189 log_pplx:4.9653683 log_pplx/logits:4.9653683 loss:4.9653683 loss/logits:4.9653683 num_samples_in_batch:48
1. My task runs as shown above, but after some steps the loss does not seem to be improving. Is that expected?
2. Do you have a recipe for ASR decoding?
Thanks
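To put the log above in perspective, the wall-clock time per step can be read off the timestamps of the "step:" lines. The sketch below does that with the three timestamps quoted in this report; the helper name seconds_between_steps is hypothetical. The command in the report passes --run_locally=cpu, so slow steps are plausible, although the log alone does not confirm the cause.

```python
from datetime import datetime

# (step, glog timestamp) pairs copied from the "trainer.py:521" lines above.
STEP_LOG = [
    (905, "16:13:45.249138"),
    (906, "16:15:03.377721"),
    (907, "16:16:18.414843"),
]


def seconds_between_steps(step_log):
    """Return wall-clock seconds elapsed between consecutive logged steps."""
    stamps = [datetime.strptime(ts, "%H:%M:%S.%f") for _, ts in step_log]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


deltas = seconds_between_steps(STEP_LOG)
print(deltas)

# The reported "Steps/second: 0.014225" is a running average; its inverse
# gives the average number of seconds per step over the run so far.
print(1.0 / 0.014225)
```

Both views agree on the order of magnitude: each step takes more than a minute, so only a handful of steps complete in the window shown and the loss values there say little about convergence.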
Is there any plan to add the seq2seq models for the text-to-speech task as well?
When I try to run the command bazel build -c opt //lingvo:trainer, I get the following error. If I am doing something wrong, please correct me. I am running the command from my home directory. This is a continuation of issue #48: before running the main command you specified there, I ran the command above after reading the other README.
ERROR:
Starting local Bazel server and connecting to it...
INFO: Analysed target //lingvo:trainer (35 packages loaded, 4188 targets configured).
INFO: Found 1 target...
ERROR: /home/guest/lingvo/lingvo/core/ops/BUILD:67:1: undeclared inclusion(s) in rule '//lingvo/core/ops:ascii_tokenizer':
this rule is missing dependency declarations for the following files included by 'lingvo/core/ops/ascii_tokenizer.cc':
'/usr/include/x86_64-linux-gnu/gnu/stubs-64.h'
Target //lingvo:trainer failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 6.231s, Critical Path: 1.86s
INFO: 0 processes.
FAILED: Build did NOT complete successfully
root@e3a29cc3bd18:/tmp/lingvo# bazel test -c opt //lingvo:trainer_test //lingvo:models_test
Extracting Bazel installation...
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/root/.cache/bazel/_bazel_root/install/792a28b07894763eaa2bd870f8776b23/_embedded_binaries/A-server.jar) to field java.lang.String.value
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
ERROR: The 'test' command is only supported from within a workspace.
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
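Bazel only accepts commands such as build and test when the current directory is inside a workspace, which it recognizes by a WORKSPACE file in that directory or one of its parents. The error above suggests that /tmp/lingvo did not contain one, though that is an inference from the message rather than something confirmed in this thread. A minimal sketch for checking this before invoking Bazel, using a hypothetical helper named find_workspace_root:

```python
import os


def find_workspace_root(start_dir):
    """Walk upward from start_dir and return the first directory that
    contains a WORKSPACE file, or None if there is none.

    Hypothetical helper for illustration; it is not part of Bazel or Lingvo.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.isfile(os.path.join(current, "WORKSPACE")):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


if __name__ == "__main__":
    root = find_workspace_root(os.getcwd())
    print(root if root else "not inside a Bazel workspace")
```

If the script reports no workspace, changing into the checked-out repository root (the directory that holds WORKSPACE) before running bazel test is the usual remedy.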
Hi. I am attempting to reproduce the ASR LibriSpeech task using Lingvo.
My hardware consists of 16 GPUs (a cluster of machines with 4 1080 Ti GPUs each), and the storage is shared over NFS.
I changed the batch size from 96,48 to 32 because of OOM errors.
I have been training the LibriSpeech 960 grapheme baseline for 5 days.
(Variational noise is currently turned off.)
I read in your report that training needs about 11 days, but that does not look achievable in my case.
After about 5 days it is still under 40k steps, and the WER is stuck at about 11%.
Is this a normal speed for my cluster, or do I have a problem with the network or something else?
Thanks for your insight.
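The throughput described in this report can be turned into a per-step time with simple arithmetic. The sketch below uses the figures stated above (under 40k steps in about 5 days); the function name seconds_per_step is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_step(total_steps, days):
    """Average wall-clock seconds spent per training step."""
    return days * SECONDS_PER_DAY / float(total_steps)


# 40k is an upper bound on the step count reported above, so the real
# per-step time is at least this value.
print(seconds_per_step(40000, 5))
```

That works out to at least 10.8 seconds per step, which gives a concrete number to compare against single-machine step times when judging whether the network or the shared storage is the bottleneck.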
Hi, I am using Ubuntu 18, my Bazel version is 0.23.2 and my GCC version is 7.3.0. I am getting the following error when trying to execute the command "bazel build -c opt //lingvo:trainer"
Error:
bazel_error.txt
It failed while installing kiwisolver.
Building wheel for kiwisolver (setup.py): started
Building wheel for kiwisolver (setup.py): finished with status 'error'
ERROR: Complete output from command /usr/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/tmp/pip-install-AG6Mos/kiwisolver/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-OM6uPW --python-tag cp27:
ERROR: running bdist_wheel
running build
running build_ext
building 'kiwisolver' extension
creating build
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/py
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/kiwisolver.cpp -o build/temp.linux-x86_64-2.7/py/kiwisolver.o
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from ./kiwi/constraint.h:13:0,
from ./kiwi/kiwi.h:9,
from py/kiwisolver.cpp:9:
./kiwi/strength.h:30:14: warning: 'kiwi::strength::strong' defined but not used [-Wunused-variable]
const double strong = create( 1.0, 0.0, 0.0 );
^
./kiwi/strength.h:32:14: warning: 'kiwi::strength::medium' defined but not used [-Wunused-variable]
const double medium = create( 0.0, 1.0, 0.0 );
^
./kiwi/strength.h:34:14: warning: 'kiwi::strength::weak' defined but not used [-Wunused-variable]
const double weak = create( 0.0, 0.0, 1.0 );
^
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/constraint.cpp -o build/temp.linux-x86_64-2.7/py/constraint.o
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/expression.cpp -o build/temp.linux-x86_64-2.7/py/expression.o
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/solver.cpp -o build/temp.linux-x86_64-2.7/py/solver.o
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/strength.cpp -o build/temp.linux-x86_64-2.7/py/strength.o
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
};
^
py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
py/strength.cpp:92:1: warning: deprecated conversion from string constant to 'char*' [-Wwrite-strings]
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/term.cpp -o build/temp.linux-x86_64-2.7/py/term.o
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I. -I/usr/include/python2.7 -c py/variable.cpp -o build/temp.linux-x86_64-2.7/py/variable.o
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
creating build/lib.linux-x86_64-2.7
c++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wl,-Bsymbolic-functions -Wl,-z,relro -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/py/kiwisolver.o build/temp.linux-x86_64-2.7/py/constraint.o build/temp.linux-x86_64-2.7/py/expression.o build/temp.linux-x86_64-2.7/py/solver.o build/temp.linux-x86_64-2.7/py/strength.o build/temp.linux-x86_64-2.7/py/term.o build/temp.linux-x86_64-2.7/py/variable.o -o build/lib.linux-x86_64-2.7/kiwisolver.so
c++: error: unrecognized command line option '-Wdate-time'
c++: error: unrecognized command line option '-fstack-protector-strong'
c++: error: unrecognized command line option '-Wdate-time'
c++: error: unrecognized command line option '-fstack-protector-strong'
error: command 'c++' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for kiwisolver
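In the log above, the object files are compiled with x86_64-linux-gnu-gcc, but the final link step invokes c++, which rejects -Wdate-time and -fstack-protector-strong. One possible explanation (an assumption, not confirmed in this thread) is that c++ on the PATH resolves to a different, older compiler. The sketch below, written for Python 3 with a hypothetical helper named compiler_versions, prints what each name resolves to so the two can be compared.

```python
import shutil
import subprocess


def compiler_versions(names=("c++", "x86_64-linux-gnu-gcc")):
    """Map each compiler name to the first line of `<name> --version`,
    or None when the executable is not on PATH or cannot be run."""
    versions = {}
    for name in names:
        path = shutil.which(name)
        if path is None:
            versions[name] = None
            continue
        try:
            result = subprocess.run(
                [path, "--version"],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
            )
        except OSError:
            versions[name] = None
            continue
        lines = result.stdout.splitlines()
        versions[name] = lines[0] if lines else None
    return versions


if __name__ == "__main__":
    for name, version in compiler_versions().items():
        print("%s: %s" % (name, version))
```

If the two entries report different compilers or versions, that mismatch would be the first thing to investigate before retrying the kiwisolver build.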
Which file contains the code for MWER?
Hi @drpngx, fraction_of_correct_next_step_preds decreases to zero at around step 20k. Is this behavior abnormal?
Hello, when I set session_config.gpu_options.allow_growth = True
in the function SessionConfig, it causes a "Segmentation fault".