Comments (7)
@jonathanasdf
Hello, when I run
/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4 --worker_gpus=4 --worker_split_size=4
I get the error below. Can you tell me how to resolve it?
I0530 07:26:44.508102 140140756334336 trainer.py:305] Load from checkpoint /tmp/mnist/log/train/ckpt-00000000.
I0530 07:26:44.509429 140140756334336 saver.py:1276] Restoring parameters from /tmp/mnist/log/train/ckpt-00000000
I0530 07:26:45.732462 140140747941632 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
	 [[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 421, in Start
    self._RunLoop('trainer', self._Loop)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
    loop_func(*loop_args)
Traceback for above exception (most recent call last):
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 455, in _WaitTillInit
    global_step = sess.run(self._model.global_step)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 948, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1171, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_call
    raise type(e)(node_def, op, message)
from lingvo.
Please try
bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --controller_gpus=4 --worker_gpus=4 --worker_split_size=4
(Having to specify controller_gpus is a bug that we will fix)
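For anyone comparing the two invocations, here is a minimal sketch that parses both command lines from this thread and reports which flags differ. `parse_flags` is a hypothetical helper written for this illustration and is not part of lingvo; it only assumes the `--name[=value]` flag style seen in the commands above.

```python
import shlex
from collections import Counter

# Both command lines are copied verbatim from this thread.
ORIGINAL = ("/lingvo/trainer --run_locally=gpu --mode=sync "
            "--model=lm.one_billion_wds.OneBWdsGPipeTransformer "
            "--logdir=/tmp/mnist/log --logtostderr "
            "--worker_split_size=4 --worker_gpus=4 --worker_split_size=4")
SUGGESTED = ("bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync "
             "--model=lm.one_billion_wds.OneBWdsGPipeTransformer "
             "--logdir=/tmp/mnist/log --logtostderr "
             "--controller_gpus=4 --worker_gpus=4 --worker_split_size=4")


def parse_flags(command):
    """Hypothetical helper: return (flag-name counts, {name: last value})."""
    counts = Counter()
    values = {}
    for token in shlex.split(command)[1:]:  # skip the binary path
        if not token.startswith("--"):
            continue
        name, _, value = token[2:].partition("=")
        counts[name] += 1
        values[name] = value
    return counts, values


orig_counts, orig_values = parse_flags(ORIGINAL)
new_counts, new_values = parse_flags(SUGGESTED)

# Flags passed more than once in the original command.
duplicated = sorted(n for n, c in orig_counts.items() if c > 1)
# Flags present only in the suggested command, and only in the original.
added = sorted(set(new_values) - set(orig_values))
removed = sorted(set(orig_values) - set(new_values))

print("duplicated in original:", duplicated)
print("added in suggestion:", added)
print("removed in suggestion:", removed)
```

The comparison shows that the original command passed `--worker_split_size` twice and never set `--controller_gpus`, which is the flag the suggested command adds.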
from lingvo.
There also seems to be a failing assertion with that model right now; we will look into that too.
from lingvo.
Hi, I tried the command you gave in the comment above. It got further, but at some point it failed with Aborted (core dumped). I am attaching the error log:
**Error log:**
error.txt
from lingvo.
Yes, there is some error with the model configuration right now. We are sorry about the problem and will update this issue when it is resolved.
from lingvo.
VOCAB_SIZE was set incorrectly. We will fix it ASAP.
from lingvo.
This issue should now be fixed. Please close it if there are no further problems.
from lingvo.