Comments (7)

fangelyuan commented on July 4, 2024

@jonathanasdf
Hello, when I run
/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --worker_split_size=4 --worker_gpus=4 --worker_split_size=4
I get the error below. Can you tell me how to resolve it?

I0530 07:26:44.508102 140140756334336 trainer.py:305] Load from checkpoint /tmp/mnist/log/train/ckpt-00000000.
I0530 07:26:44.509429 140140756334336 saver.py:1276] Restoring parameters from /tmp/mnist/log/train/ckpt-00000000
I0530 07:26:45.732462 140140747941632 retry.py:68] Retry: caught exception: _WaitTillInit while running FailedPreconditionError: Attempting to use uninitialized value global_step
[[{{node _send_global_step_0}}]]
. Call failed at (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 421, in Start
    self._RunLoop('trainer', self._Loop)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 196, in _RunLoop
    loop_func(*loop_args)
Traceback for above exception (most recent call last):
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/core/retry.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/home/ywx510667/lingvo-master/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 455, in _WaitTillInit
    global_step = sess.run(self._model.global_step)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 948, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1171, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_call
    raise type(e)(node_def, op, message)

from lingvo.

jonathanasdf commented on July 4, 2024

Please try

bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformer --logdir=/tmp/mnist/log --logtostderr --controller_gpus=4 --worker_gpus=4 --worker_split_size=4

(Having to specify controller_gpus is a bug that we will fix.)
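
The traceback in the first comment shows the trainer thread reading global_step inside a retry wrapper (retry.py) and failing with FailedPreconditionError while the variable is still uninitialized. The wait-and-retry pattern can be sketched roughly as below. This is an illustration only, not Lingvo's actual code: `NotInitializedError`, `FakeSession`, `retry`, and `wait_till_init` are hypothetical names, and no TensorFlow is involved.

```python
import functools
import time


class NotInitializedError(Exception):
    """Hypothetical stand-in for TensorFlow's FailedPreconditionError."""


def retry(max_attempts=5, delay_secs=0.0):
    """Retries the wrapped function while it raises NotInitializedError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except NotInitializedError as e:
                    print("Retry %d: caught exception: %s" % (attempt, e))
                    if attempt == max_attempts:
                        raise  # Give up: nothing initialized the variable.
                    time.sleep(delay_secs)
        return wrapper
    return decorator


class FakeSession(object):
    """Hypothetical session: global_step is unreadable for the first N reads."""

    def __init__(self, ready_after):
        self._reads = 0
        self._ready_after = ready_after

    def read_global_step(self):
        self._reads += 1
        if self._reads <= self._ready_after:
            raise NotInitializedError(
                "Attempting to use uninitialized value global_step")
        return 0  # Value of a freshly initialized step counter.


@retry(max_attempts=5)
def wait_till_init(sess):
    """Blocks (via retries) until global_step can be read."""
    return sess.read_global_step()


if __name__ == "__main__":
    # Succeeds on the third read, after two retries.
    print(wait_till_init(FakeSession(ready_after=2)))
```

In this sketch, if nothing ever initializes the variable, every attempt fails and the error is eventually re-raised, which is consistent with the Retry line in the log.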

jonathanasdf commented on July 4, 2024

There also seems to be a failing assertion with that model right now; we will look into that too.

Raviteja1996 commented on July 4, 2024

Hi, I tried the command from the comment above. It got further, but at some point it terminated with Aborted (core dumped). I am attaching the error log:
**Error log:** error.txt

jonathanasdf commented on July 4, 2024

Yes, there is some error with the model configuration right now. We are sorry about the problem and will update this issue when it is resolved.

bignamehyp commented on July 4, 2024

VOCAB_SIZE was set incorrectly. We will fix it ASAP.

bignamehyp commented on July 4, 2024

This issue should now be fixed. Please close it if there are no further problems.
