Comments (10)

drpngx commented on July 24, 2024

Please take a look at the Docker test script. The tasks/mt README file also explains the basics.

from lingvo.

drpngx commented on July 24, 2024

To run on a single machine, use --run_locally=gpu. Sync/async can be specified with --mode=sync or --mode=async.
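As a rough sketch, the flags above can be assembled into a trainer command line like this. The helper name build_trainer_cmd is hypothetical; the binary path, model name, and log directory are the ones that appear in the command posted later in this thread, and only flags mentioned in the thread are used.

```python
def build_trainer_cmd(model, logdir, mode="sync", run_locally="gpu"):
    """Assemble a lingvo trainer command from flags mentioned in this thread.

    build_trainer_cmd is a hypothetical helper, not part of lingvo.
    """
    if mode not in ("sync", "async"):
        raise ValueError("mode must be 'sync' or 'async'")
    return [
        "bazel-bin/lingvo/trainer",
        f"--run_locally={run_locally}",  # single-machine run
        f"--mode={mode}",                # sync or async training
        f"--model={model}",
        f"--logdir={logdir}",
        "--logtostderr",
    ]


cmd = build_trainer_cmd(
    model="asr.librispeech.Librispeech960Base",
    logdir="/tmp/librispeech/log",
)
print(" ".join(cmd))
```

The list form can be passed directly to subprocess.run(cmd) if you want to launch the job from a script.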

drpngx commented on July 24, 2024

Yes, that's expected. Just wait until you see something like Write summary @1 or a line containing step:.

fanlu commented on July 24, 2024

The training thread exits.
This is the error message:

INFO:tensorflow:Save checkpoint done: /tmp/librispeech/log/train/ckpt-00000000
INFO:tensorflow:Steps/second: 0.000000, Examples/second: 0.000000
2018-11-20 03:02:10.099328: I ./lingvo/core/ops/input_common.h:54] Create RecordProcessor
2018-11-20 03:02:10.105195: I ./lingvo/core/ops/input_common.h:57] Create yielder
2018-11-20 03:02:10.106866: I lingvo/core/ops/record_yielder.cc:98] 0x7fea12b16210 Record yielder start
2018-11-20 03:02:10.106930: I lingvo/core/ops/record_yielder.cc:100] Randomly seed RecordYielder.
2018-11-20 03:02:10.106962: I ./lingvo/core/ops/input_common.h:65] Create batcher
2018-11-20 03:02:10.106991: I lingvo/core/ops/record_yielder.cc:145] Epoch 1 /tmp/librispeech/train/train.tfrecords-*
2018-11-20 03:02:11.617758: W ./lingvo/core/ops/tokenizer_op_headers.h:68] Too long target 308 DOCTOR CARR'S OBJECTIONS HIS RELUCTANCE TO PART WITH HER MELTED BEFORE THE RADIANCE OF HER SATISFACTION HE HAD NO IDEA THAT KATY WOULD CARE SO MUCH ABOUT IT AFTER ALL IT WAS A GREAT CHANCE PERHAPS THE ONLY ONE OF THE SORT THAT SHE WOULD EVER HAVE MISSUS ASHE COULD WELL AFFORD TO GIVE KATY THIS TREAT HE KNEW
2018-11-20 03:02:11.757344: W ./lingvo/core/ops/tokenizer_op_headers.h:68] Too long target 310 THE JEW AND THE WOMAN CAN LOOK AFTER EACH OTHER HE ADDED ROUGHLY UNTIL WE CAN SEND SOMEBODY FOR THEM IN THE MORNING THEY CAN'T RUN AWAY VERY FAR IN THEIR PRESENT CONDITION AND WE CANNOT BE TROUBLED WITH THEM JUST NOW CHAUVELIN HAD NOT GIVEN UP ALL HOPE HIS MEN HE KNEW WERE SPURRED ON BY THE HOPE OF THE REWARD
2018-11-20 03:02:14.261763: E tensorflow/core/common_runtime/executor.cc:624] Executor failed to create kernel. Not found: No registered 'AssertSameDim0' OpKernel for GPU devices compatible with node {{node AssertSameDim0}}
	.  Registered:  device='CPU'

	 [[{{node AssertSameDim0}}]]
2018-11-20 03:02:14.262749: E tensorflow/core/common_runtime/executor.cc:624] Executor failed to create kernel. Not found: No registered 'AssertSameDim0' OpKernel for GPU devices compatible with node {{node AssertSameDim0}}
	.  Registered:  device='CPU'

	 [[{{node AssertSameDim0}}]]
2018-11-20 03:02:14.321658: I lingvo/core/ops/record_yielder.cc:119] 0x7fea12b16210 Record yielder exit
INFO:tensorflow:trainer exception: NotFoundError()

ERROR:tensorflow:Traceback (most recent call last):
ERROR:tensorflow:  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 165, in _RunLoop
ERROR:tensorflow:    loop_func(*args)
ERROR:tensorflow:  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 472, in _Loop
ERROR:tensorflow:    model_task.trainer_verbose_tensors,
ERROR:tensorflow:  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
ERROR:tensorflow:    run_metadata_ptr)
ERROR:tensorflow:  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
ERROR:tensorflow:    feed_dict_tensor, options, run_metadata)
ERROR:tensorflow:  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
ERROR:tensorflow:    run_metadata)
ERROR:tensorflow:  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
ERROR:tensorflow:    raise type(e)(node_def, op, message)
ERROR:tensorflow:NotFoundError: No registered 'AssertSameDim0' OpKernel for GPU devices compatible with node {{node AssertSameDim0}}
ERROR:tensorflow:	.  Registered:  device='CPU'
ERROR:tensorflow:
ERROR:tensorflow:	 [[{{node AssertSameDim0}}]]
ERROR:tensorflow:	 [[node fprop/librispeech/tower_0_0/enc/brnn_L0/Forward_avUOv5sSe4o (defined at /tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/recurrent.py:796) ]]
ERROR:tensorflow:	 [[{{node ArithmeticOptimizer/AddOpsRewrite_add_1_G503}}]]
ERROR:tensorflow:
INFO:tensorflow:Steps/second: 0.056936, Examples/second: 2.732912
INFO:tensorflow:Write summary @1
2018-11-20 03:02:29.422366: I ./lingvo/core/ops/input_common.h:54] Create RecordProcessor
2018-11-20 03:02:29.426912: I ./lingvo/core/ops/input_common.h:57] Create yielder
2018-11-20 03:02:29.428410: I lingvo/core/ops/record_yielder.cc:98] 0x7ff0073da110 Record yielder start
2018-11-20 03:02:29.428473: I lingvo/core/ops/record_yielder.cc:100] Randomly seed RecordYielder.
2018-11-20 03:02:29.428505: I ./lingvo/core/ops/input_common.h:65] Create batcher
2018-11-20 03:02:29.428520: I lingvo/core/ops/record_yielder.cc:145] Epoch 1 /tmp/librispeech/train/train.tfrecords-*

drpngx commented on July 24, 2024

Can you run with --enable_asserts=false for now?

fanlu commented on July 24, 2024

It works with --enable_asserts=false, but it only uses one GPU card.
Could you please add a README for the ASR task?

drpngx commented on July 24, 2024

Could you report how you started the job? What command? Single machine?

fanlu commented on July 24, 2024

This is the command I used:

bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=asr.librispeech.Librispeech960Base --logdir=/tmp/librispeech/log --logtostderr --enable_asserts=false

I found another problem: the loss does not decrease from the start.

2018-11-20 04:24:40.685925: W ./lingvo/core/ops/tokenizer_op_headers.h:68] Too long target 301 JUST SITTING AND WATCHING WAS FRUSTRATING PARTICULARLY WHEN IT WAS A DESPERATE EMERGENCY HE DIDN'T OVERVALUE HIS WORTH BUT HE WAS SURE THERE WAS ALWAYS ROOM FOR ANOTHER GUN BY THE TIME HE HAD DRAGGED HIMSELF DOWN TO THE STREET LEVEL A TURBO TRUCK HAD SLAMMED TO A STOP IN FRONT OF THE LOADING PLATFORM
INFO:tensorflow:Steps/second: 0.090317, Examples/second: 4.592744
INFO:tensorflow:Write summary @101
INFO:tensorflow:step:   101 fraction_of_correct_next_step_preds:0.0065470417 fraction_of_correct_next_step_preds/logits:0.0065470417 log_pplx:4.9673233 log_pplx/logits:4.9673233 loss:4.9673233 loss/logits:4.9673233 num_samples_in_batch:48
INFO:tensorflow:step:   102 fraction_of_correct_next_step_preds:0.010749019 fraction_of_correct_next_step_preds/logits:0.010749019 log_pplx:4.9570217 log_pplx/logits:4.9570217 loss:4.9570217 loss/logits:4.9570217 num_samples_in_batch:48
INFO:tensorflow:step:   103 fraction_of_correct_next_step_preds:0.014047265 fraction_of_correct_next_step_preds/logits:0.014047265 log_pplx:4.9854736 log_pplx/logits:4.9854736 loss:4.9854736 loss/logits:4.9854736 num_samples_in_batch:96
INFO:tensorflow:step:   104 fraction_of_correct_next_step_preds:0.010524165 fraction_of_correct_next_step_preds/logits:0.010524165 log_pplx:4.9925981 log_pplx/logits:4.9925981 loss:4.9925981 loss/logits:4.9925981 num_samples_in_batch:48
INFO:tensorflow:step:   105 fraction_of_correct_next_step_preds:0.0068986509 fraction_of_correct_next_step_preds/logits:0.0068986509 log_pplx:5.0027204 log_pplx/logits:5.0027204 loss:5.0027204 loss/logits:5.0027204 num_samples_in_batch:48
INFO:tensorflow:step:   106 fraction_of_correct_next_step_preds:0.010397326 fraction_of_correct_next_step_preds/logits:0.010397326 log_pplx:4.9780407 log_pplx/logits:4.9780407 loss:4.9780407 loss/logits:4.9780407 num_samples_in_batch:48
INFO:tensorflow:Write summary done: step 101
INFO:tensorflow:step:   101, steps/sec: 0.09, examples/sec: 4.59
INFO:tensorflow:Steps/second: 0.089864, Examples/second: 4.595674
INFO:tensorflow:step:   107 fraction_of_correct_next_step_preds:0.010022573 fraction_of_correct_next_step_preds/logits:0.010022573 log_pplx:5.009697 log_pplx/logits:5.009697 loss:5.009697 loss/logits:5.009697 num_samples_in_batch:48
INFO:tensorflow:Steps/second: 0.089948, Examples/second: 4.597317
INFO:tensorflow:Save checkpoint
WARNING:tensorflow:Issue encountered when serializing __batch_norm_update_dict.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'dict' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing __model_split_id_stack.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'list' object has no attribute 'name'
INFO:tensorflow:Save checkpoint done: /tmp/librispeech/log/train/ckpt-00000108
INFO:tensorflow:step:   108 fraction_of_correct_next_step_preds:0.0051525962 fraction_of_correct_next_step_preds/logits:0.0051525962 log_pplx:4.9994984 log_pplx/logits:4.9994984 loss:4.9994984 loss/logits:4.9994984 num_samples_in_batch:48
INFO:tensorflow:Steps/second: 0.090030, Examples/second: 4.598981
INFO:tensorflow:step:   109 fraction_of_correct_next_step_preds:0.011224868 fraction_of_correct_next_step_preds/logits:0.011224868 log_pplx:4.9820132 log_pplx/logits:4.9820132 loss:4.9820132 loss/logits:4.9820132 num_samples_in_batch:48
INFO:tensorflow:Steps/second: 0.090111, Examples/second: 4.600591
INFO:tensorflow:step:   110 fraction_of_correct_next_step_preds:0.0067334538 fraction_of_correct_next_step_preds/logits:0.0067334538 log_pplx:5.0105591 log_pplx/logits:5.0105591 loss:5.0105591 loss/logits:5.0105591 num_samples_in_batch:48
INFO:tensorflow:Steps/second: 0.090191, Examples/second: 4.602178
INFO:tensorflow:step:   111 fraction_of_correct_next_step_preds:0.0084050633 fraction_of_correct_next_step_preds/logits:0.0084050633 log_pplx:4.9993515 log_pplx/logits:4.9993515 loss:4.9993515 loss/logits:4.9993515 num_samples_in_batch:48
INFO:tensorflow:Steps/second: 0.090269, Examples/second: 4.603743
INFO:tensorflow:step:   112 fraction_of_correct_next_step_preds:0.0077442196 fraction_of_correct_next_step_preds/logits:0.0077442196 log_pplx:4.9817152 log_pplx/logits:4.9817152 loss:4.9817152 loss/logits:4.9817152 num_samples_in_batch:48
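To check whether the loss is really flat rather than just noisy, the step: lines from a log like the one above can be parsed and compared. This is a minimal sketch; parse_losses and the regular expression are my own, the sample lines are abbreviated copies of two lines from the log (some fields dropped), and the line format is assumed to match what is shown here.

```python
import re

# Matches trainer lines such as "step:   101 ... loss:4.9673233 ...".
# Summary lines like "step:   101, steps/sec: 0.09" carry no loss and are skipped.
STEP_RE = re.compile(r"step:\s*(\d+)\s.*?\sloss:(\S+)")


def parse_losses(log_text):
    """Return a list of (step, loss) pairs found in the log text."""
    pairs = []
    for line in log_text.splitlines():
        m = STEP_RE.search(line)
        if m:
            pairs.append((int(m.group(1)), float(m.group(2))))
    return pairs


# Abbreviated sample taken from the log in this comment.
SAMPLE = """\
INFO:tensorflow:step:   101 log_pplx:4.9673233 loss:4.9673233 loss/logits:4.9673233 num_samples_in_batch:48
INFO:tensorflow:step:   101, steps/sec: 0.09, examples/sec: 4.59
INFO:tensorflow:step:   112 log_pplx:4.9817152 loss:4.9817152 loss/logits:4.9817152 num_samples_in_batch:48
"""

losses = parse_losses(SAMPLE)
first, last = losses[0], losses[-1]
print(f"step {first[0]}: {first[1]:.4f} -> step {last[0]}: {last[1]:.4f}")
```

Running the same function over the full stderr output gives the whole curve, which is easier to judge than a handful of consecutive steps.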

drpngx commented on July 24, 2024

/CC @rprabhavalkar

We have tuned this for our setup, which is async training on 8 machines with 4 GPUs each, so the best settings for a single machine might be different. You can try lowering the learning rate. You can set this in the Task() function, with something like p.train.learning_rate = 1e-6 maybe?
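A sketch of what that override pattern could look like. To keep it runnable without lingvo installed, the base class below is a hypothetical stand-in for the registered Librispeech960Base params class, not lingvo code; only the Task() method name and the p.train.learning_rate attribute come from this thread, and the 1e-6 value is just the suggestion above, not a tuned setting.

```python
from types import SimpleNamespace


class Librispeech960BaseStandIn:
    """Hypothetical stand-in for the real model params class (not lingvo code)."""

    def Task(self):
        # The real default learning rate is not stated in this thread,
        # so the stand-in leaves it unset.
        return SimpleNamespace(train=SimpleNamespace(learning_rate=None))


class Librispeech960BaseLowLR(Librispeech960BaseStandIn):
    """Same configuration, with a lower learning rate."""

    def Task(self):
        p = super().Task()
        p.train.learning_rate = 1e-6  # value suggested above; adjust as needed
        return p


p = Librispeech960BaseLowLR().Task()
print(p.train.learning_rate)
```

In a real setup the subclass would derive from the actual params class and would have to be registered the same way as the original so that --model= can find it.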

fanlu commented on July 24, 2024

Could you give me a command to run this job asynchronously on a single machine or on multiple machines?
