alibaba / fastnn

FastNN provides distributed training examples that use EPL.

License: Apache License 2.0

Python 97.42% Makefile 0.11% Shell 2.46%
distributed deep-learning models pai

fastnn's Introduction

EPL Examples

This repo contains distributed training examples that use Easy Parallel Library (EPL).

Install dependent libraries

  1. Install requirements
     pip install -r requirements.txt
  2. Install EPL

Refer to the EPL installation document for detailed instructions.

fastnn's People

Contributors

adoda, alibaba-oss, charles9304, seaofocean

fastnn's Issues

Running the BERT example fails with OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error. How can this be resolved?

2023-09-27 09:21:54.582250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:worker/replica:0/task:1/device:GPU:0 with 5211 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:0b:00.0, compute capability: 7.0)
2023-09-27 09:21:54.583242: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:54368, 1 -> localhost:45069}
2023-09-27 09:21:54.587690: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:45069
2023-09-27 09:21:59.865339: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-09-27 09:21:59.865421: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: unhandled system error
[[{{node BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater}}]]
2023-09-27 09:21:59.865506: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-09-27 09:21:59.865408: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:1:
unhandled system error
[[node BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater':
File "run_squad_dp.py", line 33, in <module>
tf.app.run()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/epl/FastNN/bert/run_squad.py", line 1255, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=hooks)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3025, in train
return super(TPUEstimator, self).train(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1193, in _train_model_default
return self._train_with_estimator_spec(estimator_spec, worker_hooks,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1478, in _train_with_estimator_spec
with training.MonitoredTrainingSession(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 581, in MonitoredTrainingSession
return MonitoredSession(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1010, in __init__
super(MonitoredSession, self).__init__(
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 319, in init
res = fn(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 639, in create_session
return self._get_session_manager().prepare_session(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 453, in run
assign_ops = _init_local_resources(self, fn)
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 416, in _init_local_resources
assign_ops = broadcast_variables()
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 354, in broadcast_variables
reduced_variables = comm.broadcast(bcast_variables)
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/collective_communicator.py", line 131, in broadcast
comm_pool = self.get_or_create_comm(comm_name, comm_spec, communication_op=Communicator.BROADCAST)
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/collective_communicator.py", line 78, in get_or_create_comm
comm = CommunicationPool(self.options.num_communicators,
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/communication_pool.py", line 41, in __init__
self.communicator_list = [
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/communication_pool.py", line 42, in <listcomp>
build_communicator('{}/group
{}'.format(comm_name, index), comm_spec)
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/options.py", line 309, in build_communicator
return Communicator.create(
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/base.py", line 98, in create
return impl(shared_name, devices=devices, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/nccl.py", line 79, in __init__
ops.GraphKeys.LOCAL_RESOURCES, self.build_resource())
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/nccl.py", line 101, in build_resource
self._create_op = self._handle.create(
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/nccl_ops.py", line 165, in create
return _ops.epl_nccl_communicator_creater(
File "<string>", line 1007, in epl_nccl_communicator_creater
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 792, in _apply_op_helper
op = g.create_op(op_type_name, inputs, dtypes=None, name=scope,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py", line 3356, in create_op
return self._create_op_internal(op_type, inputs, dtypes, input_types, name,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py", line 3418, in _create_op_internal
ret = Operation(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()

E0927 09:21:59.979781 140333610264384 error_handling.py:75] Error recorded from training_loop: From /job:worker/replica:0/task:1:
unhandled system error
[[node BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater':

Distributed run of resnet_split.py on two servers waits indefinitely

Environment: a container from the nvcr.io/nvidia/tensorflow:21.12-tf1-py3 image
Code: FastNN/resnet/resnet_split.py
Commands:
Server 1: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
Server 2: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh
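
The two commands above differ only in the task index inside TF_CONFIG. As a minimal sketch, assuming each machine is launched by hand, the JSON can be generated rather than typed, which keeps the worker list identical on every machine. The helper name make_tf_config is hypothetical and is not part of FastNN or EPL.

```python
import json


def make_tf_config(workers, index):
    """Return the TF_CONFIG JSON string for the worker at position `index`.

    `workers` is the ordered list of "host:port" addresses, which must be
    the same on every machine. `make_tf_config` is a hypothetical helper.
    """
    if not 0 <= index < len(workers):
        raise ValueError(
            "index %d is out of range for %d workers" % (index, len(workers))
        )
    config = {
        "cluster": {"worker": list(workers)},
        "task": {"type": "worker", "index": index},
    }
    # Compact separators reproduce the layout used in the commands above.
    return json.dumps(config, separators=(",", ":"))


# Addresses copied from the report above.
workers = ["172.20.21.181:55375", "172.20.21.189:55376"]
for i in range(len(workers)):
    print("TF_CONFIG='%s' bash scripts/train_split.sh" % make_tf_config(workers, i))
```

Each printed line is the command for one machine. Every worker must receive a distinct index that matches its own position in the worker list.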

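Output on server 1:
[image]
Output on server 2:
[image]

Server 1 printed the "still waiting" message only twice and then stopped, which suggests it received a reply from server 2, but execution did not continue past that point.
Additional note: BERT runs distributed in the same environment, so the servers can connect to each other and run distributed training normally.

Is this a problem with how I am running it, or does the code need to be modified?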

resnet example fails with an nccl_communicator error

Environment:

A container built on nvcr.io/nvidia/tensorflow:21.12-tf1-py3

Script:

The resnet script from FastNN

Launch commands

TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh

TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh

Error

2023-08-31 01:40:46.786721: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.397497: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.403631: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.433142: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
         [[{{node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater}}]]

Traceback (most recent call last):
  File "resnet_dp.py", line 92, in <module>
    run_model()
  File "resnet_dp.py", line 67, in run_model
    with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 581, in MonitoredTrainingSession
    return MonitoredSession(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1010, in __init__
    super(MonitoredSession, self).__init__(
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 319, in init
    res = fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 639, in create_session
    return self._get_session_manager().prepare_session(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 453, in run
    assign_ops = _init_local_resources(self, fn)
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 423, in _init_local_resources
    fn(self, local_resources_init_op)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
         [[node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
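
The message "unhandled system error" by itself does not say which system call failed. One common first diagnostic step is to rerun with NCCL's own logging enabled. The sketch below builds such an environment, assuming the NCCL library loaded by EPL honors the standard NCCL_DEBUG and NCCL_SOCKET_IFNAME environment variables. The helper name nccl_debug_env and the interface name eth0 are hypothetical, and nothing here is confirmed as a fix for the errors reported above.

```python
import os


def nccl_debug_env(base_env=None, socket_ifname=None):
    """Return a copy of the environment with NCCL logging turned on.

    `nccl_debug_env` is a hypothetical helper. NCCL_DEBUG and
    NCCL_SOCKET_IFNAME are standard NCCL environment variables; whether
    they expose the cause of this particular failure is an assumption.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to log its initialization steps
    if socket_ifname is not None:
        # Restrict NCCL's socket traffic to one network interface.
        env["NCCL_SOCKET_IFNAME"] = socket_ifname
    return env


# "eth0" is a placeholder; use the interface that carries the worker IPs.
env = nccl_debug_env(base_env={"PATH": "/usr/bin"}, socket_ifname="eth0")
for key in sorted(env):
    print("%s=%s" % (key, env[key]))
```

The resulting mapping can be passed as the env argument of subprocess.run when launching the training script, with TF_CONFIG added for each worker. The extra NCCL log lines usually name the call that failed, which narrows down whether the problem lies in networking, shared memory, or GPU visibility.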
