alibaba / fastnn Goto Github PK

View Code? Open in Web Editor NEW

80.0 7.0 19.0 327 KB

FastNN provides distributed training examples that use EPL.

License: Apache License 2.0

Python 97.42% Makefile 0.11% Shell 2.46%

distributed deep-learning models pai

fastnn's Introduction

EPL Examples

This repo contains distributed training examples that use Easy Parallel Library (EPL).

Install dependent libraries

Install requirements

pip install -r requirements.txt

Install EPL

You can refer to EPL installation document for detailed instruction.

fastnn's People

Contributors

Stargazers

Watchers

Forkers

adoda jlscs charles9304 conleykong reedip kuustudio cordate ldw-sh-cn jiapengwei husin123 iamgoldhua marmikreal stevenjokess ajunlonglive seaofocean qq814551366 ssunnyy habvt yfding0208

fastnn's Issues

关于EPL作为开源的、统一了多种并行策略、灵活易用的自研分布式深度学习训练框架。

关于EPL的易用性、用户的模型编程接口和训练接口均基于TensorFlow。基于这点考虑，请问后续会出pytorch版本的吗，或者说tensorflow在这方面有什么优势，谢谢！

resnet示例nccl_communicator报错

环境:

基于nvcr.io/nvidia/tensorflow:21.12-tf1-py3构建的容器

脚本:

FastNN的resnet脚本

启动命令

TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh

TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh

报错

2023-08-31 01:40:46.786721: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.397497: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.403631: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.433142: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
         [[{{node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater}}]]

Traceback (most recent call last):
  File "resnet_dp.py", line 92, in <module>
    run_model()
  File "resnet_dp.py", line 67, in run_model
    with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 581, in MonitoredTrainingSession
    return MonitoredSession(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1010, in __init__
    super(MonitoredSession, self).__init__(
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 319, in init
    res = fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 639, in create_session
    return self._get_session_manager().prepare_session(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 453, in run
    assign_ops = _init_local_resources(self, fn)
  File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 423, in _init_local_resources
    fn(self, local_resources_init_op)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
         [[node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

2台服务器分布式跑resnet_split.py遇到无限等待的情况

环境： nvcr.io/nvidia/tensorflow:21.12-tf1-py3镜像的容器
代码： FastNN/resnet/resnet_split.py
执行命令：
服务器1：TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
服务器2：TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh

服务器1的执行情况：

服务器2的执行情况：

可以看到服务器1的still waiting只打印了2条就不打印了说明已经接收到了服务器2的回复，但是没有继续往下运行。
补充： 同样的环境可以分布式运行bert，服务器之间是可以正常连接跑分布式训练的。

想问下是我的执行问题还是代码需要进行修改？

bert示例运行报错 OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error，请问如何解决？

2023-09-27 09:21:54.582250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:worker/replica:0/task:1/device:GPU:0 with 5211 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:0b:00.0, compute capability: 7.0)
2023-09-27 09:21:54.583242: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:54368, 1 -> localhost:45069}
2023-09-27 09:21:54.587690: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:45069
2023-09-27 09:21:59.865339: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-09-27 09:21:59.865421: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: unhandled system error
[[{{node BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater}}]]
2023-09-27 09:21:59.865506: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-09-27 09:21:59.865408: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:1:
unhandled system error
[[node BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater':
File "run_squad_dp.py", line 33, in
tf.app.run()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/epl/FastNN/bert/run_squad.py", line 1255, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=hooks)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3025, in train
return super(TPUEstimator, self).train(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1193, in _train_model_default
return self._train_with_estimator_spec(estimator_spec, worker_hooks,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1478, in _train_with_estimator_spec
with training.MonitoredTrainingSession(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 581, in MonitoredTrainingSession
return MonitoredSession(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1010, in init
super(MonitoredSession, self).init(
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 319, in init
res = fn(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in init
_WrappedSession.init(self, self._create_session())
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 639, in create_session
return self._get_session_manager().prepare_session(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 453, in run
assign_ops = _init_local_resources(self, fn)
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 416, in _init_local_resources
assign_ops = broadcast_variables()
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 354, in broadcast_variables
reduced_variables = comm.broadcast(bcast_variables)
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/collective_communicator.py", line 131, in broadcast
comm_pool = self.get_or_create_comm(comm_name, comm_spec, communication_op=Communicator.BROADCAST)
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/collective_communicator.py", line 78, in get_or_create_comm
comm = CommunicationPool(self.options.num_communicators,
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/communication_pool.py", line 41, in init
self.communicator_list = [
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/communication_pool.py", line 42, in
build_communicator('{}/group{}'.format(comm_name, index), comm_spec)
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/options.py", line 309, in build_communicator
return Communicator.create(
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/base.py", line 98, in create
return impl(shared_name, devices=devices, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/nccl.py", line 79, in init
ops.GraphKeys.LOCAL_RESOURCES, self.build_resource())
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/nccl.py", line 101, in build_resource
self._create_op = self._handle.create(
File "/usr/local/lib/python3.8/dist-packages/epl/communicators/nccl_ops.py", line 165, in create
return _ops.epl_nccl_communicator_creater(
File "", line 1007, in epl_nccl_communicator_creater
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 792, in _apply_op_helper
op = g.create_op(op_type_name, inputs, dtypes=None, name=scope,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py", line 3356, in create_op
return self._create_op_internal(op_type, inputs, dtypes, input_types, name,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py", line 3418, in _create_op_internal
ret = Operation(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()

E0927 09:21:59.979781 140333610264384 error_handling.py:75] Error recorded from training_loop: From /job:worker/replica:0/task:1:
unhandled system error
[[node BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'BROADCAST_0_broadcast_pool_group_0/1/EplNcclCommunicatorCreater':

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

Jobs

Jooble