pytorch / elastic

PyTorch elastic training

License: BSD 3-Clause "New" or "Revised" License

Python 55.42% Dockerfile 1.86% Shell 2.12% Makefile 2.21% Go 38.38%

elastic's Introduction

TorchElastic

IMPORTANT: This repository is deprecated.

  1. TorchElastic has been upstreamed to PyTorch 1.9 under torch.distributed.elastic. Please refer to the PyTorch documentation here.

  2. The TorchElastic Controller for Kubernetes is no longer being actively maintained in favor of TorchX.

elastic's People

Contributors

aashna, aivanou, amyreese, andriigrynenko, cbalioglu, chowarfb, d4l3k, facebook-github-bot, isunjin, jeffwan, jspisak, kit1980, kiukchung, mannatsingh, maximsch2, osalpekar, pradeep90, r-barnes, rohan-varma, samuelmarks, spredolac, stanislavglebik, taohe, thatch, trynity, umialpha, vitalyfedyunin, vreis, xush6528, yifuwang


elastic's Issues

Torch Elastic - How to make sure all nodes are in the same AZ?

โ“ Questions and Help

Please note that this issue tracker is not a help form and this issue will be closed.

Before submitting, please ensure you have gone through our documentation. Here
are some links that may be helpful:

Question

Hi, when using TorchElastic + AWS EKS, how can we ensure that all nodes of a multi-node training job are located in the same AZ? This is critical for multi-node jobs, both for data-transfer speed and for data-transfer costs.

One naive way would be to specify just one subnet when creating the EKS cluster. But is there a way to create an EKS cluster with multiple subnets and, when TorchElastic launches the nodes for a training job, have it place all of them within a single subnet/AZ (one of the cluster's subnets)? And is this possible with spot instances?

Thanks!

Job fails when the dataset size is not exactly divisible by the batch size

โ“ Questions and Help

I am trying the resnet101 ImageNet example with the Tiny ImageNet dataset (https://tiny-imagenet.herokuapp.com/).

I noticed a failure in the customized sampler:

if start_index >= len(dataset):
    raise ValueError(
        "Start index {} should be less than dataset size {}".format(
            start_index, len(dataset)
        )
    )

I have two questions

  1. Should I specify drop_last=True in the DataLoader initialization?

    self.data_loader = torch.utils.data.DataLoader(
        self.dataset,
        batch_size=self.total_batch_size,
        shuffle=(sampler is None),
        num_workers=num_data_workers,
        pin_memory=True,
        sampler=sampler,
        multiprocessing_context=None if num_data_workers == 0 else "forkserver",
    )
  2. If start_index == len(dataset), it should finish the current epoch. Should we change the condition to start_index > len(dataset)?

if start_index >= len(dataset):
    raise ValueError(
        "Start index {} should be less than dataset size {}".format(
            start_index, len(dataset)
        )
    )

17:31:57
[INFO] 2020-01-02 01:31:57,146 main: epoch: 0, iteration: 1560, data_idx: 149760
[INFO] 2020-01-02 01:31:57,882 main: epoch: 0, iteration: 1561, data_idx: 149856
[INFO] 2020-01-02 01:31:58,451 main: epoch: 0, iteration: 1562, data_idx: 149952
[ERROR] 2020-01-02 01:31:58,497 coordinator_p2p: Rank: 0
Error: Start index 100032 should be less than dataset size 100000
ErrorType: <class 'ValueError'>
StackTrace: Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/cycling_iterator.py", line 36, in __next__
return next(self._iter)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
index = self._next_index() # may raise StopIteration
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 318, in _next_index
return next(self._sampler_iter) # may raise StopIteration
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torchelastic/train_loop.py", line 96, in train
state, worker_stats = train_step(state)
File "/tmp/fetch_and_run_shqk63yy/main.py", line 285, in train_step
input, target = next(state.data_iter)
File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/cycling_iterator.py", line 40, in __next__
self._iter = self._generator_fn(self._epoch)
File "/tmp/fetch_and_run_shqk63yy/main.py", line 177, in _data_iter_generator_fn
start_index=self.data_start_index,
File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/elastic_distributed_sampler.py", line 41, in __init__
start_index, len(dataset)
ValueError: Start index 100032 should be less than dataset size 100000
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/cycling_iterator.py", line 36, in __next__
return next(self._iter)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
index = self._next_index() # may raise StopIteration
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 318, in _next_index
return next(self._sampler_iter) # may raise StopIteration
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/fetch_and_run_shqk63yy/main.py", line 430, in <module>
main()
File "/tmp/fetch_and_run_shqk63yy/main.py", line 411, in main
args.input_path,
File "/tmp/fetch_and_run_shqk63yy/main.py", line 276, in single_trainer
torchelastic.train(coordinator, train_step, state)
File "/opt/conda/lib/python3.6/site-packages/torchelastic/train_loop.py", line 96, in train
state, worker_stats = train_step(state)
File "/tmp/fetch_and_run_shqk63yy/main.py", line 285, in train_step
input, target = next(state.data_iter)
File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/cycling_iterator.py", line 40, in __next__
self._iter = self._generator_fn(self._epoch)
File "/tmp/fetch_and_run_shqk63yy/main.py", line 177, in _data_iter_generator_fn
start_index=self.data_start_index,
File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/elastic_distributed_sampler.py", line 41, in __init__
start_index, len(dataset)
ValueError: Start index 100032 should be less than dataset size 100000
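
Regarding question 1, a minimal sketch (illustrative only; note the DataLoader argument is spelled drop_last, and the dataset/batch sizes below just mirror the numbers in the log):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Toy dataset whose size (100000) is not a multiple of the batch size (96),
    # mirroring the failure above. drop_last=True discards the final partial
    # batch, so the sampler never advances past len(dataset).
    dataset = TensorDataset(torch.zeros(100000, 1))
    loader = DataLoader(
        dataset,
        batch_size=96,
        shuffle=False,
        num_workers=0,
        drop_last=True,  # note: the argument is drop_last, not dropLast
    )
    print(len(loader))  # 1041 full batches; the trailing 64 samples are dropped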


A pytorch-native ElasticDDP module proposal

Motivation/Background

Whenever PET needs to scale up, scale down, or recover from a failure, the ElasticAgent kills and restarts the process. The major benefit of this approach is that users get elasticity without any code changes. However, it has some drawbacks.

  1. Killing and restarting typically takes 30 seconds or more. This overhead limits how frequently parallelism can be adjusted.
  2. PET offers elasticity only at epoch granularity, which also limits how frequently parallelism can be adjusted.

Description

Inspired by PET v0.1, I would like to suggest adding a PyTorch-native ElasticDDP module. It has the same distributed training ability as DDP, but offers additional capabilities.

  1. Without the PET framework, ElasticDDP works just like DDP.
  2. It contains more state that needs to be synchronized (e.g. optimizer, current epoch, current batch index).
  3. It can (re)initialize the process group and (re)synchronize state at runtime.
  4. It exposes extra APIs to cooperate with the PET framework, enabling finer-grained elastic control over the training process, i.e. at batch level.

Detailed Proposal

[architecture diagram]

Worker

  1. Preparation. Preparation includes any non-retryable startup logic (model definition, memory allocation, etc.).
  2. Wait Barrier. After preparation, the worker waits for the ElasticAgent's signal.
  3. (Re)Init Process Group. When the barrier is ready, the worker (re)initializes the process group.
  4. (Re)Sync. Before training, the worker synchronizes model state, optimizer state, current batch, current epoch, etc.
  5. Train. The user-provided training function is executed with the state object and any additional user parameters.
  6. Failed?
    a. If an allreduce error occurs, restore from the last committed state and try to rebuild the process group.
    b. If another internal error occurs, exit as a failed worker.
  7. Commit. At this point, the worker checks whether it needs to re-initialize the process group (see the sketch below).
    a. If there is a ScaleUp signal, go to (Re)Init Process Group.
    b. If there is a ScaleDown signal, go to (Re)Init Process Group. We separate this signal from ScaleUp because no (Re)Sync is actually needed.
    c. If there is a Failure signal, go to Wait Barrier.
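
A rough sketch of the worker loop described above (every name below is a placeholder for the numbered steps, not an existing torchelastic or PyTorch API; this is illustrative pseudocode, not a working implementation):

    class Signal:
        NONE, SCALE_UP, SCALE_DOWN, FAILURE = range(4)

    def worker_loop(agent, state, train_one_batch, max_epochs):
        state.prepare()                          # 1. non-retryable startup logic
        while state.epoch < max_epochs:
            agent.wait_barrier()                 # 2. wait for the ElasticAgent's signal
            state.init_process_group()           # 3. (re)initialize the process group
            state.sync_from_rank0()              # 4. (re)sync model/optimizer/epoch/batch
            while True:
                try:
                    train_one_batch(state)       # 5. user-provided training function
                except RuntimeError:             # 6a. allreduce error: restore, rebuild group
                    state.restore_last_commit()
                    break
                # 6b. any other exception propagates and the worker exits as failed
                state.commit()                   # 7. commit state (kept in memory)
                sig = agent.poll()               # 7a-c. scale-up/down or failure signal
                if sig in (Signal.SCALE_UP, Signal.SCALE_DOWN, Signal.FAILURE):
                    break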

ElasticAgent
The main logic is similar to the current logic. The differences are:

  1. When the group changes (scale up, scale down, failure recovery), it does not kill the worker immediately. Instead, it performs the next rendezvous in parallel with the worker's training; when the rendezvous is ready, it sends a signal to the worker, notifying it to reinitialize the process group after it commits.
  2. When performing the next rendezvous, it must guarantee that the oldest workers occupy the lowest ranks, because ElasticDDP synchronizes state from rank 0.

Scenarios

Scale Up

New Node:
The worker does its preparation while the ElasticAgent performs the next barrier in parallel. When the worker is ready, it waits for the ElasticAgent's signal.

Old Node:
The worker continues the current training while the ElasticAgent performs the next barrier in parallel. When the worker commits, it checks whether it is time to re-initialize the process group.

Scale Down

We apply a graceful exit. When receiving a SIGTERM signal, the ElasticAgent will:

  1. notify the other ElasticAgents to perform the next barrier.
  2. send a GracefulExit signal to its Worker to stop after the current batch.
  3. The other ElasticAgents will also send a GracefulExit signal to their own Workers to stop after the current batch.

Failure

When we commit, we can save the ElasticDDP state into memory (we can control the saving frequency). When a failure occurs, we restore from the last state saved in memory.

There are actually two types of failure: node failure and process failure.

  1. When a node fails, the surviving ElasticAgents rebuild the process group and continue the work.
  2. When a process fails, it is hard to determine which kind of failure occurred. For simplicity, we treat a process failure the same as a node failure.

Alternatives

  1. Horovod has its first version of elastic training under review; I absorbed many ideas from its design.
  2. The paper "Elastic deep learning in multi-tenant GPU cluster" takes a deep look at elasticity; it is where my initial thought came from.
  3. alibaba-edl provides Kubernetes-native elasticity for TensorFlow.

TODO

Data access is complicated with elasticity; it needs to be discussed in a separate PR.

Providing keys and certs to access etcd

Is there a way to provide keys and certs to use when accessing the etcd endpoint?
Similar to these curl options: --cacert /etc/kubernetes/certs/ca.crt --cert /etc/kubernetes/certs/client.crt --key /etc/kubernetes/certs/client.key

python-etcd seems to have ways to set this up, but EtcdRendezvous does not.
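
For reference, a minimal sketch of how python-etcd itself accepts certificates, as I understand its Client options (the endpoint and paths below are placeholders; EtcdRendezvous would need to expose equivalent parameters to support this):

    import etcd  # python-etcd

    client = etcd.Client(
        host="etcd.example.com",  # placeholder endpoint
        port=2379,
        protocol="https",
        ca_cert="/etc/kubernetes/certs/ca.crt",
        cert=("/etc/kubernetes/certs/client.crt",
              "/etc/kubernetes/certs/client.key"),
    )
    print(client.machines)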

Define a TrainStepRetryableException and catch it in train_loop.py

🚀 Feature

Currently, train_loop catches RuntimeError from train_step and retries, but other Exceptions are not retried. This implies that RuntimeErrors are considered retryable whereas other Exceptions are not. Clearly define Retryable vs. NonRetryable exceptions for train_step and use that distinction to decide whether the train_step should be rolled back and retried.
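
A minimal sketch of what this could look like (the class and function names below are hypothetical, not the actual train_loop.py API):

    class TrainStepRetryableException(Exception):
        """Raised by train_step for errors that are safe to roll back and retry."""

    def run_step_with_retry(train_step, state, rollback, max_retries=3):
        for _ in range(max_retries):
            try:
                return train_step(state)
            except TrainStepRetryableException:
                rollback(state)  # restore the last known-good state, then retry
            # any other exception propagates to the caller and is not retried
        raise RuntimeError("train_step exhausted its retries")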

Motivation

A better, clearer API for exception handling and retry logic.

Pitch

See description

Alternatives

N/A

Additional context

Link to code: https://github.com/pytorch/elastic/blob/master/torchelastic/train_loop.py#L123

Pytorch Elastic with NCCL backend

I have gone through the code and tested a couple of examples. I have a question about running with the NCCL backend and the init_method env://.

With this method the master address and master port are required, but in the examples we don't specify them as environment variables. How is this handled within the code?
(Rz refers to Rendezvous.)

Or, since the custom Rz handler used here is the EtcdRendezvousHandler, is it the one being registered, called, and returned to the launcher via torch.distributed's rendezvous?

In addition, I would like to know, when a multi-node, multi-GPU setting is used:

Does each agent on each node have its own process group, or is there only a single process group for all processes on all nodes?

If each node has a separate process group, how do they communicate with each other?
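
For context, a minimal sketch of what init_method="env://" expects; to my understanding, the torchelastic agent exports these variables into each worker's environment, so they do not have to be passed on the command line:

    import os
    import torch.distributed as dist

    # init_method="env://" reads the rendezvous information from these variables;
    # the elastic agent is expected to have exported them for each worker.
    for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
        print(var, "=", os.environ.get(var))

    dist.init_process_group(backend="nccl", init_method="env://")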

Kubernetes controller doesn't handle replicas scale down well

๐Ÿ› Bug

This is a known issue on the dependency side. I have a PR merged in the dependency repo:
kubeflow/common#58.

Updating the dependency will resolve this problem.

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

To Reproduce

Steps to reproduce the behavior:

  1. Run example https://github.com/pytorch/elastic/blob/master/kubernetes/config/samples/imagenet.yaml
  2. Scale the replicas down to 1 and apply the job.

Pod and service with index 1 are not deleted.

Expected behavior

Pods and headless service should be removed.

Additional context

ValueError: host not found: Name or service not known with torchelastic

Question

I have a piece of code that runs with torch.distributed.launch.
I modified the code to launch via torchelastic.distributed.launch in the prescribed manner, as seen in the examples from the GitHub page and the official PyTorch documentation.

However, when I run the code I get the following error:

`[INFO] 2020-07-20 14:09:18,781 launch: Running torchelastic.distributed.launch with args: ['/usr/local/lib/python3.6/dist-packages/torchelastic/distributed/launch.py', '--nproc_per_node=4', '--nnodes=2', '--rdzv_id=0', '--rdzv_endpoint=172.20.74.130:2379', 'tools/train_net_elastic.py', '--config-file', 'configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml']
INFO 2020-07-20 14:09:18,785 Etcd machines: ['http://0.0.0.0:2379']
[INFO] 2020-07-20 14:09:18,789 launch: Using nproc_per_node=4.


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[INFO] 2020-07-20 14:09:19,358 api: [default] starting workers for function: wrapper_fn
[INFO] 2020-07-20 14:09:19,358 api: [default] Rendezvous'ing worker group
INFO 2020-07-20 14:09:19,358 Attempting to join next rendezvous
INFO 2020-07-20 14:09:19,362 Observed existing rendezvous state: {'status': 'final', 'version': '18', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_0/rdzv/v_18/rank_1', '/torchelastic/p2p/run_0/rdzv/v_18/rank_0'], 'num_workers_waiting': 0}
INFO 2020-07-20 14:09:19,439 Added self to waiting list. Rendezvous full state: {"status": "final", "version": "18", "participants": [0, 1], "keep_alives": ["/torchelastic/p2p/run_0/rdzv/v_18/rank_1", "/torchelastic/p2p/run_0/rdzv/v_18/rank_0"], "num_workers_waiting": 1}
INFO 2020-07-20 14:09:19,441 Keep-alive key /torchelastic/p2p/run_0/rdzv/v_18/rank_1 is not renewed.
INFO 2020-07-20 14:09:19,441 Rendevous version 18 is incomplete.
INFO 2020-07-20 14:09:19,441 Attempting to destroy it.
INFO 2020-07-20 14:09:19,442 Destroyed rendezvous version 18 successfully.
INFO 2020-07-20 14:09:19,442 Previously existing rendezvous state changed. Will re-try joining.
INFO 2020-07-20 14:09:19,442 Attempting to join next rendezvous
INFO 2020-07-20 14:09:19,447 New rendezvous state created: {'status': 'joinable', 'version': '19', 'participants': []}
INFO 2020-07-20 14:09:19,535 Joined rendezvous version 19 as rank 0. Full state: {'status': 'joinable', 'version': '19', 'participants': [0]}
INFO 2020-07-20 14:09:19,535 Waiting for remaining peers.
INFO 2020-07-20 14:09:27,463 All peers arrived. Confirming membership.
INFO 2020-07-20 14:09:27,492 Waiting for confirmations from all peers.
INFO 2020-07-20 14:09:27,527 Rendezvous version 19 is complete. Final state: {'status': 'final', 'version': '19', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_0/rdzv/v_19/rank_0', '/torchelastic/p2p/run_0/rdzv/v_19/rank_1'], 'num_workers_waiting': 0}
INFO 2020-07-20 14:09:27,527 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-07-20 14:09:27,593 api: [default] Rendezvous complete for workers.
Result:
restart_count=0
group_rank=0
group_world_size=2
rank stride=4
assigned global_ranks=[0, 1, 2, 3]
master_addr=XXXXX
master_port=55673

[INFO] 2020-07-20 14:09:27,593 api: [default] Starting worker group
Traceback (most recent call last):
File "tools/train_net_elastic.py", line 180, in
launch_elastic()
File "tools/train_net_elastic.py", line 170, in launch_elastic
dist.init_process_group(backend="nccl")
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known
`
my other node has the following

`python -m torchelastic.distributed.launch --nproc_per_node=4 --nnodes=2 --rdzv_id=0 --rdzv_endpoint=$ETCD_SERVICE_SERVICE_HOST:$ETCD_SERVICE_SERVICE_PORT tools/train_net_elastic.py --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml
[INFO] 2020-07-20 14:09:26,796 launch: Running torchelastic.distributed.launch with args: ['/usr/local/lib/python3.6/dist-packages/torchelastic/distributed/launch.py', '--nproc_per_node=4', '--nnodes=2', '--rdzv_id=0', '--rdzv_endpoint=172.20.74.130:2379', 'tools/train_net_elastic.py', '--config-file', 'configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml']
INFO 2020-07-20 14:09:26,801 Etcd machines: ['http://0.0.0.0:2379']
[INFO] 2020-07-20 14:09:26,808 launch: Using nproc_per_node=4.


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[INFO] 2020-07-20 14:09:27,369 api: [default] starting workers for function: wrapper_fn
[INFO] 2020-07-20 14:09:27,370 api: [default] Rendezvous'ing worker group
INFO 2020-07-20 14:09:27,370 Attempting to join next rendezvous
INFO 2020-07-20 14:09:27,375 Observed existing rendezvous state: {'status': 'joinable', 'version': '19', 'participants': [0]}
INFO 2020-07-20 14:09:27,462 Joined rendezvous version 19 as rank 1. Full state: {'status': 'frozen', 'version': '19', 'participants': [0, 1], 'keep_alives': []}
INFO 2020-07-20 14:09:27,462 Waiting for remaining peers.
INFO 2020-07-20 14:09:27,464 All peers arrived. Confirming membership.
INFO 2020-07-20 14:09:27,528 Waiting for confirmations from all peers.
INFO 2020-07-20 14:09:27,529 Rendezvous version 19 is complete. Final state: {'status': 'final', 'version': '19', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_0/rdzv/v_19/rank_0', '/torchelastic/p2p/run_0/rdzv/v_19/rank_1'], 'num_workers_waiting': 0}
INFO 2020-07-20 14:09:27,529 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-07-20 14:09:27,593 api: [default] Rendezvous complete for workers.
Result:
restart_count=0
group_rank=1
group_world_size=2
rank stride=4
assigned global_ranks=[4, 5, 6, 7]
master_addr=XXX
master_port=55673
[INFO] 2020-07-20 14:09:27,593 api: [default] Starting worker group
Traceback (most recent call last):
File "tools/train_net_elastic.py", line 180, in
launch_elastic()
File "tools/train_net_elastic.py", line 170, in launch_elastic
dist.init_process_group(backend="nccl")
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known
`

My etcd works for the given examples, and I cannot find the source of this error on Google, StackExchange, or as an issue on this repository.

I have initialized my distribution as
dist.init_process_group(backend="nccl", init_method="env://")

Could you point me in the correct direction for debugging this?

Make `straggler_threshold` configurable

🚀 Feature

See: https://github.com/pytorch/elastic/blob/master/torchelastic/p2p/coordinator_p2p.py#L211
Currently it is hard coded to 0.8:

        # Value 0.8 hardcoded here is chosen to be consistent with EDPM.
        straggler_threshold = 0.8
        if self.last_relative_prog_rate < straggler_threshold:
            self.is_worker_straggler = True
            log.warning(
                f"Straggler monitor: rank {self.rank} is slow "
                f"with relative performance of {self.last_relative_prog_rate} "
                f"(threshold: {straggler_threshold})"
            )
        else:
            self.is_worker_straggler = False
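
One possible way to make it configurable, sketched below (the environment variable name is made up for illustration and is not an existing torchelastic setting):

    import os

    # Fall back to the current default of 0.8 when the variable is unset.
    straggler_threshold = float(
        os.environ.get("TORCHELASTIC_STRAGGLER_THRESHOLD", "0.8")
    )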

Motivation

0.8 might not be the best value for all use cases; make this configurable.

Pitch

N/A

Alternatives

N/A

Additional context

0.8 was taken from a legacy job.

Add example with checkpointing and recovery for Kubernetes

Description

Add a new training example for Kubernetes that demonstrates how to do checkpointing and use it for fault-tolerant recovery.

Motivation/Background

To make it easier for model developers to benefit from the fault-tolerance features of torchelastic.

Detailed Proposal

  • A simple example could checkpoint on every worker node, or just on the rank 0 node, and describe how to recover from it (see the sketch below).
  • On EKS, provide steps for how to use it with spot instances.
  • In the documentation, describe the typical strategies that can be used for checkpointing and recovery (e.g. checkpoint on every node, on the parameter server node, on the rank 0 node, etc.).
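
A minimal rank-0 checkpoint/restore sketch (illustrative only, not the requested example; the shared path below is a placeholder such as an EFS mount):

    import os
    import torch
    import torch.distributed as dist

    CKPT_PATH = "/mnt/efs/checkpoint.pt"  # placeholder shared-volume path

    def save_checkpoint(model, optimizer, epoch):
        # Only rank 0 writes; other workers recover from the shared volume.
        if dist.get_rank() == 0:
            torch.save(
                {
                    "epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                },
                CKPT_PATH,
            )

    def load_checkpoint(model, optimizer):
        # Every (re)started worker resumes from the latest checkpoint, if any.
        if os.path.exists(CKPT_PATH):
            ckpt = torch.load(CKPT_PATH, map_location="cpu")
            model.load_state_dict(ckpt["model"])
            optimizer.load_state_dict(ckpt["optimizer"])
            return ckpt["epoch"] + 1
        return 0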

Additional context/links

For reference see examples in ray cluster on spot instances: https://ray.readthedocs.io/en/latest/tune-distributed.html#example-for-using-spot-instances-aws

Imagenet example fails during accuracy calculation (v0.2.2 on 1.8.1)

๐Ÿ› Bug

When running the imagenet example from examples/imagenet,
I get the following error:

[INFO] 2021-05-30 13:09:18,531 api: [default] Starting worker group
=> set cuda device = 0
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Traceback (most recent call last):
File "main.py", line 594, in
main()
File "main.py", line 183, in main
train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
File "main.py", line 455, in train
acc1, acc5 = accuracy(output, target, topk=(1, 5))
File "main.py", line 588, in accuracy
correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
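
As the error message suggests, replacing .view(-1) with .reshape(-1) in the accuracy helper avoids the failure on non-contiguous tensors. A sketch of the patched helper, based on the usual ImageNet-example accuracy function:

    import torch

    def accuracy(output, target, topk=(1,)):
        """Top-k accuracy; uses .reshape() so it also works on
        non-contiguous slices of the `correct` tensor."""
        with torch.no_grad():
            maxk = max(topk)
            batch_size = target.size(0)
            _, pred = output.topk(maxk, 1, True, True)
            pred = pred.t()
            correct = pred.eq(target.view(1, -1).expand_as(pred))
            res = []
            for k in topk:
                correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
                res.append(correct_k.mul_(100.0 / batch_size))
            return res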

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • [x] examples
  • docker
  • other

To Reproduce

See environment

Expected behavior

Training should work and accuracy should be reported correctly

Environment

Dockerfile:

FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime

RUN apt-get -q update && apt-get -q install -y wget unzip
RUN pip install torchelastic==0.2.2

RUN mkdir ./train
COPY elastic/examples/imagenet/main.py ./train
WORKDIR ./train
RUN chmod -R a+w .
USER root
ENTRYPOINT ["python", "-m", "torchelastic.distributed.launch"]
CMD ["--help"]

Enable NCCL_ASYNC_ERROR_HANDLING in Torchelastic

Description

We've introduced a new option in PyTorch that significantly improves reliability in PyTorch DDP training jobs that use NCCL. More info can be found here: pytorch/pytorch#46874. This option essentially prevents stuck collectives from blocking DDP training progress by catching these stuck collectives and bringing down the training process. This introduces no performance overhead to regular training, and when used along with Torchelastic, could be used to quickly restart training from a previous checkpoint. Due to this added reliability, this option should be the default behavior for torchelastic training runs.

This behavior is enabled by setting the environment variable NCCL_ASYNC_ERROR_HANDLING. Furthermore, the duration after which collectives are considered stuck can be tuned by passing in an appropriate timeout argument to the init_process_group call.
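
A minimal sketch of enabling this manually in a training script (the ten-minute timeout is just an example value):

    import os
    from datetime import timedelta
    import torch.distributed as dist

    # Must be set in each worker's environment before the process group is created.
    os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

    # The timeout controls how long a collective may block before it is
    # considered stuck and the training process is brought down.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        timeout=timedelta(minutes=10),
    )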

How to run elastically on kubernetes (nnodes vs worker replicas)

Question

  • On the frontpage README.md of the repo it says to run Elastic on 1 ~ 4 nodes, 8 trainers/node, total 8 ~ 32 trainers. Job starts as soon as 1 node is healthy, you may add up to 4 nodes.
python -m torchelastic.distributed.launch
            --nnodes=1:4
            --nproc_per_node=8
            --rdzv_id=JOB_ID
            --rdzv_backend=etcd
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
  • In the docs for the kube example it says:
set Worker.replicas to the number of nodes to start with (you may modify this later to scale the job in/out)
  • When I try to run with minReplicas of 1, maxReplicas of 2, and replicas of 2, and the autoscaling group for my training nodes has only one node available, training starts with the one available node, and the second one joins when it can, but it seems to reset progress 👇. Is this expected because we haven't hit a checkpoint yet? Is this desired? Especially in a world where we're using spot instances, how can I make sure I don't get stuck in a loop like this, redoing the same epoch?
Instance: [i-015067026ed8f10a3] Epoch: [0][1830/3125]   Time  0.139 ( 0.137)    Data  0.034 ( 0.034)    Loss 4.8472e+00 (5.0661e+00)    Acc@1   3.12 (  3.16)   Acc@5   9.38 ( 11.34)
Instance: [i-015067026ed8f10a3] Epoch: [0][1840/3125]   Time  0.139 ( 0.137)    Data  0.034 ( 0.034)    Loss 4.7636e+00 (5.0644e+00)    Acc@1   9.38 (  3.17)   Acc@5  18.75 ( 11.37)
INFO 2020-04-27 18:52:52,967 Etcd machines: ['http://0.0.0.0:2379']
INFO 2020-04-27 18:52:53,585 Attempting to join next rendezvous
Instance: [i-015067026ed8f10a3] Epoch: [0][1850/3125]   Time  0.139 ( 0.137)    Data  0.034 ( 0.034)    Loss 4.6797e+00 (5.0630e+00)    Acc@1   3.12 (  3.17)   Acc@5  18.75 ( 11.39)
Instance: [i-015067026ed8f10a3] Epoch: [0][1860/3125]   Time  0.139 ( 0.137)    Data  0.034 ( 0.034)    Loss 5.0609e+00 (5.0614e+00)    Acc@1   6.25 (  3.17)   Acc@5  15.62 ( 11.42)
INFO 2020-04-27 18:52:53,587 Observed existing rendezvous state: {'status': 'final', 'version': '10', 'participants': [0], 'keep_alives': ['/torchelastic/p2p/run_imagenet/rdzv/v_10/rank_0'], 'num_workers_waiting': 0}
Instance: [i-015067026ed8f10a3] Epoch: [0][1870/3125]   Time  0.139 ( 0.137)    Data  0.034 ( 0.034)    Loss 4.1825e+00 (5.0594e+00)    Acc@1   6.25 (  3.18)   Acc@5  28.12 ( 11.46)
INFO 2020-04-27 18:52:53,628 Added self to waiting list. Rendezvous full state: {"status": "final", "version": "10", "participants": [0], "keep_alives": ["/torchelastic/p2p/run_imagenet/rdzv/v_10/rank_0"], "num_workers_waiting": 1}
Instance: [i-015067026ed8f10a3] Epoch: [0][1880/3125]   Time  0.139 ( 0.137)    Data  0.034 ( 0.034)    Loss 5.0155e+00 (5.0574e+00)    Acc@1   0.00 (  3.18)   Acc@5   6.25 ( 11.47)
Instance: [i-015067026ed8f10a3] Epoch: [0][1890/3125]   Time  0.139 ( 0.137)    Data  0.034 ( 0.034)    Loss 4.8805e+00 (5.0552e+00)    Acc@1   9.38 (  3.21)   Acc@5  18.75 ( 11.53)
INFO 2020-04-27 18:52:58,719 Attempting to join next rendezvous
INFO 2020-04-27 18:52:58,722 Observed existing rendezvous state: {'status': 'final', 'version': '10', 'participants': [0], 'keep_alives': ['/torchelastic/p2p/run_imagenet/rdzv/v_10/rank_0'], 'num_workers_waiting': 1}
INFO 2020-04-27 18:52:58,782 Added self to waiting list. Rendezvous full state: {"status": "final", "version": "10", "participants": [0], "keep_alives": ["/torchelastic/p2p/run_imagenet/rdzv/v_10/rank_0"], "num_workers_waiting": 2}
INFO 2020-04-27 18:53:08,501 Keep-alive key /torchelastic/p2p/run_imagenet/rdzv/v_10/rank_0 is not renewed.
INFO 2020-04-27 18:53:08,501 Rendevous version 10 is incomplete. 
INFO 2020-04-27 18:53:08,501 Attempting to destroy it.
INFO 2020-04-27 18:53:08,502 Keep-alive key /torchelastic/p2p/run_imagenet/rdzv/v_10/rank_0 is not renewed.
INFO 2020-04-27 18:53:08,502 Destroyed rendezvous version 10 successfully.
INFO 2020-04-27 18:53:08,502 Previously existing rendezvous state changed. Will re-try joining.
INFO 2020-04-27 18:53:08,502 Rendevous version 10 is incomplete. 
INFO 2020-04-27 18:53:08,502 Attempting to destroy it.
INFO 2020-04-27 18:53:08,503 Rendezvous attempt failed, will retry. Reason: Key not found : /torchelastic/p2p/run_imagenet/rdzv/active_version
INFO 2020-04-27 18:53:08,502 Attempting to join next rendezvous
INFO 2020-04-27 18:53:08,506 New rendezvous state created: {'status': 'joinable', 'version': '11', 'participants': []}
INFO 2020-04-27 18:53:08,541 Joined rendezvous version 11 as rank 0. Full state: {'status': 'joinable', 'version': '11', 'participants': [0]}
INFO 2020-04-27 18:53:08,541 Rank 0 is responsible for join last call.
INFO 2020-04-27 18:53:09,504 Attempting to join next rendezvous
INFO 2020-04-27 18:53:09,507 Observed existing rendezvous state: {'status': 'joinable', 'version': '11', 'participants': [0]}
INFO 2020-04-27 18:53:09,540 Joined rendezvous version 11 as rank 1. Full state: {'status': 'frozen', 'version': '11', 'participants': [0, 1], 'keep_alives': []}
INFO 2020-04-27 18:53:09,540 Waiting for remaining peers.
INFO 2020-04-27 18:53:09,541 Rank 0 finished join last call.
INFO 2020-04-27 18:53:09,541 Waiting for remaining peers.
INFO 2020-04-27 18:53:09,541 All peers arrived. Confirming membership.
INFO 2020-04-27 18:53:09,541 All peers arrived. Confirming membership.
INFO 2020-04-27 18:53:09,567 Waiting for confirmations from all peers.
INFO 2020-04-27 18:53:09,574 Waiting for confirmations from all peers.
INFO 2020-04-27 18:53:09,575 Rendezvous version 11 is complete. Final state: {'status': 'final', 'version': '11', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_imagenet/rdzv/v_11/rank_1', '/torchelastic/p2p/run_imagenet/rdzv/v_11/rank_0'], 'num_workers_waiting': 0}
INFO 2020-04-27 18:53:09,575 Rendezvous version 11 is complete. Final state: {'status': 'final', 'version': '11', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_imagenet/rdzv/v_11/rank_1', '/torchelastic/p2p/run_imagenet/rdzv/v_11/rank_0'], 'num_workers_waiting': 0}
INFO 2020-04-27 18:53:09,575 Creating EtcdStore as the c10d::Store implementation
INFO 2020-04-27 18:53:09,575 Creating EtcdStore as the c10d::Store implementation
Cuda is available: True
Cuda is available: True
=> set cuda device = 0
=> set cuda device = 0
=> creating model: resnet18
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
=> start_epoch: 0, best_acc1: 0
Instance: [i-015067026ed8f10a3] Epoch: [0][   0/1563]   Time  0.724 ( 0.724)    Data  0.077 ( 0.077)    Loss 7.0555e+00 (7.0555e+00)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.00)
Instance: [i-06c6160d86fae08d9] Epoch: [0][   0/1563]   Time  0.730 ( 0.730)    Data  0.071 ( 0.071)    Loss 7.1953e+00 (7.1953e+00)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.00)

Training hangs when removing/adding instances

๐Ÿ› Bug

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

To Reproduce

Steps to reproduce the behavior:

  1. Start etcd process
  2. Start two instances of Classy Vision example with min=1 and max=10
  3. Wait for training start
  4. Kill one of the instances
  5. The rendezvous is updated, but training gets stuck after executing this line

You can also try to add new instances to the cluster; training gets stuck at the same place.

Expected behavior

Adding or removing instances within [min, max] shouldn't hang

Environment

  • torchelastic version (e.g. 0.1.0rc1): 0.1.0rc2
  • OS (e.g., Linux):ubuntu18.04 and MacOS
  • How you installed torchelastic (conda, pip, source, docker):docker
  • Docker image and tag (if using docker):pytorch/pytorch:latest
  • Build command you used (if compiling from source):pip install
  • Git commit (if installed from source):f5a0968aa3842a1f29e1d36aee59cb2acff5cfaa
  • Python version:3.6
  • CUDA/cuDNN version:10.0
  • GPU models and configuration:V100
  • Execution environment (on-prem, aws, etc):on-prem
  • Any other relevant information:

Additional context

gdb output:

(gdb) bt
#0 0x00007f10cab4c98d in pthread_join (threadid=139707616458496, thread_return=0x0) at pthread_join.c:90
#1 0x00007f10bb08ab97 in std::thread::join() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f10bbb66754 in c10d::TCPStoreDaemon::~TCPStoreDaemon() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#3 0x00007f10bbb6684e in c10d::TCPStore::~TCPStore() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#4 0x00007f10bbb66979 in c10d::TCPStore::~TCPStore() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
...
(gdb) info threads
Id Target Id Frame
1 Thread 0x7f10caf7e700 (LWP 86) "python" 0x00007f10cab4c98d in pthread_join (threadid=139707616458496, thread_return=0x0) at pthread_join.c:90
2 Thread 0x7f106b92b700 (LWP 91) "python" 0x00007f10cab53a15 in futex_abstimed_wait_cancelable (private=0, abstime=0x7f106b929ec0, expected=0, futex_word=0x7f10640012a0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
3 Thread 0x7f106b129780 (LWP 92) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
...

Remove dependency to torch 1.5.0 nightly when 1.5.0 releases

๐Ÿ› Bug

Currently torchelastic-0.1.0rc2 (trunk) depends on torch nightly to use EtcdStore instead of the TCPStore. The following two things need to happen when torch 1.5.0 releases:

  1. In requirements.txt, change the dependency from torch>1.4.0 to torch>=1.5.0
  2. Edit .circleci/config.yml#install_dep target to remove the manual install of torch nightly

Known workarounds

  • Manually install torch nightly (1.5.0.dev+). For more details see https://pytorch.org/
    • conda (cpu or gpu)
      conda install pytorch torchvision -c pytorch
      
    • pip (cpu)
       pip install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
      
    • pip(gpu)
       pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
      
  • Then install torchelastic (where ~/elastic is the repo root)
    cd ~/elastic
    python setup.py install
    -- or --
    pip install -e ~/elastic
    

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

To Reproduce

N/A

Expected behavior

python setup.py test should pass all tests

Environment

  • torchelastic version (e.g. 0.1.0rc1): 0.1.0rc2
  • OS (e.g., Linux): Linux
  • How you installed torchelastic (conda, pip, source, docker): from source
  • Docker image and tag (if using docker):
  • Build command you used (if compiling from source):
  • Git commit (if installed from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Execution environment (on-prem, aws, etc):
  • Any other relevant information:

Additional context

See
#37
#34

Remove unnecessary port in Kubernetes operator examples

Description

In the operator examples, we add a container port because the underlying framework requires a port to create a headless service. This is unnecessary for a torch elastic job because the workers do not communicate with each other over that fixed port.

ports:
- containerPort: 10291
name: elasticjob-port

I resolved the dependency issue here:
kubeflow/common@0b3a4c3

We can update the dependency to the latest version and remove those blocks, so they no longer confuse users.

Motivation/Background

Simplify examples

Detailed Proposal

Return an empty string/zero here; elastic doesn't have to implement this interface.

func (r *ElasticJobReconciler) GetDefaultContainerPortNumber() int32 {
    return v1alpha1.DefaultPort
}

func (r *ElasticJobReconciler) GetDefaultContainerPortName() string {
    return v1alpha1.DefaultContainerPortName
}

Imagenet example crashing on Kubernetes

๐Ÿ› Bug

The Imagenet example does not work on Kubernetes; it crashes with the error "ValueError: Start index 100000 should be less than dataset size 100000".

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

To Reproduce

Steps to reproduce the behavior:

  1. Follow the steps as per the K8s docs to configure EKS/K8s cluster and get the imagenet example working on it
  2. See the logs for the imagenet example; it crashes after some time. See the error details below:

Error logs:

kubectl logs -f imagenet-worker-0  -n elastic-job

[INFO] 2020-04-07 20:09:52,429 main: epoch: 0, iteration: 1557, data_idx: 149248
[INFO] 2020-04-07 20:09:53,930 main: epoch: 0, iteration: 1558, data_idx: 149344
[INFO] 2020-04-07 20:09:55,541 main: epoch: 0, iteration: 1559, data_idx: 149440
[INFO] 2020-04-07 20:09:57,083 main: epoch: 0, iteration: 1560, data_idx: 149536
[INFO] 2020-04-07 20:09:58,689 main: epoch: 0, iteration: 1561, data_idx: 149632
[INFO] 2020-04-07 20:10:00,381 main: epoch: 0, iteration: 1562, data_idx: 149728
[INFO] 2020-04-07 20:10:01,914 main: epoch: 0, iteration: 1563, data_idx: 149824
[INFO] 2020-04-07 20:10:03,498 main: epoch: 0, iteration: 1564, data_idx: 149920
[INFO] 2020-04-07 20:10:04,813 main: epoch: 0, iteration: 1565, data_idx: 150016
[ERROR] 2020-04-07 20:10:04,864 coordinator_p2p: Rank: 1
Error: Start index 100000 should be less than dataset size 100000
ErrorType: <class 'ValueError'>
StackTrace: Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/cycling_iterator.py", line 36, in __next__
    return next(self._iter)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    index = self._next_index()  # may raise StopIteration
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 318, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torchelastic/train_loop.py", line 96, in train
    state, worker_stats = train_step(state)
  File "/workspace/examples/imagenet/main.py", line 285, in train_step
    input, target = next(state.data_iter)
  File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/cycling_iterator.py", line 40, in __next__
    self._iter = self._generator_fn(self._epoch)
  File "/workspace/examples/imagenet/main.py", line 177, in _data_iter_generator_fn
    start_index=self.data_start_index,
  File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/elastic_distributed_sampler.py", line 41, in __init__
    start_index, len(dataset)
ValueError: Start index 100000 should be less than dataset size 100000

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/cycling_iterator.py", line 36, in __next__
    return next(self._iter)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    index = self._next_index()  # may raise StopIteration
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 318, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/examples/imagenet/main.py", line 429, in <module>
    main()
  File "/workspace/examples/imagenet/main.py", line 410, in main
    args.input_path,
  File "/workspace/examples/imagenet/main.py", line 276, in single_trainer
    torchelastic.train(coordinator, train_step, state)
  File "/opt/conda/lib/python3.6/site-packages/torchelastic/train_loop.py", line 96, in train
    state, worker_stats = train_step(state)
  File "/workspace/examples/imagenet/main.py", line 285, in train_step
    input, target = next(state.data_iter)
  File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/cycling_iterator.py", line 40, in __next__
    self._iter = self._generator_fn(self._epoch)
  File "/workspace/examples/imagenet/main.py", line 177, in _data_iter_generator_fn
    start_index=self.data_start_index,
  File "/opt/conda/lib/python3.6/site-packages/torchelastic/utils/data/elastic_distributed_sampler.py", line 41, in __init__
    start_index, len(dataset)
ValueError: Start index 100000 should be less than dataset size 100000

Expected behavior

Training job should complete without any errors

User loses work if a cluster change occurs in the middle of an epoch

Description

Currently, when a cluster membership change occurs, the agent kills all the workers running the user's script, performs rank redistribution, and spawns them again. Since there is neither a feedback mechanism nor a communication protocol between the workers and the agent, the user can lose all computational work done since the last checkpoint.

Code structure for the Kubernetes operator code

โ“ Questions and Help

Please note that this issue tracker is not a help form and this issue will be closed.

Before submitting, please ensure you have gone through our documentation. Here
are some links that may be helpful:

Question

Hi, I am ready to submit a PR for the Kubernetes operator. Looking at the current structure, we have Kubernetes-native examples under /kubernetes and we also have an examples/ folder.

Should I put the operator code under /kubernetes or in a separate /operator folder? Should I move the Kubernetes-native examples under /examples?

Pytorch Lightning with TorchElastic - One worker doesn't start

๐Ÿ› Bug

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

To Reproduce

Steps to reproduce the behavior:

Somehow, one of the workers doesn't start. Any idea what is wrong?

(screenshots omitted)

apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: s-9670336-3-7-1-5-e
  namespace: elastic-job
spec:
  # Use "etcd-service:2379" if you already apply etcd.yaml
  rdzvEndpoint: etcd://10.100.165.225:2379/s-9670336-3-7-1-5-e
  minReplicas: 2
  maxReplicas: 2
  replicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      template:
        apiVersion: v1
        kind: Pod
        spec:
          containers:
            - name: s-9670336-3-7-1-5-e
              image: pytorchlightning/pytorch_lightning:base-cuda-py3.7-torch1.5
              imagePullPolicy: Always
              command: 
              - bash 
              - -ce
              - | 
                git clone https://github.com/tchaton/pytorch-lightning.git /repo
                cd /repo
                git fetch --all
                git checkout 9670336
                pip install -e .
                pip install torchelastic
                python -m torchelastic.distributed.launch --nproc_per_node=1 --rdzv_id=s-9670336-3-7-1-5-e --rdzv_backend=etcd --rdzv_endpoint=10.100.165.225:2379 ./multi_node_tests/test_multi_nodes_gpu.py --num_nodes=2 --gpus=1 --accelerator=ddp --max_epochs 2
              resources:
                limits:
                  nvidia.com/gpu: 1
import sys
import os
ROOT = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..")
sys.path.insert(0, ROOT)
DIR_PATH = os.path.dirname(os.path.realpath(__file__))
import torch
from argparse import ArgumentParser
import pytorch_lightning as pl
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from pytorch_lightning import LightningModule
from torch.utils.data import Dataset
from tests.base.boring_model import *

def cli_main():
    pl.seed_everything(1234)

    # ------------
    # args
    # ------------
    parser = ArgumentParser()
    parser.add_argument('--batch_size', default=32, type=int)
    parser = pl.Trainer.add_argparse_args(parser)
    args = parser.parse_args()

    model = BoringModel()

    trainer = pl.Trainer.from_argparse_args(args)
    trainer.fit(model)

if __name__ == '__main__':
    cli_main()

Expected behavior

Environment

  • torchelastic version (e.g. 0.1.0rc1):
  • OS (e.g., Linux):
  • How you installed torchelastic (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Build command you used (if compiling from source):
  • Git commit (if installed from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Execution environment (on-prem, aws, etc):
  • Any other relevant information:

Additional context

Imagenet example not working on EKS

๐Ÿ› Bug

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

To Reproduce

Steps to reproduce the behavior:

  1. Setup PET EKS cluster as per the steps described in https://pytorch.org/elastic/0.1.0rc2/kubernetes.html
  2. The imagenet dataset download pod gives errors as shown below.
  3. The EFS mountpoints for the 3 subnets in the EKS VPC look good (see below).
  4. On verifying the VPC Subnet and Security group settings, there is no ingress config for NFS.

Errors:
Unable to mount volumes for pod "download-dataset-task_default(e4e4a7e5-67e8-11ea-b947-024ee852554a)": timeout expired waiting for volumes to attach or mount for pod "default"/"download-dataset-task". list of unmounted volumes=[persistent-storage]. list of unattached volumes=[persistent-storage default-token-txtpt]

aws efs describe-mount-targets --file-system-id fs-cb6d9fb3
{
"MountTargets": [
{
"OwnerId": "379740236983",
"MountTargetId": "fsmt-5ca5d225",
"FileSystemId": "fs-cb6d9fb3",
"SubnetId": "subnet-01e7d9b7049dac31b",
"LifeCycleState": "available",
"IpAddress": "192.168.181.220",
"NetworkInterfaceId": "eni-0ae209289f623ff87",
"AvailabilityZoneId": "use2-az1",
"AvailabilityZoneName": "us-east-2a"
},
{
"OwnerId": "379740236983",
"MountTargetId": "fsmt-5ea5d227",
"FileSystemId": "fs-cb6d9fb3",
"SubnetId": "subnet-00546edc603da36cd",
"LifeCycleState": "available",
"IpAddress": "192.168.61.106",
"NetworkInterfaceId": "eni-03fb716a21c92efc7",
"AvailabilityZoneId": "use2-az3",
"AvailabilityZoneName": "us-east-2c"
},
{
"OwnerId": "379740236983",
"MountTargetId": "fsmt-5fa5d226",
"FileSystemId": "fs-cb6d9fb3",
"SubnetId": "subnet-07b0ce007053789db",
"LifeCycleState": "available",
"IpAddress": "192.168.97.136",
"NetworkInterfaceId": "eni-06324709e191c50e5",
"AvailabilityZoneId": "use2-az2",
"AvailabilityZoneName": "us-east-2b"
}
]
}

Expected behavior

The sample should work, with the training run completing as per the instructions.

Environment

EKS install using eksctl command with 2 GPU nodes.

Additional context

Out of the box Kubernetes documentation/support

Description

It would be best to provide Kubernetes examples and docs so that this is not AWS-specific.
K8s with auto-scaling could also potentially simplify the AWS-specific code and allow for more pipeline options in the future.

ModuleNotFoundError: No module named 'torch.distributed.elastic'

๐Ÿ› Bug

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

To Reproduce

Steps to reproduce the behavior:

  1. git clone https://github.com/pytorch/elastic
  2. cd elastic; docker build -t elastic:1.5-py3 -f Dockerfile .
  3. docker run -it --runtime=nvidia -v $(pwd):/workspace elastic:1.5-py3
  4. python -m torchelastic.distributed.launch
    -->
    Traceback (most recent call last):
    File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    File "/opt/torchelastic/torchelastic/distributed/launch.py", line 227, in
    import torch.distributed.elastic.rendezvous.registry as rdzv_registry
    ModuleNotFoundError: No module named 'torch.distributed.elastic'

Expected behavior

No error

Environment

  • torchelastic version (e.g. 0.1.0rc1): 0.2.2
  • OS (e.g., Linux): Ubuntu 18.04.5 LTS
  • How you installed torchelastic (conda, pip, source, docker): pip install torchelastic
  • Docker image and tag (if using docker): https://github.com/pytorch/elastic/blob/master/Dockerfile
  • Build command you used (if compiling from source):
  • Git commit (if installed from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Execution environment (on-prem, aws, etc):
  • Any other relevant information:

Additional context

Support for NCCL backend

โ“ Questions and Help

Question

The ImageNet example provides two options for the distributed backend: gloo and nccl. I see the following error when I specify nccl as the backend:

INFO 2020-03-18 18:12:24,436 All peers arrived. Confirming membership.
INFO 2020-03-18 18:12:24,504 Waiting for confirmations from all peers.
INFO 2020-03-18 18:12:24,506 Rendezvous version 4 is complete. Final state: {'status': 'final', 'version': '4', 'participants': [0], 'keep_alives': ['/torchelastic/p2p/run_bc8168d4694311eaa33f000d3a77161e/rdzv/v_4/rank_0'], 'num_workers_waiting': 0}
INFO 2020-03-18 18:12:24,506 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-03-18 18:12:24,509 coordinator_p2p: Got next rendezvous: rank 0, world size 1
[INFO] 2020-03-18 18:12:24,516 coordinator_p2p: Initialized process group rank 0, world size 1
[ERROR] 2020-03-18 18:12:24,517 coordinator_p2p: Rank: 0
Error: Tensors must be CUDA and dense
ErrorType: <class 'RuntimeError'>
StackTrace: Traceback (most recent call last):
File "/opt/miniconda/lib/python3.6/site-packages/torchelastic/train_loop.py", line 94, in run_train
state.sync(world_size, rank)
File "main.py", line 96, in sync
self._sync_state(rank)
File "main.py", line 130, in _sync_state
dist.broadcast(state_size, src=max_rank)
File "/opt/miniconda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 804, in broadcast
work = _default_pg.broadcast([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

  • Does elastic training support the NCCL backend?
  • Are there plans to support NCCL in the future, if it isn't supported currently?

[rdzv] Implement `etcd_rendezvous#load_extra_data()` timeout.

🚀 Feature

In EtcdRendezvous, the method load_extra_data(self, rdzv_version, key, timeout=None) has a timeout parameter which never gets used. Implement timeout enforcement using this parameter.
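
A sketch of how the timeout could be enforced with a polling deadline (illustrative only; read_key below is a placeholder for the actual etcd lookup, not the real method signature):

    import time

    def load_extra_data(read_key, key, timeout=None):
        deadline = None if timeout is None else time.time() + timeout
        while True:
            value = read_key(key)  # placeholder for the etcd lookup
            if value is not None:
                return value
            if deadline is not None and time.time() >= deadline:
                raise TimeoutError(f"load_extra_data timed out waiting for {key}")
            time.sleep(0.5)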

Motivation

Unimplemented method feature.

Pitch

N/A

Alternatives

N/A

Additional context

Elastic agent doesn't detect worker failures in NCCL

Context

I have been using torchelastic for a while to launch fault-tolerant jobs on CPUs using the gloo backend. I was switching to GPUs so that I could use broadcast and reduce. I first made the necessary modifications to move everything onto GPUs, then changed the backend for group initialization from gloo to nccl, hoping things would work as before. However, with nccl, when some workers get killed, the remaining workers stay in the previous rendezvous and hang, whereas the elastic agent should be able to detect a worker failure and halt all workers.

Current Behavior

With the nccl backend, when a worker is killed, the remaining workers hang instead of throwing a RuntimeError during all_reduce() as they do with the gloo backend.

The workers that are killed output this (which is expected):

...
[INFO] 2020-11-16 05:37:30,257 api: [default] All workers successfully finished.
>

However, for the remaining workers, the elastic agent doesn't declare the process group as failed. Here is the log obtained by using export NCCL_DEBUG=INFO:

multigpu:141:158 [0] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
multigpu:141:158 [0] NCCL INFO transport/net_socket.cc:405 -> 2
multigpu:141:158 [0] NCCL INFO include/net.h:28 -> 2
multigpu:141:158 [0] NCCL INFO transport/net.cc:357 -> 2
multigpu:141:158 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]

Expected Behavior

Just like with gloo, after some workers are killed, the remaining workers should be able to detect a missing member during all_reduce() and throw a RuntimeError, so that the local_elastic_agent can mark the worker group as failed, halt the training, and wait for a new worker to join at the next rendezvous.

The workers that are killed should output this:

...
[INFO] 2020-11-16 05:13:25,931 api: [default] All workers successfully finished.
>

The surviving workers should output this:

...
Traceback (most recent call last):
  File "worker.py", line 250, in <module>
    parse_args()
  File "worker.py", line 246, in parse_args
    init_processes(0, args)
  File "worker.py", line 219, in init_processes
    train(args)
  File "worker.py", line 130, in train
    update_gradients(model)
  File "worker.py", line 55, in update_gradients
    dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 948, in all_reduce
    work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [10.0.1.26]:5511
[ERROR] 2020-11-16 05:23:48,975 local_elastic_agent: [default] Worker group failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torchelastic-0.2.0-py3.8.egg/torchelastic/agent/server/local_elastic_agent.py", line 190, in _monitor_workers
    if self._process_context.join(timeout=-1):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/torchelastic-0.2.0-py3.8.egg/torchelastic/agent/server/local_elastic_agent.py", line 79, in _wrap
    ret = fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/torchelastic-0.2.0-py3.8.egg/torchelastic/distributed/launch.py", line 392, in wrapper_fn
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3.8', '-u', 'worker.py', '-pindex', '1', '-jobid', '53706', '-num_iters', '938']' returned non-zero exit status 1.

[INFO] 2020-11-16 05:23:48,975 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2020-11-16 05:23:48,975 api: [default] Stopping worker group
[INFO] 2020-11-16 05:23:48,976 api: [default] Rendezvous'ing worker group
INFO 2020-11-16 05:23:48,976 Attempting to join next rendezvous
INFO 2020-11-16 05:23:48,980 Observed existing rendezvous state: {'status': 'final', 'version': '41', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_53706/rdzv/v_41/rank_1', '/torchelastic/p2p/run_53706/rdzv/v_41/rank_0'], 'num_workers_waiting': 0}
INFO 2020-11-16 05:23:49,059 Added self to waiting list. Rendezvous full state: {"status": "final", "version": "41", "participants": [0, 1], "keep_alives": ["/torchelastic/p2p/run_53706/rdzv/v_41/rank_1", "/torchelastic/p2p/run_53706/rdzv/v_41/rank_0"], "num_workers_waiting": 1}
INFO 2020-11-16 05:23:49,065 Keep-alive key /torchelastic/p2p/run_53706/rdzv/v_41/rank_0 is not renewed.
INFO 2020-11-16 05:23:49,066 Rendevous version 41 is incomplete.
INFO 2020-11-16 05:23:49,066 Attempting to destroy it.
INFO 2020-11-16 05:23:49,072 Destroyed rendezvous version 41 successfully.
INFO 2020-11-16 05:23:49,073 Previously existing rendezvous state changed. Will re-try joining.
INFO 2020-11-16 05:23:49,073 Attempting to join next rendezvous
INFO 2020-11-16 05:23:49,089 New rendezvous state created: {'status': 'joinable', 'version': '42', 'participants': []}
INFO 2020-11-16 05:23:49,163 Joined rendezvous version 42 as rank 0. Full state: {'status': 'joinable', 'version': '42', 'participants': [0]}
INFO 2020-11-16 05:23:49,163 Waiting for remaining peers.

More details

  • I'm using dist.init_process_group(backend='gloo', init_method='env://') to initialize the process group.
  • I'm using the torchelastic launcher to launch the workers:
python3.8 -m torchelastic.distributed.launch --nnodes=2 --nproc_per_node=1 --rdzv_id=53706 --rdzv_backend=etcd --rdzv_endpoint=10.0.1.26:2379 worker.py
  • OS: Linux 5.3.0-1032-azure x86_64; Ubuntu 18.04.4
  • CUDA and NCCL version: CUDA11.0 (11.0-devel-ubuntu18.04), NCCL2.7.8-1
  • Framework (TF, PyTorch, MXNet): PyTorch3.8 (1.7.0+cu110)
  • torchelastic release: 0.2.0 (45dc33f)
  • Please let me know if I need to provide more information!

ClassyVision distributed training hang after scaledown training Nodes

๐Ÿ› Bug

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

Background

We are also trying to support PET on K8s, and the ImageNet Example is already supported without any issue, see: https://github.com/microsoft/frameworkcontroller/tree/master/example/framework/scenario/pytorch/elastic.

However, there are some issues supporting distributed ClassyVision; the main one is this hang issue.
This issue MAY already have been reported in #25, but that thread is no longer active.

To Reproduce

Reproduce Key Points

  1. The current ClassyVision K8s example does not appear to be "distributed training", since --distributed_backend is not set to ddp.
    So I used ddp and found the issue.
  2. The current ClassyVision default example only has 2 num_epochs, so it runs very briefly (<1 min), which makes it hard to test scale-down during such a short run.
    So I increased num_epochs to 100.
  3. The current ClassyVision K8s example cannot specify --num_workers=0, since ClassyVision crashes with the error: classy_train.py: error: unrecognized arguments: --num_workers=0. I then tried adding num_workers: 0 to the ClassyVision config file, like this UT, but it still crashes with the error: ValueError: multiprocessing_context can only be used with multi-process loading (num_workers > 0), but got num_workers=0.
    So I do not specify num_workers anywhere, and use this k8s approach to expose shared memory to the containers.

Steps to reproduce the issue on K8s:
(You can do something similar without K8s; K8s is used here for simplicity.)

  1. Assume etcd is already set up and its address is pet-etcd:2379
  2. Create P2P discovery Service on K8s:
apiVersion: v1
kind: Service
metadata:
  name: cv-test
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: cv-test
  1. Create the 3 Pods below on K8s, each with the {{INDEX}} placeholder instantiated to 0, 1, or 2:
apiVersion: v1
kind: Pod
metadata:
  name: cv-test-{{INDEX}}
  labels:
    app: cv-test
spec:
  hostname: cv-test-{{INDEX}}
  subdomain: cv-test
  containers:
  - name: elasticjob-worker
    image: torchelastic/examples:0.2.0
    imagePullPolicy: Always
    command: [
      "bash", "-c",
      "sed -i -e 's/\"num_epochs\": 2/\"num_epochs\": 100/g'
      /workspace/classy_vision/configs/template_config.json &&
      python -m torchelastic.distributed.launch
      --rdzv_backend=etcd
      --rdzv_endpoint=pet-etcd:2379
      --rdzv_id=cv-test
      --nnodes=1:4
      --nproc_per_node=1
      /workspace/classy_vision/classy_train.py
      --config_file /workspace/classy_vision/configs/template_config.json
      --distributed_backend ddp"]
    volumeMounts:
    - name: shm-volume
      mountPath: /dev/shm
  volumes:
  - name: shm-volume
    emptyDir:
      medium: Memory
  1. All Pods will train, with logs like the following:
[INFO] 2020-08-03 07:53:06,129 launch: Running torchelastic.distributed.launch with args: ['/opt/conda/lib/python3.7/site-packages/torchelastic/distributed/launch.py', '--rdzv_backend=etcd', '--rdzv_endpoint=pet-etcd:2379', '--rdzv_id=cv-test', '--nnodes=1:4', '--nproc_per_node=1', '/workspace/classy_vision/classy_train.py', '--config_file', '/workspace/classy_vision/configs/template_config.json', '--distributed_backend', 'ddp']
INFO 2020-08-03 07:53:06,136 Etcd machines: ['http://0.0.0.0:2379']
[INFO] 2020-08-03 07:53:06,144 launch: Using nproc_per_node=1.
[INFO] 2020-08-03 07:53:06,890 api: [default] starting workers for function: wrapper_fn
[INFO] 2020-08-03 07:53:06,890 api: [default] Rendezvous'ing worker group
INFO 2020-08-03 07:53:06,890 Attempting to join next rendezvous
INFO 2020-08-03 07:53:06,894 Observed existing rendezvous state: {'status': 'final', 'version': '15', 'participants': [0], 'keep_alives': ['/torchelastic/p2p/run_cv-test/rdzv/v_15/rank_0'], 'num_workers_waiting': 0}
INFO 2020-08-03 07:53:06,988 Added self to waiting list. Rendezvous full state: {"status": "final", "version": "15", "participants": [0], "keep_alives": ["/torchelastic/p2p/run_cv-test/rdzv/v_15/rank_0"], "num_workers_waiting": 1}
INFO 2020-08-03 07:53:06,990 Keep-alive key /torchelastic/p2p/run_cv-test/rdzv/v_15/rank_0 is not renewed.
INFO 2020-08-03 07:53:06,990 Rendevous version 15 is incomplete. 
INFO 2020-08-03 07:53:06,990 Attempting to destroy it.
INFO 2020-08-03 07:53:06,991 Destroyed rendezvous version 15 successfully.
INFO 2020-08-03 07:53:06,992 Previously existing rendezvous state changed. Will re-try joining.
INFO 2020-08-03 07:53:06,992 Attempting to join next rendezvous
INFO 2020-08-03 07:53:06,999 New rendezvous state created: {'status': 'joinable', 'version': '16', 'participants': []}
INFO 2020-08-03 07:53:07,012 Joined rendezvous version 16 as rank 0. Full state: {'status': 'joinable', 'version': '16', 'participants': [0]}
INFO 2020-08-03 07:53:07,013 Rank 0 is responsible for join last call.
INFO 2020-08-03 07:53:38,023 Rank 0 finished join last call.
INFO 2020-08-03 07:53:38,024 Waiting for remaining peers.
INFO 2020-08-03 07:53:38,025 All peers arrived. Confirming membership.
INFO 2020-08-03 07:53:38,120 Waiting for confirmations from all peers.
INFO 2020-08-03 07:53:38,122 Rendezvous version 16 is complete. Final state: {'status': 'final', 'version': '16', 'participants': [0, 1, 2], 'keep_alives': ['/torchelastic/p2p/run_cv-test/rdzv/v_16/rank_1', '/torchelastic/p2p/run_cv-test/rdzv/v_16/rank_2', '/torchelastic/p2p/run_cv-test/rdzv/v_16/rank_0'], 'num_workers_waiting': 0}
INFO 2020-08-03 07:53:38,122 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-08-03 07:53:38,128 api: [default] Rendezvous complete for workers.
Result:
	restart_count=0
	group_rank=0
	group_world_size=3
	rank stride=1
	assigned global_ranks=[0]
	master_addr=cv-test-0.cv-test.default.svc.cluster.local
	master_port=37129

[INFO] 2020-08-03 07:53:38,128 api: [default] Starting worker group
INFO:root:Classy Vision's default training script.
INFO:root:AMP disabled
INFO:root:mixup disabled
INFO:root:Synchronized Batch Normalization is disabled
INFO:root:Logging outputs to /workspace/classy_vision/output_2020-08-03T07:53:39.917536
INFO:root:Logging checkpoints to /workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints
WARNING:root:tensorboardX not installed, skipping tensorboard hooks
INFO:root:Starting training on rank 0 worker. World size is 1
INFO:root:Done setting up distributed process_group with rank 0, world_size 3
INFO:root:Using GPU, CUDA device index: 0
INFO:root:Starting training. Task: <classy_vision.tasks.classification_task.ClassificationTask object at 0x7f4a4992b2d0> initialized with config:
{
    "name": "classification_task",
    "num_epochs": 100,
    "loss": {
        "name": "my_loss"
    },
    "dataset": {
        "train": {
            "name": "my_dataset",
            "crop_size": 224,
            "class_ratio": 0.5,
            "num_samples": 320,
            "seed": 0,
            "batchsize_per_replica": 32,
            "use_shuffle": true,
            "transforms": [
                {
                    "name": "generic_image_transform",
                    "transforms": [
                        {
                            "name": "RandomResizedCrop",
                            "size": 224
                        },
                        {
                            "name": "RandomHorizontalFlip"
                        },
                        {
                            "name": "ToTensor"
                        },
                        {
                            "name": "Normalize",
                            "mean": [
                                0.485,
                                0.456,
                                0.406
                            ],
                            "std": [
                                0.229,
                                0.224,
                                0.225
                            ]
                        }
                    ]
                }
            ]
        },
        "test": {
            "name": "my_dataset",
            "crop_size": 224,
            "class_ratio": 0.5,
            "num_samples": 100,
            "seed": 1,
            "batchsize_per_replica": 32,
            "use_shuffle": false,
            "transforms": [
                {
                    "name": "generic_image_transform",
                    "transforms": [
                        {
                            "name": "Resize",
                            "size": 256
                        },
                        {
                            "name": "CenterCrop",
                            "size": 224
                        },
                        {
                            "name": "ToTensor"
                        },
                        {
                            "name": "Normalize",
                            "mean": [
                                0.485,
                                0.456,
                                0.406
                            ],
                            "std": [
                                0.229,
                                0.224,
                                0.225
                            ]
                        }
                    ]
                }
            ]
        }
    },
    "meters": {
        "accuracy": {
            "topk": [
                1
            ]
        }
    },
    "model": {
        "name": "my_model"
    },
    "optimizer": {
        "name": "sgd",
        "param_schedulers": {
            "lr": {
                "name": "step",
                "values": [
                    0.1,
                    0.01
                ]
            }
        },
        "weight_decay": 0.0001,
        "momentum": 0.9,
        "num_epochs": 100,
        "lr": 0.1,
        "nesterov": false,
        "use_larc": false,
        "larc_config": {
            "clip": true,
            "eps": 1e-08,
            "trust_coefficient": 0.02
        }
    }
}
INFO:root:Number of parameters in model: 2402
WARNING:root:Model contains unsupported modules, could not compute FLOPs for model forward pass.
INFO:root:Model does not implement input_shape. Skipping activation calculation.
INFO:root:Synced meters: [0] train phase 0 (100.00% done), loss: 0.1719, meters: [accuracy_meter(top_1=0.850467)]
INFO:root:Saving checkpoint to '/workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints'...
INFO:root:Synced meters: [0] test phase 0 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)]
INFO:root:Synced meters: [0] train phase 1 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)
INFO:root:Saving checkpoint to '/workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints'...
  1. Then delete the Pod whose group_rank is 0 (in this example, cv-test-0), and the training will hang forever at a log line like the one below:
    (no more logs are produced, but the Pod keeps running forever)
INFO:root:Synced meters: [0] test phase 40 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)]
INFO:root:Synced meters: [0] train phase 41 (100.00% done), loss: 0.0000, meters: [accuracy_meter(top_1=1.000000)]
INFO:root:Saving checkpoint to '/workspace/classy_vision/output_2020-08-03T07:53:39.917536/checkpoints'...

Expected behavior

After scale-down, the remaining workers should re-rendezvous and recover from the last epoch checkpoint, with logs like those in step 6 of https://github.com/microsoft/frameworkcontroller/tree/master/example/framework/scenario/pytorch/elastic#imagenet-example

Environment

  • torchelastic version (e.g. 0.1.0rc1): torchelastic/examples:0.2.0
  • OS (e.g., Linux): torchelastic/examples:0.2.0
  • How you installed torchelastic (conda, pip, source, docker): torchelastic/examples:0.2.0
  • Docker image and tag (if using docker): torchelastic/examples:0.2.0
  • Build command you used (if compiling from source):
  • Git commit (if installed from source):
  • Python version: torchelastic/examples:0.2.0
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Execution environment (on-prem, aws, etc): K8s
  • Any other relevant information:

Additional context

  1. It would be good to also fix the issue in the ClassyVision repo:
    https://github.com/facebookresearch/ClassyVision/blob/master/examples/elastic/docker-compose.yaml

For Kubernetes provide sample using Nvidia GPU operator

Description

Add a sample that uses the Nvidia GPU operator for Kubernetes instead of the older daemonset-based approach.

Motivation/Background

The Nvidia GPU operator simplifies the deployment and management of GPU containers on Kubernetes. Using this solution will make it easier to manage and monitor the GPU resources. More details here.

Detailed Proposal

Use the Nvidia GPU operator as part of the base install steps for GPU training as described here

Redefine `should_save_checkpoint` in state

๐Ÿš€ Feature

Currently state API has a should_save_checkpoint which has a couple of issues:

  1. CheckpointUtil assumes that all workers will return the same value from should_save_checkpoint
  2. CheckpointUtil chooses worker with rank == 0 to be the "representative" to load the checkpoint, then leans on sync() to broadcast the state to other workers.
  3. #2 may not be the correct choice that generalizes to different use-cases. The "correct" logic should be to choose the worker with the "most-tenured" state (i.e. the most up-to-date state) to broadcast the state.

Motivation

The checkpoint feature in torchelastic has many caveats (see above). Cleaning this logic up would make it clearer to users how to implement their state objects, and also make it easier to reason about how checkpoint loading and saving interacts with their implementations of sync() and the load and save methods in the state class.

Pitch

Here's one way we could achieve this:

  1. Define a get_most_tenured API that the user has to implement to return the rank of the worker with the most "up to date" state that should be shared with other workers on a rendezvous event.

  2. Add helpers to broadcast state objects to the workers, this helper can be called in the sync() method. For instance:

def get_most_tenured_rank():
    # get the rank that has the most up-to-date state,
    # or just return a consistent rank
    pass

def sync():
    most_tenured_rank = get_most_tenured_rank()
    dist_util.broadcast_state(state, most_tenured_rank)
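
For illustration, a broadcast_state helper along these lines could be built on torch.distributed directly; this is a hedged sketch, not an existing torchelastic API, and it assumes a gloo process group (CPU tensors) and a picklable state object:

import pickle

import torch
import torch.distributed as dist

def broadcast_state(state, src_rank):
    # Broadcast the payload size first, then the pickled state itself.
    if dist.get_rank() == src_rank:
        payload = torch.as_tensor(bytearray(pickle.dumps(state)), dtype=torch.uint8)
        size = torch.tensor([payload.numel()], dtype=torch.long)
    else:
        size = torch.tensor([0], dtype=torch.long)
    dist.broadcast(size, src=src_rank)
    if dist.get_rank() != src_rank:
        payload = torch.empty(size.item(), dtype=torch.uint8)
    dist.broadcast(payload, src=src_rank)
    return pickle.loads(payload.numpy().tobytes())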

Alternatives

  1. Can bake most_tenured_rank concept into the checkpoint util

Additional context

  1. https://github.com/pytorch/elastic/blob/master/torchelastic/checkpoint/api.py#L146
  2. https://github.com/pytorch/elastic/blob/master/torchelastic/state.py#L153
  3. https://github.com/pytorch/elastic/blob/master/torchelastic/checkpoint/api.py#L103

[train_loop] Remove usage of `state` returned by `train_step`

Description

Remove assignment and usage (if any) of state as the return value from train_step in the train_loop.

https://github.com/pytorch/elastic/blob/master/torchelastic/train_loop.py#L29

Should change to

# <omitted...>
worker_stats = train_step(state)
# <omitted...>

Motivation/Background

Currently the train loop expects train_step to return state and worker_stats. Returning state from train_step is unnatural since state is expected to be mutable and is the single argument to train_step(state). In most cases users will do something like:

def train_step(state):
   # do train (mutates state)
   return state, worker_stats

It is both simpler and more natural for train_step not to return state.
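
Under the proposal, a user's train_step would simply mutate state in place and return only the stats, e.g. (a minimal illustration; worker_stats stands for whatever stats object the application reports):

def train_step(state):
    # do train (mutates state in place)
    worker_stats = ...  # application-specific stats collected during the step
    return worker_stats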

Detailed Proposal

See code snippet in description section

Alternatives

N/A

Additional context/links

See link in description section

Kubernetes Konfusion

โ“ Questions and Help

Please note that this issue tracker is not a help form and this issue will be closed.

Before submitting, please ensure you have gone through our documentation. Here
are some links that may be helpful:

Question

I'm a bit confused as to the proper method to run Torch Elastic on Kubernetes:

I'm not sure which set of instructions I should use, and in addition, for the first one I can't seem to find the place in the code where the training code is pulled from S3. Could someone shed some light?

Switch to EtcdStore when torch 1.4 is released

๐Ÿš€ Feature

Currently EtcdRendezvous returns a TCPStore. We have already implemented an EtcdStore which should be used to keep things consistent. At the moment it is not possible to use torch distributed process groups with EtcdStore since EtcdStore is a pure python implementation and torch distributed expects there to be a pybind. A trampoline was added in torch to make this possible and should be released with torch 1.4.

Motivation

See https://github.com/pytorch/elastic/blob/master/torchelastic/rendezvous/etcd_rendezvous.py#L105

Pitch

N/A

Alternatives

N/A

Additional context

Request for Feedback: PyTorch Elastic Trainer v0.2

Introduction

PyTorch Elastic Trainer (PET) provides a framework for conveniently training models across a compute cluster in a fault tolerant and elastic manner. PET provides these features in two ways:

  1. When a PyTorch worker process throws a certain class of retriable errors, it is caught by PET and the training process is retried.
  2. Workers can leave or join the process pool of an existing training job at any point, as long as the number of workers stays within the bounds specified when starting the job. When a membership change happens, all the workers re-rendezvous to establish a new process group, and training resumes from the previous well-known good state.

In order to integrate with PET, a PyTorch user needs to make the following changes to their training logic:

  1. They need to enable PET to control their training loop. Essentially, they provide an "inner training" loop that is wrapped in a retryable loop by PET. All aspects of establishing or re-establishing the process group, as well as restoring the user's trainer to a known good state, are handled by the retryable PET loop. See this for how the PET loop is implemented.
  2. They need to specify what the state is that needs to be restored in case a new worker joins the pool and how the state is applied to a new worker. The API for specifying these is described by the State object here.

PET v.0.1 was released on GitHub, PyPI and Docker Hub in November 2019 and since then the community has contributed integrations with Amazon Web Services (via Elastic Kubernetes Service) and Microsoft Azure (via Azure Kubernetes Service).

Lessons learned from PET v0.1

In porting existing PyTorch-based projects such as ClassyVision and PyText to use PET, we encountered a few areas for refinement in the v0.1 design.

First, adapting a mature training library such as ClassyVision to use the elastic training APIs often requires a significant amount of restructuring, frequently causing a bifurcation of code paths between the elastic and non-elastic implementations.

Second, it is non-trivial to correctly implement the state restore logic for each application during in-process recovery. While explicit state such as weight tensors is easy to save and restore, there is often "hidden" or implicit state in the application that is hard for the developer to reason about. For example, after a rendezvous round, a worker process might be expected to restore the state of C++ objects in CPU or GPU memory, which is extremely error-prone, especially after failures or exceptions. To compound this issue, several applications such as PyText already implement some form of checkpoint/restart, and this logic often needs to be taken into account when implementing the elastic state.

Finally, one of the goals of PET v0.1 was to detect and restart straggler workers. This was not possible when running the training loop in process and necessitated writing an additional watchdog process to monitor the main training process.

For the next iteration of PET, we would like to propose a design that makes it significantly simpler to port existing training workflows to an elastic infrastructure and results in applications that can recover more reliably from workflow failures.

Overview of the new design

In PET v.0.2, we no longer attempt to recover errors in the training function. Instead, PET attempts to maintain the number of worker processes such that they stay within the [min, max] bounds required for the job. The application writer is responsible for loading and restarting from an existing checkpoint file if one is available. Unlike v0.1, PET v0.2 does not mandate how checkpoints are managed. An application writer is free to use just torch.save and torch.load from PyTorch, or a higher-level framework such as PyTorch Lightning.

PET v0.2 is implemented using a new process named elastic-agent. There is a single elastic-agent per job, per node. Each agent process is only responsible for managing a set of worker processes local to that node and for coordinating process group membership changes with the elastic agents on other nodes allocated to that job. This is illustrated in the diagram below:

[diagram: a single elastic-agent per node, each managing its local worker processes]

Membership changes are handled as follows: when a worker process fails, the corresponding elastic agent managing it kills all the workers on that node, establishes rendezvous with the other agents, and restarts workers with the new rendezvous information. However, when an agent exits with a non-zero error code, it is up to a higher-level orchestrator such as Kubernetes to restart the agent (which in turn will restart all the workers it is responsible for). The same recovery mechanism holds for node-level failures. An orchestrator such as Kubernetes will schedule a job such that a minimum number of replicas of the elastic agent are running, and each agent will in turn orchestrate the user's training script.

To adopt PET v0.2, an application simply needs its entry-point or main function to be compatible with the PyTorch distributed launcher. We expect distributed training jobs that are started via the distributed launcher to be seamlessly started via the elastic agent with no or minimal code changes. The only difference is that in the latter case, the application will be able to make progress in the presence of certain failures.

Overview of the API

As mentioned above, with PET v0.2 there is no separate library for a training application to integrate with. Instead, the user simply launches a training job via the elastic agent monitor process. For example, if a user starts their job with the PyTorch distributed launcher as follows:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_ON_NODE
               TRAINING_SCRIPT.py (... train script args ...)

they would instead use:

python -m torchelastic.distributed.launch --nproc_per_node=NUM_GPUS_ON_NODE
                --nnodes=1:4
                --rdzv_id=JOB_ID
                --rdzv_backend=etcd
                --rdzv_endpoint=ETCD_HOST:ETCD_PORT
                TRAINING_SCRIPT.py (... train script args ...)

Notice that it adds a few additional parameters:

  1. The min and max number of nodes. During a rendezvous, if the number of nodes drops below the specified threshold, the job is aborted.
  2. A rendezvous type and its configuration.

Inside the training script, the only potential change the user needs to make is to ensure that they use environment variables to initialize the process group, i.e., create the process group as follows:

import torch.distributed as dist

dist.init_process_group(init_method="env://", backend="gloo")
# or
dist.init_process_group(init_method="env://", backend="nccl")

All the parameters for initializing the group (the world size, the numerical rank, the master address and port) are passed in as environment variables by the parent elastic agent.
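
For reference, a minimal sketch of reading these inside the script (variable names assume the standard env:// convention; LOCAL_RANK is commonly used to pick the GPU on multi-GPU nodes):

import os

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
master_addr = os.environ["MASTER_ADDR"]
master_port = os.environ["MASTER_PORT"]
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # assumption: exported by the agent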

The new PET design is intentionally "bare-bones": it trades off the granularity with which an application can recover for simplicity and robustness. In the future, we hope to provide more APIs for convenient checkpointing that a developer can optionally use for more efficient restart semantics.

Implementation details and next steps

An implementation of the above ideas is available in PR #65. We encourage the community to evaluate the new functionality and give us feedback on the trade-offs we have made in the design, either in the PR or in this issue. We look forward to hearing from you!

Wrapping elastic-job kubernetes in initcontainer

After successfully running an ElasticJob on Kubernetes, I would like to move the contents of an output folder to another location. For instance, my Kubernetes cluster is actually EKS and I want to put the output on S3.
I would usually wrap my Job/Deployment/Pod/... such that the main container moves the output of an initContainer which actually performs the training, i.e.

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    eks.amazonaws.com/nodegroup: worker-group
  containers:
    - name: output-data
      image: XXX.ecr.us-XXX-X.amazonaws.com/side_car
      command: ["aws", "s3", "cp", "/output", "s3://OUTPUT", "--recursive"]
      volumeMounts:
        - name: output
          mountPath: /output
  initContainers:
    - name: worker
      image: XXX-XXX-1.amazonaws.com/eks/pytorch:detectron2
      imagePullPolicy: Always
      args:
        - "--nproc_per_node=4"
        - "/code/tools/train_net_elastic.py"
      resources:
        limits:
          nvidia.com/gpu: 4
      ...
...
Here my container just utilizes a side-car that does the aws s3 copying, and the initContainer would be the trainer populating the /output directory.

However, since the apiVersion: elastic.pytorch.org/v1alpha1 controller actually gives the

args:
  --rdzv_backend=etcd
  --rdzv_endpoint=etcd-service:2379
  --rdzv_id=em-tech-maskrcnn
  --nnodes=1:2

to the container and not to an initContainer, I am seeking guidance on how to work around this obstacle.

Using IP address instead of name for TCPStore creation

In my setup, I get "host not found" errors for TCPStore :

INFO 2020-01-17 00:55:01,694 Using TCPStore for c10d::Store implementation
INFO 2020-01-17 00:55:01,702 Rank 1 will conenct to TCPStore server at pytorch-elastic-test-z2b7s:47279
[ERROR] 2020-01-17 00:55:01,724 coordinator_p2p: Rank: -1
Error: Rank -1 received an Exception. Detailed message: host not found: Name or service not known

Changing to use IP address instead of name for TCPStore creation fixes the issue for me.

$ git diff
diff --git a/torchelastic/rendezvous/etcd_rendezvous.py b/torchelastic/rendezvous/etcd_rendezvous.py
index 01215b6..219bff3 100644
--- a/torchelastic/rendezvous/etcd_rendezvous.py
+++ b/torchelastic/rendezvous/etcd_rendezvous.py
@@ -1074,7 +1074,7 @@ def setup_tcpstore(rank, world_size, rdzv_version, rdzv_impl):
         # FIXME: ideally, TCPStore should have an API that
         # accepts a pre-constructed socket.
         with closing(_get_socket_with_port()) as sock:
-            host = socket.gethostname()
+            host = socket.gethostbyname(socket.gethostname())
             port = sock.getsockname()[1]

Is there a reason why we may want to use the hostname? Or is it always OK to use the IP address?

Or maybe, because PyTorch 1.4 has been released, we should just switch to EtcdStore?

pip install torchelastic fails looking for torch 1.5

โ“ Questions and Help

Please note that this issue tracker is not a help form and this issue will be closed.

Before submitting, please ensure you have gone through our documentation. Here
are some links that may be helpful:

Question

I might be missing something; however, when I follow the install instructions in the repo README, which say to install with pip install torchelastic, I get the following error:
[screenshot of the pip install error]

I even tried following the steps found in dockerhub for how you install torch on the 0.2.0rc1 image
/bin/sh -c pip uninstall -y -qqq torch
/bin/sh -c pip install --progress-bar off --pre torch -f https://download.pytorch.org/whl/test/cu101/torch_test.html

and still was met with the same error.

I'm trying to use this to test some stuff locally, just wondering what I could do to remedy this.

Out of Date documentation

๐Ÿ“š Documentation

  1. In the documentation, there is a submodule multiprocessing. But I can't find it.

Link

https://pytorch.org/elastic/0.2.2/multiprocessing.html#torchelastic.multiprocessing.start_processes

What does it currently say?

from torchelastic.multiprocessing import Redirect, start_processes

What should it say?

I don't know. I can't find the multiprocessing module.

Others

1. pip can't install torchelastic==0.2.2
I used pip to install torchelastic, but I can only install the v0.2.0 version.
2. Documentation version switching
The documentation cannot be switched between different versions.

my ENV

generated with https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py

Collecting environment information...
PyTorch version: 1.8.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: 
GPU 0: GeForce GTX 1080 Ti

Nvidia driver version: 440.64
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.1
[pip3] torch==1.8.1
[pip3] torchelastic==0.2.0
[pip3] torchvision==0.8.1
[conda] Could not collect

Add env support for the training script argument

Description

A common way to use the elastic tools is shown below:

python -m torchelastic.distributed.launch \
    --nnodes=$NUM_NODES \
    --nproc_per_node=$WORKERS_PER_NODE \
    --rdzv_id=$JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=$ETCD_HOST:$ETCD_PORT \
    main.py \
    --arch resnet18 \
    --epochs 20 \
    --batch-size 32 \
    <DATA_DIR>

Would it be possible to support the args nnodes, rdzv_id, rdzv_backend, rdzv_endpoint, ... via environment variables like $NUM_NODES, $JOB_ID, $RDZV_BACKEND, $RDZV_ENDPOINT, so they do not need to be present in the args?

Motivation/Background

This would make the elastic tools work more smoothly on k8s, with no need to handle these args in the controller's reconcile logic.

Detailed Proposal

One possible proposal is to support environment variables in torchelastic/distributed/launch.py: if an arg is not present, the launcher looks for the corresponding environment variable.
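
A hedged sketch of the fallback using argparse defaults; the PET_* variable names below are illustrative assumptions, not an agreed-upon convention:

import argparse
import os

def env_or(var, default=None):
    # Use the environment variable as the default when the CLI flag is omitted.
    return os.environ.get(var, default)

parser = argparse.ArgumentParser()
parser.add_argument("--nnodes", default=env_or("PET_NNODES", "1:1"))
parser.add_argument("--rdzv_id", default=env_or("PET_RDZV_ID"))
parser.add_argument("--rdzv_backend", default=env_or("PET_RDZV_BACKEND", "etcd"))
parser.add_argument("--rdzv_endpoint", default=env_or("PET_RDZV_ENDPOINT", ""))
args, script_args = parser.parse_known_args()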

Alternatives

Additional context/links

Trouble connecting PersistentVolume/Claim to ElasticJob

Hey guys!

The use case here is to create a disk, populate it with data, and then connect it to the ElasticJob in ReadWriteOnce mode using k8s configs. To extend this, I'd like to move from ReadWriteOnce to ReadOnlyMany so the disk acts as a datastore that all nodes can read the dataset from!

Unfortunately, it seems that after populating the disk, the ElasticJob comes back with a persistentvolumeclaim "audio-data-claim" not found error message. With a non-elastic job I can see the disk has been populated and mounted.

I'm using GKE for this!

Configs

storage.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
provisioner: pd.csi.storage.gke.io
metadata:
  name: storage
parameters:
  type: pd-ssd
  fstype: ext4
  replication-type: none

persistent_volume.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: audio-data
spec:
  storageClassName: "storage"
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: audio-data
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: audio-data-claim
spec:
  storageClassName: "storage"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

transfer_data.yaml

apiVersion: v1
kind: Pod
metadata:
  name: transfer-data
spec:
  containers:
    - image: seannaren/deepspeech.pytorch
      imagePullPolicy: Always
      name: deepspeech
      command: ["python"]
      args:
        - "/workspace/deepspeech.pytorch/data/an4.py"
        - "--target-dir=/audio-data/an4_dataset/"
        - "--manifest-dir=/audio-data/an4_manifests/"
      volumeMounts:
        - mountPath: /audio-data/
          name: audio-data
  restartPolicy: Never
  volumes:
    - name: audio-data
      persistentVolumeClaim:
        claimName: audio-data-claim

train.yaml

apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: deepspeech
  namespace: elastic-job
spec:
  # Use "etcd-service:2379" if you already apply etcd.yaml
  rdzvEndpoint: "etcd-service:2379"
  minReplicas: 1
  maxReplicas: 1
  replicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: ExitCode
      template:
        apiVersion: v1
        kind: Pod
        spec:
          containers:
            - name: deepspeech
              image: seannaren/deepspeech.pytorch
              imagePullPolicy: Always
              command: ["python", "-m", "torchelastic.distributed.launch"]
              args:
                - "--nproc_per_node=1"
                - "/root/deepspeech.pytorch/train.py"
                - "data.train_manifest=/audio-data/an4_manifests/an4_train_manifest.csv"
                - "data.val_manifest=/audio-data/an4_manifests/an4_val_manifest.csv"
                - "data.num_workers=8"
                - "training.epochs=70"
                - "data.batch_size=8"
                - "checkpointing.save_folder=/audio-data/models/"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /audio-data/
                  name: audio-data
                  readOnly: true
          volumes:
            - name: audio-data
              persistentVolumeClaim:
                claimName: audio-data-claim
                readOnly: true

Commands

gcloud container clusters create torchelastic \
    --accelerator type=nvidia-tesla-v100,count=1\
    --machine-type=n1-standard-4 \
    --disk-size=25Gi \
    --zone=us-west1-b \
    --cluster-version=1.15 \
    --preemptible \
    --num-nodes=1

# Within torch elastic/kubernetes repo...
kubectl apply -k config/default
kubectl apply -f https://raw.githubusercontent.com/pytorch/elastic/master/kubernetes/config/samples/etcd.yaml

gcloud compute disks create --size 10Gi audio-data --zone us-west1-b
kubectl apply -f storage.yaml
kubectl apply -f persistent_volume.yaml

kubectl apply -f transfer_data.yaml
## Once finished transferring by checking kubectl logs transfer-data
kubectl delete -f transfer_data.yaml

kubectl apply -f train.yaml # This fails

However, running the following shows, when checking the logs, that the disk has been populated:

check_storage.yaml

apiVersion: v1
kind: Pod
metadata:
  name: check-storage
spec:
  containers:
    - image: seannaren/deepspeech.pytorch
      imagePullPolicy: Always
      name: deepspeech
      command: ["/bin/ls"]
      args:
        - "/audio-data/"
      volumeMounts:
        - mountPath: /audio-data/
          name: audio-data
  restartPolicy: Never
  volumes:
    - name: audio-data
      persistentVolumeClaim:
        claimName: audio-data-claim

Any help would be appreciated!

EDIT: I've tried both removing and adding the readOnly flag and it doesn't fix the issue

How to programmatically determine if a training job has finished using `kubectl`?

โ“ Questions and Help

How to programmatically determine if a training job has finished using kubectl?
The field status.replicaStatuses.Worker.succeeded seems to indicate the number of succeeded pods.
How does one determine if the whole job has succeeded?
This is useful when the training job is part of a workflow (e.g. orchestrated by argo or airflow).
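
One hedged way to check this programmatically is via the Kubernetes Python client rather than raw kubectl; the plural name elasticjobs, the job name, and the status/spec field paths below are assumptions based on the samples and the field mentioned above:

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
job = api.get_namespaced_custom_object(
    group="elastic.pytorch.org",
    version="v1alpha1",
    namespace="elastic-job",
    plural="elasticjobs",   # assumed plural for kind ElasticJob
    name="imagenet",        # hypothetical job name
)
succeeded = job.get("status", {}).get("replicaStatuses", {}).get("Worker", {}).get("succeeded", 0)
desired = job["spec"]["replicaSpecs"]["Worker"]["replicas"]
print("finished" if succeeded >= desired else "still running")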

Please note that this issue tracker is not a help form and this issue will be closed.

Before submitting, please ensure you have gone through our documentation. Here
are some links that may be helpful:

Question

Improve docs on writing training scripts compatible with scale down

๐Ÿ“š Documentation

Link

What does it currently say?

The current doc does not have explicit instructions on setting NCCL_BLOCKING_WAIT, which is essential for scale-downs with nccl as the process group backend, since it ensures that workers do not get blocked forever waiting for others in the nccl kernel. See #115 for more context.

What should it say?

Mention that:

  1. NCCL_BLOCKING_WAIT=1 should be set as environment variable
  2. When #1 is true, the timeout parameter in dist.init_process_group() becomes the time after which the nccl watchdog (in pytorch) times out nccl kernels that do not return promptly. The default is 30 min, which may be too long for scale-down events. Document that the user should set this parameter to whatever makes sense for their application - it will be a function of the frequency of scale-down events and the "size" of the application's nccl operations. Setting this value too small will result in false positives where normal long-running nccl kernels are timed out. (A minimal sketch of both settings follows this list.)
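
A minimal sketch of the two settings together (the 60-second timeout is illustrative; tune it per application):

import datetime
import os

import torch.distributed as dist

os.environ["NCCL_BLOCKING_WAIT"] = "1"  # must be set before init_process_group
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(seconds=60),  # illustrative value
)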

Why?

Not setting NCCL_BLOCKING_WAIT results in the application not being able to scale down properly.

imagenet example does not work with nccl

๐Ÿ› Bug

TL;DR: the ImageNet example crashes with nccl.
See #61

Component (check all that applies):

  • state api
  • train_step api
  • train_loop
  • rendezvous
  • checkpoint
  • rollback
  • metrics
  • petctl
  • examples
  • docker
  • other

To Reproduce

Run the example with backend=="nccl"

Expected behavior

Should run

Environment

Does not matter

Additional context

Two potential fixes:

  1. Create the state_tensor and state_size tensors (https://github.com/pytorch/elastic/blob/master/torchelastic/p2p/coordinator_p2p.py#L153) as GPU tensors if the backend is "nccl" (see the sketch after this list)

  2. Create a secondary gloo-based process group to sync state from max_rank.
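
A hedged sketch of fix 1 only; state_bytes stands in for the serialized state, and the real code lives in coordinator_p2p.py:

import torch
import torch.distributed as dist

# Place the coordination tensors on GPU only when the default group uses nccl.
device = "cuda" if dist.get_backend() == "nccl" else "cpu"
state_size = torch.tensor([len(state_bytes)], dtype=torch.long, device=device)
state_tensor = torch.as_tensor(bytearray(state_bytes), dtype=torch.uint8).to(device)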

EtcdStore: No constructor defined

Hi,

I've updated my torchelastic to latest (including 393a26c commit) and PyTorch to 1.4.

My test setup used to work OK with TCPStore, now I get an error:

INFO 2020-01-23 01:39:31,128 Creating EtcdStore as the c10d::Store implementation
[ERROR] 2020-01-23 01:39:31,139 coordinator_p2p: Rank: -1
Error: Rank -1 received an Exception. Detailed message: EtcdStore: No constructor defined!
ErrorType: <class 'torchelastic.coordinator.NonRetryableException'>
StackTrace: Traceback (most recent call last):
  File "/opt/miniconda/lib/python3.6/site-packages/torchelastic-0.1.0rc2-py3.6.egg/torchelastic/p2p/coordinator_p2p.py", line 64, in rendezvous_barrier
    self.store, self.rank, self.world_size = self.rendezvous.next_rendezvous()
  File "/opt/miniconda/lib/python3.6/site-packages/torchelastic-0.1.0rc2-py3.6.egg/torchelastic/rendezvous/etcd_rendezvous.py", line 98, in next_rendezvous
    store = self._rdzv_impl.setup_kv_store(rdzv_version)
  File "/opt/miniconda/lib/python3.6/site-packages/torchelastic-0.1.0rc2-py3.6.egg/torchelastic/rendezvous/etcd_rendezvous.py", line 851, in setup_kv_store
    return EtcdStore(etcd_client=self.client, etcd_store_prefix=store_path)
  File "/opt/miniconda/lib/python3.6/site-packages/torchelastic-0.1.0rc2-py3.6.egg/torchelastic/rendezvous/etcd_rendezvous.py", line 865, in __init__
    super().__init__()  # required for pybind trampoline.
TypeError: EtcdStore: No constructor defined!

Should EtcdStore be ready to use now, or are some code updates still needed?
