Comments (1)
Additional context
deepspeed==0.14.2
CSAN detected a possible data race on tensor with data pointer 140108946735104
Access by stream 0 during kernel:
aten::slice.Tensor(Tensor(a) self, int dim=0, SymInt? start=None, SymInt? end=None, SymInt step=1) -> Tensor(a)
writing to argument(s) self, and to the output
With stack trace:
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 903, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1416, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 949, in reduce_independent_p_g_buckets_and_remove_grads
new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
File "/usr/local/lib/python3.9/dist-packages/torch/cuda/_sanitizer.py", line 570, in __torch_dispatch__
errors = self.event_handler._handle_kernel_launch(
File "/usr/local/lib/python3.9/dist-packages/torch/cuda/_sanitizer.py", line 371, in _handle_kernel_launch
stack_trace = traceback.StackSummary.extract(
Previous access by stream 152815408 during kernel:
aten::view(Tensor(a) self, SymInt[] size) -> Tensor(a)
writing to argument(s) self, and to the output
With stack trace:
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 903, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1416, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 932, in reduce_independent_p_g_buckets_and_remove_grads
self.reduce_ipg_grads()
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1367, in reduce_ipg_grads
self.average_tensor(self.ipg_buffer[self.ipg_index].narrow(0, 0, self.elements_in_ipg_bucket))
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1127, in average_tensor
self.allreduce_and_scatter(buckets[bucket_key],
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1031, in allreduce_and_scatter
self.allreduce_and_copy_with_multiple_ranks(small_bucket,
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1005, in allreduce_and_copy_with_multiple_ranks
for buf, synced, bucket_rank in zip(small_bucket, self.unflatten(allreduced, small_bucket), bucket_ranks):
File "/usr/local/lib/python3.9/dist-packages/torch/_utils.py", line 534, in _unflatten_dense_tensors
return torch._C._nn.unflatten_dense_tensors(flat, tensors)
File "/usr/local/lib/python3.9/dist-packages/torch/cuda/_sanitizer.py", line 570, in __torch_dispatch__
errors = self.event_handler._handle_kernel_launch(
File "/usr/local/lib/python3.9/dist-packages/torch/cuda/_sanitizer.py", line 371, in _handle_kernel_launch
stack_trace = traceback.StackSummary.extract(
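For anyone trying to reproduce this report: CSAN ships with PyTorch as a prototype feature and has to be turned on before the first CUDA operation, either via the `TORCH_CUDA_SANITIZER` environment variable or programmatically. A minimal sketch (the tensor work at the end is a stand-in, not DeepSpeed code; the private module path `torch.cuda._sanitizer` is a prototype API and may change):

```python
import torch
import torch.cuda._sanitizer as csan

# Enable CSAN before the first CUDA operation; equivalent to launching
# the process with TORCH_CUDA_SANITIZER=1 set in the environment.
if torch.cuda.is_available():
    csan.enable_cuda_sanitizer()
    # Subsequent CUDA kernels are now checked for cross-stream data races
    # and reported with Python stack traces, as in the log above.
    x = torch.ones(1024, device="cuda")
    y = x * 2
```

Note that CSAN reports *possible* races based on stream-ordering information; a flagged pair of accesses is unsynchronized from CSAN's point of view but may still be ordered by means it cannot see.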
from deepspeed.
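The trace shows the classic shape of this class of bug: the default stream re-slices the flat gradient bucket (`ipg_buffer[...].narrow(...)`) while a second stream still holds views into the same storage from `unflatten_dense_tensors`. As a general illustration of how such cross-stream access is usually ordered (a hypothetical helper, not DeepSpeed's actual fix), the standard tools are `Stream.wait_stream` and `Tensor.record_stream`:

```python
import torch

def use_buffer_on_side_stream(buf: torch.Tensor,
                              side: "torch.cuda.Stream") -> torch.Tensor:
    """Hypothetical helper: run work on `buf` on a side stream without
    racing against the stream that produced (and will reuse) the buffer."""
    main = torch.cuda.current_stream()
    side.wait_stream(main)        # side stream sees all prior writes to buf
    with torch.cuda.stream(side):
        out = buf.view(-1).sum()  # stand-in for the allreduce/unflatten work
        buf.record_stream(side)   # tell the allocator buf is in use on `side`
    main.wait_stream(side)        # main stream must not touch buf until done
    return out

if torch.cuda.is_available():
    side = torch.cuda.Stream()
    buf = torch.ones(1 << 20, device="cuda")
    total = use_buffer_on_side_stream(buf, side)
    torch.cuda.synchronize()
```

Without the two `wait_stream` calls, the `narrow`/`view` pair in the report above is exactly the kind of unsynchronized access CSAN flags.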