
Comments (6)

ronny1996 commented on May 16, 2024

Please ask your question

stream_safe_custom_device_allocator calls MarkAsWillBeFreed before it frees device memory:

stream_safe_cuda_allocation->MarkAsWillBeFreed();

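If the stream that owns the current allocation has no event bound to it, this function creates a new event and records it on that stream. (The way MarkAsWillBeFreed updates the will_be_freed_ flag also looks wrong: will_be_freed_ always stays false.)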

void StreamSafeCustomDeviceAllocation::MarkAsWillBeFreed() {

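What I don't understand is what the call to MarkAsWillBeFreed is for, and whether it is necessary. As I understand it, before freeing device memory it is enough to make sure that all existing events have completed. Could adding a new event degrade performance or even cause functional errors? The XPU stream safe allocator decides whether memory can be freed directly by querying the status of all events in CanBeFreed. The GPU stream safe allocator's CanBeFreed does contain a record-stream operation, but it is applied to the streams in graph_capturing_stream_set_, and that set seems to be empty in the usual case?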

if (UNLIKELY(phi::backends::gpu::CUDAGraph::IsThisThreadCapturing())) {

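In practice, with the stream safe allocator enabled for a custom device, I ran into two bugs:
1. In a multi-process scenario, when the CPU allocator and the custom device allocator both release storage at about the same time (the custom device slightly earlier than the CPU), the event recorded by the CPU allocator through MarkAsWillBeFreed is null (my guess is that it was deleted when the custom device allocator called CanBeFreed), so the later query in CanBeFreed fails with an error.
2. A model that runs fine without the stream safe allocator can hit OOM once it is enabled, possibly because the extra event added by MarkAsWillBeFreed keeps device memory from being released in time.
Both bugs went away after I removed the call to MarkAsWillBeFreed. So I would like to ask the Paddle developers: can the call to MarkAsWillBeFreed be removed?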

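Hi, the custom stream safe allocator implements the non-CUDA-graph case, i.e. the case where phi::backends::gpu::CUDAGraph::IsThisThreadCapturing() = false. MarkAsWillBeFreed can indeed be removed, and for streams that are already in outstanding_event_map_ the corresponding event should be reused. We will open a PR to change this.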
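The free path described in this thread, where memory is released only after every already-recorded event has completed and no extra event is created at free time, might look roughly like the sketch below. It is an illustration only and not Paddle's implementation: `FakeEvent`, `EventTrackedAllocation`, `Track`, and `PendingEvents` are hypothetical names, and only `CanBeFreed` and `outstanding_event_map_` are borrowed from the discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-in for a device event; not a Paddle type.
struct FakeEvent {
  bool done = false;                   // flipped by the caller to simulate completion
  bool Query() const { return done; }  // non-blocking status check
};

// Simplified model of a stream-safe allocation (assumption, not Paddle code).
class EventTrackedAllocation {
 public:
  // Remember an event that was already recorded on the stream `stream_id`.
  void Track(int stream_id, std::shared_ptr<FakeEvent> event) {
    outstanding_event_map_[stream_id] = std::move(event);
  }

  // Returns true once every tracked event has completed. Completed events
  // are erased so they are not queried again. No new event is created here.
  bool CanBeFreed() {
    for (auto it = outstanding_event_map_.begin();
         it != outstanding_event_map_.end();) {
      if (!it->second->Query()) return false;  // still in use on that stream
      it = outstanding_event_map_.erase(it);
    }
    return true;
  }

  std::size_t PendingEvents() const { return outstanding_event_map_.size(); }

 private:
  std::map<int, std::shared_ptr<FakeEvent>> outstanding_event_map_;
};
```

In this model an allocation with no tracked events is freeable immediately, which matches the point above that only existing events need to finish before memory is released.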

from paddle.

ronny1996 commented on May 16, 2024

PR #63369 fixed this.


continue-coding commented on May 16, 2024

@ronny1996 Could the developer who implemented stream_safe_custom_device_allocator help answer this? Thanks :)


continue-coding commented on May 16, 2024

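I have one more question. StreamSafeCustomDeviceAllocation::RecordStream only records an event for streams that are not yet in outstanding_event_map_. For streams that are already in outstanding_event_map_, shouldn't it reuse the corresponding event and record it again?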

if (it == outstanding_event_map_.end()) {
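The reuse asked about here can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from Paddle or from the fixing PR: `FakeStream`, `RecordableEvent`, `ReusingAllocation`, `EventCount`, and `RecordCountFor` are hypothetical names, and the event simply counts how many times it was recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>

// Hypothetical stand-ins for device runtime objects; not Paddle types.
struct FakeStream {
  int id;
};

struct RecordableEvent {
  int record_count = 0;  // number of times Record() was called
  void Record(const FakeStream&) { ++record_count; }
};

// Simplified model of an allocation that keeps one event per stream.
class ReusingAllocation {
 public:
  void RecordStream(const FakeStream& stream) {
    auto it = outstanding_event_map_.find(stream.id);
    if (it == outstanding_event_map_.end()) {
      // First use on this stream: create an event and record it.
      auto event = std::make_shared<RecordableEvent>();
      event->Record(stream);
      outstanding_event_map_[stream.id] = event;
    } else {
      // Stream already known: reuse its event and record it again so it
      // covers the most recent work submitted on that stream.
      it->second->Record(stream);
    }
  }

  std::size_t EventCount() const { return outstanding_event_map_.size(); }

  // Record count of the event bound to `stream_id`, or 0 if there is none.
  int RecordCountFor(int stream_id) const {
    auto it = outstanding_event_map_.find(stream_id);
    return it == outstanding_event_map_.end() ? 0 : it->second->record_count;
  }

 private:
  std::map<int, std::shared_ptr<RecordableEvent>> outstanding_event_map_;
};
```

With this shape, repeated calls for the same stream never grow the map; they only refresh the existing event.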


continue-coding commented on May 16, 2024


Thanks for the reply, my question has been answered :)


continue-coding commented on May 16, 2024


Thank you very much! This issue can be closed.
