[Issue]: 7900 XTX ERROR ring gfx_0.0.0 timeout in PyTorch about rocm HOT 6 CLOSED

madness742 commented on July 17, 2024

[Issue]: 7900 XTX *ERROR* ring gfx_0.0.0 timeout in PyTorch

from rocm.

Comments (6)

ppanchad-amd commented on July 17, 2024

@madness742 Internal ticket has been created to investigate this issue. Thanks!

madness742 commented on July 17, 2024

Hi @ppanchad-amd! I'd like to add that, in addition to the configuration I mentioned in my steps to reproduce section, enabling an upscaler (RealESRGAN 4x+ Anime6B, rescale by 1.5 to 2 depending on available VRAM) combined with Force HiRes (20 steps) causes a system freeze much more frequently. These options can be found under "Refine".

I have updated kernel-firmware-amdgpu after reading a comment by Alex Deucher. Currently on version 20240618-1.1. I have also updated to ROCm 6.1.3, but even when following the suggestion of keeping some VRAM available, the freezes still occur.

They happen much less often, but they do still happen. One even occurred right after waking the system up: it had successfully generated a batch (count) of 100 images and then gone into sleep mode.

I hope this additional information can be of use during the investigation!

jamesxu2 commented on July 17, 2024

Hello @madness742, thanks for expanding on your configuration. I used your original issue to attempt to reproduce #2935 and have retried it with the Upscaler + Force HiRes options enabled, using SD.Next.

I do notice frequent out-of-memory errors and am running the automatic SD.Next webui with the --lowvram option as a result. With your configuration, I generated 100+ images continuously over the course of several hours and didn't encounter your crash.

System Configuration:

RX7900XT
Ubuntu 22.04
Both Torch 2.5.0+rocm6.1 and Torch 2.3.1+rocm5.7

I have some follow-ups:

Assuming this issue may be VRAM related, have you tried running with the --lowvram option?
What's the approximate frequency that crashes occur on your machine? How many images with the upscaler + ForceHiRes option are generated before encountering a crash?

madness742 commented on July 17, 2024

Hi @jamesxu2, I've been extensively using SD.Next since your message.

I tried lowering the VRAM usage by setting different generation parameters. On average it used 72% of the total VRAM during a 100 batch count generation. I even tried generating 10 pictures (1024x1024) at the same time using the batch size option. I haven't encountered a crash/freeze so far on this configuration after updating SD.Next and the host system (openSUSE Tumbleweed). Currently on snapshot 20240629.
With the previously mentioned configuration (1024x1024, Forced HiRes, Upscaler) it would either crash instantly upon hitting generate, or within 10 minutes. I was generating one picture at a time. It's very random as to when it crashes the whole system.

The GPU stats during the 100 batch count generation:

The logs after the system runs out of VRAM and hard freezes:

Jul 01 07:23:51 localhost.localdomain kwin_wayland[127968]: kwin_core: Cannot grant a token to KWin::ClientConnection(0x5618130c52a0)
Jul 01 07:24:35 localhost.localdomain kwin_wayland[127968]: kwin_libinput: Libinput: event2  - Compx Pulsar Xlite Wireless: client bug: event processing lagging behind by 30ms, your system is too slow
Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:46 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Page flip failed: Cannot allocate memory
Jul 01 07:24:46 localhost.localdomain kernel: amdgpu 0000:03:00.0: amdgpu: 00000000591fde63 pin failed
Jul 01 07:24:46 localhost.localdomain kernel: [drm:amdgpu_dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Jul 01 07:24:46 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Presentation failed! Cannot allocate memory
Jul 01 07:24:48 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:48 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:48 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:59 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Page flip failed: Cannot allocate memory
Jul 01 07:25:59 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Presentation failed! Cannot allocate memory
Jul 01 07:25:59 localhost.localdomain kernel: amdgpu 0000:03:00.0: amdgpu: 00000000591fde63 pin failed
Jul 01 07:25:59 localhost.localdomain kernel: [drm:amdgpu_dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Jul 01 07:25:59 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
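For anyone comparing their own journal against the excerpt above, here is a minimal Python sketch that tallies the amdgpu `*ERROR*` lines by kernel function and message. The function name `summarize_amdgpu_errors` and the regular expression are illustrative assumptions, not part of any ROCm tooling; the sample lines are copied from the log above. Note that error -12 in the "Failed to pin framebuffer" message is `-ENOMEM` on Linux, which matches the out-of-memory picture.

```python
import re
from collections import Counter

# Matches journal lines of the form:
#   "<timestamp> <host> kernel: [drm:<function> [amdgpu]] *ERROR* <message>"
ERROR_RE = re.compile(r"kernel: \[drm:(\w+) \[amdgpu\]\] \*ERROR\* (.+)$")


def summarize_amdgpu_errors(lines):
    """Count amdgpu *ERROR* lines, keyed by (kernel function, message)."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line.rstrip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# A few lines copied from the log excerpt in this thread.
sample = [
    "Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!",
    "Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!",
    "Jul 01 07:24:46 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Page flip failed: Cannot allocate memory",
    "Jul 01 07:24:46 localhost.localdomain kernel: [drm:amdgpu_dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12",
]

for (function, message), count in summarize_amdgpu_errors(sample).most_common():
    print(f"{count:3d}  {function}: {message}")
```

Feeding it the output of `journalctl -k` line by line should work the same way, assuming the journal uses the same line layout as the excerpt.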

Last week SD.Next received a fix for memory exceptions on ROCm. I updated to that version a couple of days ago, and it has made a big difference. Other changes I've made were to set the GPU power profile to COMPUTE (from BOOTUP_DEFAULT) and raise the power limit to 402000000 (from 339000000). I've also switched the vBios on the GPU to overclock mode using the physical switch on the card.
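The power limit values quoted here (402000000 and 339000000) are consistent with microwatts, the unit the Linux hwmon interface uses for `power1_cap`, which would make them 402 W and 339 W. The sketch below reads the current and maximum cap from an amdgpu hwmon directory and converts them to watts. It assumes the limit was set through hwmon (the post does not say how); the helper names and the example path are hypothetical, and the hwmon index varies per system.

```python
from pathlib import Path


def microwatts_to_watts(value_uw):
    """Convert a hwmon power value (microwatts) to watts."""
    return value_uw / 1_000_000


def read_power_caps_watts(hwmon_dir):
    """Return (current_cap_W, max_cap_W) read from an amdgpu hwmon directory.

    Assumes the directory exposes power1_cap and power1_cap_max,
    both reported in microwatts.
    """
    hwmon = Path(hwmon_dir)
    cap_uw = int((hwmon / "power1_cap").read_text())
    max_uw = int((hwmon / "power1_cap_max").read_text())
    return microwatts_to_watts(cap_uw), microwatts_to_watts(max_uw)


# The two values quoted in this thread, interpreted as microwatts.
print(microwatts_to_watts(339000000), "W ->", microwatts_to_watts(402000000), "W")

# Hypothetical usage; replace hwmonN with the index on your system:
# print(read_power_caps_watts("/sys/class/drm/card1/device/hwmon/hwmonN"))
```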

I have not tested whether the changes above also help when Ubuntu 22.04 is installed as the host. I also haven't tested A1111/ComfyUI.

Host configuration:

openSUSE Tumbleweed
Kernel: 6.9.7-1-default
AMD Sapphire 7900 XTX Nitro+

Container configuration:

Ubuntu 22.04
Python: 3.11.9
Torch 2.5.0.dev20240621
ROCm 6.1 (6.1.60103-1)
SD.Next (2024-06-24)

It hasn't crashed once since I made all those changes, in combination with not exceeding 16-17 GB of VRAM usage. In the past it would still randomly crash despite relatively moderate VRAM usage.
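To keep an eye on that VRAM ceiling while generating, one option is to poll the counters amdgpu exposes in sysfs. The following is a rough sketch, assuming the device directory contains `mem_info_vram_used` and `mem_info_vram_total` (both in bytes). The function names and the 16 GiB default budget are illustrative choices, and `card1` is simply the card index that appears elsewhere in this post.

```python
from pathlib import Path

GIB = 1024 ** 3


def vram_usage(device_dir):
    """Return (used_bytes, total_bytes, used_fraction) for an amdgpu device."""
    dev = Path(device_dir)
    used = int((dev / "mem_info_vram_used").read_text())
    total = int((dev / "mem_info_vram_total").read_text())
    return used, total, used / total


def over_budget(used_bytes, budget_gib=16.0):
    """True if VRAM usage exceeds the chosen budget (an assumed threshold)."""
    return used_bytes > budget_gib * GIB


# Hypothetical usage on a live system:
# used, total, fraction = vram_usage("/sys/class/drm/card1/device")
# print(f"{used / GIB:.1f} / {total / GIB:.1f} GiB ({fraction:.0%})")
# if over_budget(used):
#     print("VRAM usage is above the budget; consider lowering batch size")
```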

cat /sys/class/drm/card1/device/pp_power_profile_mode 
PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
 0 BOOTUP_DEFAULT :
                    0(       GFXCLK)       0       1       0       4     800 4587520  -65536       0
                    1(         FCLK)       0       3       0       1       0 3276800  -65536   -6553
 1 3D_FULL_SCREEN :
                    0(       GFXCLK)       0       0    1200       4     650 3932160   -3276  -65536
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
 2   POWER_SAVING :
                    0(       GFXCLK)       0       1       0       3       0 5898240  -65536       0
                    1(         FCLK)       0       1       0       1       0 3407872  -65536   -6553
 3          VIDEO :
                    0(       GFXCLK)       0       1       0       4     500 4587520  -65536       0
                    1(         FCLK)       0       3       0       3       0 3473408  -65536   -6553
 4             VR :
                    0(       GFXCLK)       0       2    1000       1       0 3276800       0       0
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
 5        COMPUTE*:
                    0(       GFXCLK)       0       2    1000       1       0 3932160       0       0
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
 6         CUSTOM :
                    0(       GFXCLK)       0       0    1200       4       0  655360   -3276  -65536
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
 7      WINDOW_3D :
                    0(       GFXCLK)       0       0    1200       4     650 5242880   -3276  -65536
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
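In the output above, the active profile is the one marked with `*` (here `COMPUTE`). A small Python sketch for extracting the profile list and the active index from that text follows. The parser is an illustration written against the layout shown in this post; other kernels or GPUs may print a different layout, and `parse_power_profiles` is a hypothetical helper name.

```python
import re

# Profile header lines look like " 0 BOOTUP_DEFAULT :" or " 5        COMPUTE*:".
# Clock rows such as "0(       GFXCLK) ..." do not match because the index
# is not followed by whitespace there.
PROFILE_RE = re.compile(r"^\s*(\d+)\s+(\w+)\s*(\*?)\s*:\s*$")


def parse_power_profiles(text):
    """Return ({index: name}, active_index) from pp_power_profile_mode output."""
    profiles = {}
    active = None
    for line in text.splitlines():
        match = PROFILE_RE.match(line)
        if match:
            index = int(match.group(1))
            profiles[index] = match.group(2)
            if match.group(3):
                active = index
    return profiles, active


# Abbreviated sample taken from the output in this post.
sample_output = """\
PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS MinActiveFreqType MinActiveFreq
 0 BOOTUP_DEFAULT :
                    0(       GFXCLK)       0       1       0       4     800 4587520  -65536       0
 1 3D_FULL_SCREEN :
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
 5        COMPUTE*:
                    0(       GFXCLK)       0       2    1000       1       0 3932160       0       0
"""

profiles, active = parse_power_profiles(sample_output)
print("active profile:", active, profiles.get(active))
```

On a live system the same function could be fed the contents of `/sys/class/drm/card1/device/pp_power_profile_mode`. Switching profiles is typically done by writing to that file as root; check the amdgpu kernel documentation for the exact procedure on your kernel.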

jamesxu2 commented on July 17, 2024

@madness742 I'm glad to hear those changes have mitigated your crashes. Given that you aren't seeing any more crashes in your current configuration, can this issue be closed?

madness742 commented on July 17, 2024

Hi @jamesxu2, I'll go ahead and close this issue.

ComfyUI is working fine as well (generated 300 pics) after the power profile and power limit adjustment, as long as VRAM usage doesn't spike, which causes an OOM and a freeze. I was unable to test A1111 as it required much more VRAM than the other two and I had trouble loading an SDXL model at the recommended 1024x1024 resolution.

Interestingly, setting it to COMPUTE also solves an issue in a specific game with VRR on a dual monitor setup. Probably not related, but might be worth mentioning.
