Comments (6)
@madness742 Internal ticket has been created to investigate this issue. Thanks!
from rocm.
Hi @ppanchad-amd! I'd like to add that, in addition to the configuration I mentioned in my steps to reproduce section, enabling an upscaler (RealESRGAN 4x+ Anime6B, rescale by 1.5 - 2 depending on available VRAM) combined with Force HiRes (20 steps) causes a system freeze much more frequently. These options can be found under "Refine".
I have updated kernel-firmware-amdgpu after reading a comment by Alex Deucher and am currently on version 20240618-1.1. I have also updated to ROCm 6.1.3, but even when following the suggestion to keep some VRAM available, the freezes still occur, although much less often. One even occurred after waking the system from sleep, which it had entered after successfully generating a batch (count) of 100 images.
I hope this additional information can be of use during the investigation!
Hello @madness742, thanks for expanding on your configuration. I used your original issue to attempt to reproduce #2935 and have retried it with the Upscaler + Force HiRes options enabled, using SD.Next.
I do notice frequent out-of-memory errors and am running the automatic SD.Next webui with the --lowvram option as a result. With your configuration, I generated 100+ images continuously over the course of several hours and didn't encounter your crash.
System Configuration:
- RX7900XT
- Ubuntu 22.04
- Both Torch 2.5.0+rocm6.1 and Torch 2.3.1+rocm5.7
I have some follow-ups:
- Assuming this issue may be VRAM related, have you tried running with the --lowvram option?
- Approximately how often do crashes occur on your machine? How many images are generated with the Upscaler + Force HiRes options before you encounter a crash?
Hi @jamesxu2, I've been extensively using SD.Next since your message.
- I tried lowering the VRAM usage by setting different generation parameters. On average it used 72% of the total VRAM during a 100 batch count generation. I even tried generating 10 pictures (1024x1024) at the same time using the batch size option. I haven't encountered a crash/freeze so far on this configuration after updating SD.Next and the host system (openSUSE Tumbleweed, currently on snapshot 20240629).
- With the mentioned configuration (1024x1024, Force HiRes, Upscaler) it would either crash instantly upon hitting generate, or within 10 minutes. I was generating one picture at a time. When it crashes the whole system is quite random.
The GPU stats during the 100 batch count generation:
The logs after the system runs out of VRAM and hard freezes:
Jul 01 07:23:51 localhost.localdomain kwin_wayland[127968]: kwin_core: Cannot grant a token to KWin::ClientConnection(0x5618130c52a0)
Jul 01 07:24:35 localhost.localdomain kwin_wayland[127968]: kwin_libinput: Libinput: event2 - Compx Pulsar Xlite Wireless: client bug: event processing lagging behind by 30ms, your system is too slow
Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:46 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Page flip failed: Cannot allocate memory
Jul 01 07:24:46 localhost.localdomain kernel: amdgpu 0000:03:00.0: amdgpu: 00000000591fde63 pin failed
Jul 01 07:24:46 localhost.localdomain kernel: [drm:amdgpu_dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Jul 01 07:24:46 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Presentation failed! Cannot allocate memory
Jul 01 07:24:48 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:48 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:48 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:59 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Page flip failed: Cannot allocate memory
Jul 01 07:25:59 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Presentation failed! Cannot allocate memory
Jul 01 07:25:59 localhost.localdomain kernel: amdgpu 0000:03:00.0: amdgpu: 00000000591fde63 pin failed
Jul 01 07:25:59 localhost.localdomain kernel: [drm:amdgpu_dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Jul 01 07:25:59 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
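To compare runs, the failure signatures in the log excerpt above can be tallied with a short script. This is only a sketch, assuming the journal has been exported to plain text lines; the function name `count_amdgpu_oom` and the labels `cs_oom`, `pin_failed` and `page_flip` are made up for illustration and are not part of any tool.

```python
import re
from collections import Counter

# Message fragments copied from the log excerpt above; the labels are illustrative.
PATTERNS = {
    "cs_oom": re.compile(r"amdgpu_cs_ioctl.*Not enough memory for command submission"),
    "pin_failed": re.compile(r"Failed to pin framebuffer with error -12"),
    "page_flip": re.compile(r"Page flip failed: Cannot allocate memory"),
}


def count_amdgpu_oom(lines):
    """Count how many lines match each out-of-memory signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Passing an open text file (or any iterable of lines) to `count_amdgpu_oom` returns a `Counter`, so a run with many `cs_oom` hits shortly before a freeze can be told apart from a clean run.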
Last week SD.Next received a fix for memory exceptions on ROCm. I updated to that version a couple of days ago, and it has made a big difference. The other changes I've made were to set the GPU power profile to COMPUTE (from BOOTUP_DEFAULT) and raise the power limit to 402000000 (from 339000000). I've also changed the vBIOS on the GPU to overclock mode by pushing a physical switch on my GPU.
I have not tested whether the changes above also help when Ubuntu 22.04 is installed as the host. I also haven't tested A1111/ComfyUI.
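The thread does not say which tool was used for the power profile and power limit changes. One way to sketch them is through amdgpu's sysfs files, assuming the `/sys/class/drm/card1/device` path that appears in this comment, that the limit values are microwatts as used by hwmon's `power1_cap`, and that the performance level has to be `manual` before a profile index is accepted (as the kernel's amdgpu documentation describes). The helper names below are hypothetical.

```python
from pathlib import Path


def plan_power_writes(device_dir, profile_index, power_cap_uw):
    """Return (path, value) pairs in the order they should be written.

    Nothing is written here, so the plan can be inspected first.
    """
    device = Path(device_dir)
    writes = [
        # Assumption: the profile can only be selected in "manual" mode.
        (device / "power_dpm_force_performance_level", "manual"),
        (device / "pp_power_profile_mode", str(profile_index)),
    ]
    # The hwmon directory number differs between systems, so discover it.
    for cap in sorted(device.glob("hwmon/hwmon*/power1_cap")):
        writes.append((cap, str(power_cap_uw)))
    return writes


def apply_writes(writes):
    """Perform the planned writes; on a real system this needs root."""
    for path, value in writes:
        Path(path).write_text(value)


def microwatts_to_watts(microwatts):
    """Convert a hwmon power value in microwatts to watts."""
    return microwatts / 1_000_000
```

With the numbers from this comment, `plan_power_writes("/sys/class/drm/card1/device", 5, 402000000)` would select profile index 5 (COMPUTE in the listing in this comment) and request a 402 W cap, up from 339 W. These settings normally do not persist across reboots.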
Host configuration:
- openSUSE Tumbleweed
- Kernel: 6.9.7-1-default
- AMD Sapphire 7900 XTX Nitro+
Container configuration:
- Ubuntu 22.04
- Python: 3.11.9
- Torch 2.5.0.dev20240621
- ROCm 6.1 (6.1.60103-1)
- SD.Next (2024-06-24)
It hasn't crashed once since I made all those changes, in combination with not exceeding 16-17 GB of VRAM usage. In the past it would still randomly crash despite relatively moderate VRAM usage.
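Staying under a VRAM ceiling like this can be checked from a script. The sketch below assumes the amdgpu sysfs attributes `mem_info_vram_used` and `mem_info_vram_total`, which report bytes, under a device directory such as `/sys/class/drm/card1/device`. The function names and the default 16 GiB budget are illustrative, and treating "16 GB" as GiB is an assumption.

```python
from pathlib import Path


def vram_usage(device_dir):
    """Return (used_bytes, total_bytes) read from amdgpu's sysfs counters."""
    device = Path(device_dir)
    used = int((device / "mem_info_vram_used").read_text())
    total = int((device / "mem_info_vram_total").read_text())
    return used, total


def over_budget(used_bytes, budget_gib=16.0):
    """True when usage exceeds the budget (default: the lower bound mentioned above)."""
    return used_bytes > budget_gib * 1024**3
```

Polling `vram_usage` between generations and pausing or lowering the batch size when `over_budget` returns True is one way to keep some headroom; it only observes usage and does not prevent an allocation spike within a single generation.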
cat /sys/class/drm/card1/device/pp_power_profile_mode
PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
0 BOOTUP_DEFAULT :
0( GFXCLK) 0 1 0 4 800 4587520 -65536 0
1( FCLK) 0 3 0 1 0 3276800 -65536 -6553
1 3D_FULL_SCREEN :
0( GFXCLK) 0 0 1200 4 650 3932160 -3276 -65536
1( FCLK) 0 3 0 3 0 1310720 -6553 -6553
2 POWER_SAVING :
0( GFXCLK) 0 1 0 3 0 5898240 -65536 0
1( FCLK) 0 1 0 1 0 3407872 -65536 -6553
3 VIDEO :
0( GFXCLK) 0 1 0 4 500 4587520 -65536 0
1( FCLK) 0 3 0 3 0 3473408 -65536 -6553
4 VR :
0( GFXCLK) 0 2 1000 1 0 3276800 0 0
1( FCLK) 0 3 0 3 0 1310720 -6553 -6553
5 COMPUTE*:
0( GFXCLK) 0 2 1000 1 0 3932160 0 0
1( FCLK) 0 3 0 3 0 1310720 -6553 -6553
6 CUSTOM :
0( GFXCLK) 0 0 1200 4 0 655360 -3276 -65536
1( FCLK) 0 3 0 3 0 1310720 -6553 -6553
7 WINDOW_3D :
0( GFXCLK) 0 0 1200 4 650 5242880 -3276 -65536
1( FCLK) 0 3 0 3 0 1310720 -6553 -6553
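In the listing above the active profile is the one marked with `*` (here `COMPUTE`). A small parser can confirm that a change took effect. This is a sketch that assumes the layout shown above, which may differ between GPU generations and kernel versions; `active_power_profile` is a hypothetical helper name.

```python
import re


def active_power_profile(text):
    """Return (index, name) of the profile marked with '*', or None if none is marked."""
    for line in text.splitlines():
        # Profile header lines look like " 5 COMPUTE*:"; clock rows start with "0(" or "1(".
        match = re.match(r"\s*(\d+)\s+([A-Z0-9_]+)\s*\*\s*:", line)
        if match:
            return int(match.group(1)), match.group(2)
    return None
```

Feeding it the contents of `pp_power_profile_mode` as a string should yield the index and name of the starred profile, which makes it easy to check after a reboot whether the profile fell back to the default.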
@madness742 I'm glad to hear those changes have mitigated your crashes. Given that you aren't seeing any more crashes in your current configuration, can this issue be closed?
Hi @jamesxu2, I'll go ahead and close this issue.
ComfyUI is working fine as well (generated 300 pics) after the power plan and power limit adjustment, as long as my VRAM usage doesn't spike, which causes an OOM and a freeze. I was unable to test A1111 as it required much more VRAM than the other two and I had trouble loading an SDXL model at the recommended 1024x1024 resolution.
Interestingly, setting the power profile to COMPUTE also solves an issue in a specific game when it comes to VRR on a dual monitor setup. Probably not related, but it might be worth mentioning.