
Comments (6)

ppanchad-amd commented on July 17, 2024

@madness742 Internal ticket has been created to investigate this issue. Thanks!

from rocm.

madness742 commented on July 17, 2024

Hi @ppanchad-amd! I'd like to add that, in addition to the configuration I mentioned in my steps to reproduce section, enabling an upscaler (RealESRGAN 4x+ Anime6B, rescale by 1.5 - 2, depending on available VRAM) combined with Force HiRes (20 steps) will cause a system freeze much more frequently. These options can be found under "Refine".

[image]

I have updated kernel-firmware-amdgpu after reading a comment by Alex Deucher. I am currently on version 20240618-1.1. I have also updated to ROCm 6.1.3, but even when following the suggestion of trying to keep some VRAM available, the freezes still occur.

They happen much less often, but they have not gone away. One even occurred after waking the system from sleep mode, which it had entered after successfully generating a batch (count) of 100 images.

I hope this additional information can be of use during the investigation!


jamesxu2 commented on July 17, 2024

Hello @madness742, thanks for expanding on your configuration. I used your original issue to attempt to reproduce #2935 and have retried it with the Upscaler + Force HiRes options enabled, using SD.Next.

I do notice frequent out-of-memory errors and am running the automatic SD.Next webui with the --lowvram option as a result. With your configuration, I generated 100+ images continuously over the course of several hours and didn't encounter your crash.

System Configuration:

  • RX7900XT
  • Ubuntu 22.04
  • Both Torch 2.5.0+rocm6.1 and Torch 2.3.1+rocm5.7

I have some follow-ups:

  1. Assuming this issue may be VRAM related, have you tried running with the --lowvram option?
  2. Approximately how often do crashes occur on your machine? How many images are generated with the upscaler + Force HiRes options before you encounter a crash?


madness742 commented on July 17, 2024

Hi @jamesxu2, I've been extensively using SD.Next since your message.

  1. I tried lowering the VRAM usage by setting different generation parameters. On average it used 72% of the total VRAM during a 100 batch count generation. I even tried generating 10 pictures (1024x1024) at the same time using the batch size option. I haven't encountered a crash/freeze so far on this configuration after updating SD.Next and the host system (openSUSE Tumbleweed). Currently on snapshot 20240629.
  2. With the mentioned configuration (1024x1024, Force HiRes, Upscaler) it would either crash instantly upon hitting generate, or within 10 minutes. I was generating one picture at a time. The timing of the crash, which takes down the whole system, is very random.

The GPU stats during the 100 batch count generation:
[image]

The logs after the system runs out of VRAM and hard freezes:

Jul 01 07:23:51 localhost.localdomain kwin_wayland[127968]: kwin_core: Cannot grant a token to KWin::ClientConnection(0x5618130c52a0)
Jul 01 07:24:35 localhost.localdomain kwin_wayland[127968]: kwin_libinput: Libinput: event2  - Compx Pulsar Xlite Wireless: client bug: event processing lagging behind by 30ms, your system is too slow
Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:40 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:46 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Page flip failed: Cannot allocate memory
Jul 01 07:24:46 localhost.localdomain kernel: amdgpu 0000:03:00.0: amdgpu: 00000000591fde63 pin failed
Jul 01 07:24:46 localhost.localdomain kernel: [drm:amdgpu_dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Jul 01 07:24:46 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Presentation failed! Cannot allocate memory
Jul 01 07:24:48 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:48 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:24:48 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:08 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
Jul 01 07:25:59 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Page flip failed: Cannot allocate memory
Jul 01 07:25:59 localhost.localdomain kwin_wayland[127968]: kwin_wayland_drm: Presentation failed! Cannot allocate memory
Jul 01 07:25:59 localhost.localdomain kernel: amdgpu 0000:03:00.0: amdgpu: 00000000591fde63 pin failed
Jul 01 07:25:59 localhost.localdomain kernel: [drm:amdgpu_dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Jul 01 07:25:59 localhost.localdomain kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
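
To compare a log excerpt like the one above between boots, the distinct failure messages can be counted. The following is only a minimal sketch: the regular expressions are copied from the messages in this excerpt, while the category labels and the `summarize` function name are made up for illustration. Error -12 in the framebuffer message corresponds to ENOMEM.

```python
import re
from collections import Counter

# Patterns are taken from the messages in the log excerpt above.
# The category labels on the left are hypothetical names, not amdgpu terms.
PATTERNS = {
    "cs_oom": re.compile(r"amdgpu_cs_ioctl.*Not enough memory for command submission"),
    "pin_failed": re.compile(r"amdgpu: \S+ pin failed"),
    "fb_pin_enomem": re.compile(r"Failed to pin framebuffer with error -12"),
    "kwin_alloc": re.compile(r"kwin_wayland_drm: .*Cannot allocate memory"),
}


def summarize(lines):
    """Count how many log lines fall into each failure category."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break  # a line is counted under at most one category
    return counts
```

One way to use it is to save the previous boot's journal to a file (for example with `journalctl -b -1 > prev_boot.log`) and pass the open file to `summarize`.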

Last week SD.Next received a fix for memory exceptions on ROCm. I updated to that version a couple of days ago, and it has made a big difference. The other changes I've made were to set the GPU power profile to COMPUTE (from BOOTUP_DEFAULT) and to raise the power limit to 402000000 (from 339000000). I've also switched the vBIOS on the GPU to overclock mode using the physical switch on the card.
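
For readers who want to try the same power settings, here is a rough sketch of the sysfs writes involved. It assumes the amdgpu interface in which `power_dpm_force_performance_level` must be set to `manual` before a profile index is written to `pp_power_profile_mode`, and in which the hwmon `power1_cap` value is in microwatts (so 402000000 is 402 W). The `card1` and `hwmon0` paths, the helper names, and the profile index 5 (COMPUTE on this particular card) are assumptions that will differ between systems. The function only builds the list of writes; applying them needs root.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def microwatts_to_watts(value_uw: int) -> float:
    """Convert a hwmon power value (microwatts) to watts."""
    return value_uw / MICROWATTS_PER_WATT


def planned_writes(device: Path, hwmon: Path, profile_index: int, cap_uw: int):
    """Return (path, value) pairs in the order they would be written.

    Nothing is written here; this helper is hypothetical and exists so the
    plan can be inspected before running anything as root.
    """
    return [
        # Manual mode has to be selected before choosing a profile.
        (device / "power_dpm_force_performance_level", "manual"),
        (device / "pp_power_profile_mode", str(profile_index)),
        (hwmon / "power1_cap", str(cap_uw)),
    ]


# Assumed paths: the card and hwmon numbers vary per machine.
device = Path("/sys/class/drm/card1/device")
hwmon = device / "hwmon" / "hwmon0"

for path, value in planned_writes(device, hwmon, profile_index=5, cap_uw=402_000_000):
    print(f"{path} <- {value}")
```

These settings typically do not persist across a reboot, so they would need to be reapplied at startup.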

I have not tested whether the changes above also help when Ubuntu 22.04 is installed as the host. I also haven't tested A1111/ComfyUI.

Host configuration:

  • openSUSE Tumbleweed
  • Kernel: 6.9.7-1-default
  • AMD Sapphire 7900 XTX Nitro+

Container configuration:

  • Ubuntu 22.04
  • Python: 3.11.9
  • Torch 2.5.0.dev20240621
  • ROCm 6.1 (6.1.60103-1)
  • SD.Next (2024-06-24)

It hasn't crashed once since I made all those changes, in combination with not exceeding 16-17 GB of VRAM usage. In the past it would still randomly crash despite the relatively moderate VRAM usage.
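
A small watchdog-style check can make it easier to stay under a VRAM ceiling like the one mentioned above. This is a sketch under assumptions: it reads the amdgpu sysfs files `mem_info_vram_used` and `mem_info_vram_total` (both in bytes), the `card1` path is taken from the `cat` command in this comment and may differ elsewhere, and the 17 GiB budget and function names are illustrative choices rather than anything prescribed by ROCm or SD.Next.

```python
from pathlib import Path

GIB = 1024 ** 3


def vram_usage(device: Path):
    """Return (used_bytes, total_bytes) read from the amdgpu sysfs files."""
    used = int((device / "mem_info_vram_used").read_text())
    total = int((device / "mem_info_vram_total").read_text())
    return used, total


def over_budget(used_bytes: int, budget_gib: float = 17.0) -> bool:
    """True when usage exceeds the chosen budget (an assumed 17 GiB here)."""
    return used_bytes > budget_gib * GIB


if __name__ == "__main__":
    device = Path("/sys/class/drm/card1/device")  # assumed card index
    if device.exists():
        used, total = vram_usage(device)
        print(f"VRAM: {used / GIB:.1f} / {total / GIB:.1f} GiB "
              f"({100 * used / total:.0f}%), over budget: {over_budget(used)}")
```

Running this periodically during a long batch would show how close a given set of generation parameters gets to the limit.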

cat /sys/class/drm/card1/device/pp_power_profile_mode 
PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
 0 BOOTUP_DEFAULT :
                    0(       GFXCLK)       0       1       0       4     800 4587520  -65536       0
                    1(         FCLK)       0       3       0       1       0 3276800  -65536   -6553
 1 3D_FULL_SCREEN :
                    0(       GFXCLK)       0       0    1200       4     650 3932160   -3276  -65536
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
 2   POWER_SAVING :
                    0(       GFXCLK)       0       1       0       3       0 5898240  -65536       0
                    1(         FCLK)       0       1       0       1       0 3407872  -65536   -6553
 3          VIDEO :
                    0(       GFXCLK)       0       1       0       4     500 4587520  -65536       0
                    1(         FCLK)       0       3       0       3       0 3473408  -65536   -6553
 4             VR :
                    0(       GFXCLK)       0       2    1000       1       0 3276800       0       0
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
 5        COMPUTE*:
                    0(       GFXCLK)       0       2    1000       1       0 3932160       0       0
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
 6         CUSTOM :
                    0(       GFXCLK)       0       0    1200       4       0  655360   -3276  -65536
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
 7      WINDOW_3D :
                    0(       GFXCLK)       0       0    1200       4     650 5242880   -3276  -65536
                    1(         FCLK)       0       3       0       3       0 1310720   -6553   -6553
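
The active profile in a dump like the one above is the entry marked with `*`. As a minimal sketch, the following parses that text and reports the profile names and the active index. The regular expression is written against the layout shown here (RX 7900 XTX, kernel 6.9.7); other GPUs or kernel versions may print a different format, and the function name is hypothetical.

```python
import re

# Matches profile header lines such as " 0 BOOTUP_DEFAULT :" or
# " 5        COMPUTE*:". Clock rows like "0(       GFXCLK) ..." do not match
# because the index is followed directly by "(".
HEADER = re.compile(r"^\s*(\d+)\s+([A-Z0-9_]+)\s*(\*?)\s*:\s*$")


def parse_profiles(text: str):
    """Return ({index: name}, active_index) from pp_power_profile_mode text."""
    profiles = {}
    active = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            index = int(match.group(1))
            profiles[index] = match.group(2)
            if match.group(3):  # the asterisk marks the active profile
                active = index
    return profiles, active
```

Applied to the output above, this should report index 5 (COMPUTE) as the active profile.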


jamesxu2 commented on July 17, 2024

@madness742 I'm glad to hear those changes have mitigated your crashes. Given that you aren't seeing any more crashes in your current configuration, can this issue be closed?


madness742 commented on July 17, 2024

Hi @jamesxu2, I'll go ahead and close this issue.

ComfyUI is working fine as well (generated 300 pics) after the power profile and power limit adjustment, as long as VRAM usage doesn't spike, which causes an OOM and a freeze. I was unable to test A1111 as it required much more VRAM than the other two, and I had trouble loading an SDXL model at the recommended 1024x1024 resolution.

Interestingly, setting the power profile to COMPUTE also resolves an issue in a specific game with VRR on a dual-monitor setup. Probably not related, but it might be worth mentioning.

