
Comments (10)

Vega-KH commented on June 10, 2024

I'm going to close this issue as it no longer has anything to do with the original title.

Just wanted to say that this is the best tech support I've had on any software, and you're doing it all for free. Thanks for all your work on this software.

victorchall commented on June 10, 2024

I'll take a look at the logging, but to help with your out of memory error you'll need to post your _cfg.json and tell me what GPU you're using.

victorchall commented on June 10, 2024

I updated the GPU ID query to use the device ID. I have no multi-GPU system to test on myself, but it should work.

I can't help with the CUDA out of memory error without a log.
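
For context, per-device VRAM readouts like the "Pretraining GPU Memory" line are typically done through NVML, which indexes cards in nvidia-smi (PCI bus) order rather than torch's CUDA order, so a mismatch between the two can produce a readout from the wrong card. A minimal sketch of such a query, assuming the nvidia-ml-py bindings (ED2's actual logging code may differ):

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)  # bytes on older bindings, str on newer
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    # NVML reports bytes; convert to MiB to match the log's "used / total MB" format
    print(f"NVML {i}: {name}, {mem.used // 2**20} / {mem.total // 2**20} MB")
pynvml.nvmlShutdown()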

Vega-KH commented on June 10, 2024

Thanks, I will run it this evening and post a log. I appreciate the help.

Vega-KH commented on June 10, 2024

I am still having VRAM issues. I started tonight by checking device IDs, and I was surprised that the GPU IDs in CUDA are the opposite of those reported by Windows Task Manager. I activated the venv and got this:

>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 4070 Ti'
>>> torch.cuda.get_device_name(1)
'NVIDIA GeForce GT 1030'

Obviously the 4070 Ti with 12 GB is what I'm hoping to train on. I put the 1030 in just so Windows could run on it, leaving the entire 12 GB of VRAM on the 4070 Ti available for training. So I changed the gpuid in train.json to 0 and still got an out of memory error. The log is attached below. One thing that seems weird to me is that on launching train.py, I see this:

Pretraining GPU Memory: 1024 / 2048 MB

Shouldn't that say 12 GB instead of 2 GB? But maybe it is working properly, because when it crashed it said this:

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 11.99 GiB total capacity; 11.05 GiB already allocated; 0 bytes free; 11.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So I don't know at this point whether the program is working on the proper GPU but just can't train within 12 GB, or whether the error has to do with having 2 GPUs.

venus-20230208-200215.log
venus-20230208-200215_cfg.json.txt
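
As a cross-check, torch can report each CUDA index's name and total VRAM in one pass, which would show directly whether index 0 is really the 12 GB card. A minimal sketch using standard torch calls:

import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # total_memory is in bytes; the 4070 Ti should show roughly 12 GiB
    print(f"cuda:{i} -> {props.name}, {props.total_memory / 2**30:.1f} GiB")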

victorchall commented on June 10, 2024

An alternative is to use the CUDA_VISIBLE_DEVICES env var.

Open a command line and try running this:

set CUDA_VISIBLE_DEVICES=0,
nvidia-smi

(see what GPU it prints out)

set CUDA_VISIBLE_DEVICES=1,
nvidia-smi

(see what GPU it prints out)

The env var will work for torch/ED2 as well.
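
The key behavior here is that CUDA_VISIBLE_DEVICES filters which devices the process can see at all, and the surviving devices are renumbered from 0. A minimal sketch of the same idea from Python, assuming the variable is set before torch initializes CUDA:

import os

# Must be set before CUDA is initialized; the exposed card is renumbered
# and will always appear to torch as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

print(torch.cuda.device_count())      # expect 1
print(torch.cuda.get_device_name(0))  # the single card exposed above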

Vega-KH commented on June 10, 2024

Thanks for the help. I didn't mention it before, but setting gpuid to 0 makes it crash faster: it crashes as soon as xformers loads.

So I tried what you suggested, "set CUDA_VISIBLE_DEVICES=0", and got this:

Grad scaler enabled: True (amp mode)
Epochs:   0%|                                                           | 0/10 [00:00<?, ?it/s, vram=1122/2048 MB gs:0]Something went wrong, attempting to save model                         | 3/557 [00:13<29:10,  3.16s/it, loss/step=0.117]

Still the same weird issue with VRAM being reported as 2 GB. So I tried it the other way, "set CUDA_VISIBLE_DEVICES=1", and got the immediate crash at xformers. Running nvidia-smi gave a weird result:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 528.24       Driver Version: 528.24       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:03:00.0  On |                  N/A |
| 87%   41C    P0    N/A /  30W |    775MiB /  2048MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ... WDDM  | 00000000:09:00.0 Off |                  N/A |
|  0%   32C    P8     2W / 285W |      0MiB / 12282MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

There it has the 4070 as GPU ID 1. That doesn't seem to make sense.

My only other idea is that maybe there is some incompatibility with 40-series cards. Maybe I need to update cuDNN or something.
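
One plausible explanation for the reversed numbering: by default the CUDA runtime enumerates devices fastest-first (CUDA_DEVICE_ORDER=FASTEST_FIRST), while nvidia-smi lists them in PCI bus order, so a 4070 Ti and a GT 1030 can legitimately carry opposite IDs in the two tools. Forcing PCI ordering makes them agree; a sketch of that, not something tested in this thread:

import os

# Enumerate CUDA devices in PCI bus order so the IDs match nvidia-smi,
# instead of the default fastest-first ordering.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # nvidia-smi showed the 4070 Ti as ID 1

import torch

print(torch.cuda.get_device_name(0))  # should now report the 4070 Ti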

victorchall commented on June 10, 2024

The nvidia-smi ID is what is going to count, and it seems to think your 12 GB card is ID 1.

Start a fresh command line so the env var is released, then try setting "gpuid": 1 in the everydream2 config and run it.

Also, when you use "set CUDA_VISIBLE_DEVICES=1," leave the trailing comma in; I believe it is required. Again, I'm unable to test, as I don't have a second NVIDIA GPU.

Vega-KH commented on June 10, 2024

Did that, immediate crash at xformers.

Enabled xformers
Traceback (most recent call last):
  File "C:\everydream\EveryDream2trainer\train.py", line 1043, in <module>
    main(args)
  File "C:\everydream\EveryDream2trainer\train.py", line 577, in main
    unet = unet.to(device, dtype=torch.float32)
  File "C:\everydream\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 927, in to
    return self._apply(convert)
  File "C:\everydream\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "C:\everydream\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "C:\everydream\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "C:\everydream\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 602, in _apply
    param_applied = fn(param)
  File "C:\everydream\EveryDream2trainer\venv\lib\site-packages\torch\nn\modules\module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The errors are different, so it is definitely crashing when trying to train on either card. I think I will just pull the 1030 out and see if I can train on the 4070 Ti even when Windows is also using the card. If that fails, I'll just train on Runpod; that's what I should be doing anyway, since 12 GB is probably not enough to really train.
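
If the 4070 Ti alone still runs out of memory, the suggestion in the earlier OOM message is worth a try before giving up on 12 GB. A sketch, with max_split_size_mb:128 as an illustrative value rather than one from this thread:

import os

# Cap the caching allocator's block splits to reduce fragmentation, as the
# earlier OOM message suggested; 128 MB here is only an illustrative value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # the setting is read when torch's CUDA allocator initializes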

victorchall commented on June 10, 2024

It still seems to be somehow hitting the wrong GPU. In an ideal world you could just use CUDA_VISIBLE_DEVICES, leave ED2 at gpuid 0, and it would completely put blinders on Python so it only sees the one GPU.
