
Comments (13)

hanxiao commented on May 21, 2024

found it!

jwelch1324 commented on May 21, 2024

On Docker Hub the digests are all the same up to 0.6.3, so everything up to 0.6.3 presumably works as well.

Can confirm that tag 0.6.4 is where the change occurred that is causing the OOM error with these params

hanxiao commented on May 21, 2024

> but oddly I cannot find the tag of an image on dockerhub that has this hash anymore.

latest is the default, there is no need to specify it; it is just the default tag per the Docker specification.

> Can confirm that tag 0.6.4 is where the change occurred that is causing the OOM error with these params

This is the comparison between 0.6.3 and 0.6.4; I don't see a significant change here:

v0.6.3...v0.6.4

hanxiao commented on May 21, 2024

Also, could you show the full error trace of the OOM? I'd like to see where it comes from.

jwelch1324 commented on May 21, 2024

> latest is the default, there is no need to specify it; it is just the default tag per the Docker specification.

Understood -- what I meant was that the digest listed on Docker Hub going many versions back is identical; the change doesn't occur until the 0.6.3 -> 0.6.4 transition (and the digest I pulled two weeks ago isn't around anymore 🤷).

[screenshot: Docker Hub tag list for jinaai/discoart showing identical digests across versions]

Notice the digest is 10bd4c249f59 for several versions prior to 0.6.4 -- this is all from https://hub.docker.com/r/jinaai/discoart/tags?page=1&ordering=last_updated

Either way, I just confirmed that pulling 0.6.3 directly and running the query there does not result in the OOM error.

> This is the comparison between 0.6.3 and 0.6.4; I don't see a significant change here

This is very strange, because there is definitely a difference in the user experience: in 0.6.3 and prior, the following header is not visible in the Jupyter notebook while the query is running:

[screenshot: run header displayed in the notebook on 0.6.4]

All that is shown in prior versions is the progress bar and the DocArray name.

> Also, could you show the full error trace of the OOM? I'd like to see where it comes from.

Ah yes, sorry, I should have done that at the start -- here is the stack trace:

2022-07-18 18:27:48,370 - discoart - ERROR - CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 5.93 GiB total capacity; 5.17 GiB already allocated; 3.31 MiB free; 5.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/discoart/create.py", line 176, in create
    do_run(_args, (model, diffusion, clip_models, secondary_model), device=device)
  File "/usr/local/lib/python3.8/dist-packages/discoart/runner.py", line 336, in do_run
    for j, sample in enumerate(samples):
  File "/root/.cache/discoart/guided_diffusion/guided_diffusion/gaussian_diffusion.py", line 897, in ddim_sample_loop_progressive
    out = sample_fn(
  File "/root/.cache/discoart/guided_diffusion/guided_diffusion/gaussian_diffusion.py", line 674, in ddim_sample
    out = self.condition_score(cond_fn, out_orig, x, t, model_kwargs=model_kwargs)
  File "/root/.cache/discoart/guided_diffusion/guided_diffusion/respace.py", line 102, in condition_score
    return super().condition_score(self._wrap_model(cond_fn), *args, **kwargs)
  File "/root/.cache/discoart/guided_diffusion/guided_diffusion/gaussian_diffusion.py", line 399, in condition_score
    eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
  File "/root/.cache/discoart/guided_diffusion/guided_diffusion/respace.py", line 128, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/discoart/runner.py", line 214, in cond_fn
    model_stat['clip_model'].encode_image(clip_in).float()
  File "/usr/local/lib/python3.8/dist-packages/open_clip/model.py", line 435, in encode_image
    return self.visual(image)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/open_clip/model.py", line 187, in forward
    x = self.layer3(x)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/open_clip/model.py", line 58, in forward
    out = self.bn3(self.conv3(out))
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 5.93 GiB total capacity; 5.17 GiB already allocated; 3.31 MiB free; 5.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
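
As the error message itself suggests, one mitigation worth trying on a small card is setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to reduce allocator fragmentation. This is only a minimal sketch, not a guaranteed fix; the 128 MiB value is an illustrative choice, and the variable must be set before torch initializes CUDA:

import os

# Must run before torch (and therefore discoart) is imported,
# otherwise the CUDA caching allocator has already been configured.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

from discoart import create  # importing after setting the variable picks it up

The same variable can also be exported in the shell, or passed to the container with docker run -e, before starting the notebook.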

hanxiao commented on May 21, 2024

Besides the OOM, there was a serious bug in the Dockerfile which meant that any Docker image version before 0.6.3 was actually stuck at a 0.0.x version of discoart.

You can pull any Docker image before 0.6.3 and run the following inside it:

import discoart
print(discoart.__version__)

It will print 0.0.x.

jwelch1324 commented on May 21, 2024

> Besides the OOM, there was a serious bug in the Dockerfile which meant that any Docker image version before 0.6.3 was actually stuck at a 0.0.x version of discoart.

This certainly explains why the Docker tags have the same digest, I suspect.

OK, so the change that led to the OOM here is probably from a much earlier version. While I would like the ability to use my smaller GPU for small image experiments up to 512x512 and leave the larger GPUs for high-res renderings, if whatever change occurred makes 6GB cards too outdated to use with the system, I am OK with that answer as well.

hanxiao commented on May 21, 2024

Yesterday I spent 4 hours digging into this issue, and I found that the default settings of create() with use_secondary_model=False begin to OOM on the latest version. I finally pinpointed the issue to 0.2.0; the OOM starts to occur on 0.2.2 (0.2.1 is broken).

The result is a big twist: before 0.2.0 there was a bug where setting use_secondary_model=False did not disable the secondary model; it still used the secondary model anyway. So not OOMing before 0.2.0 was a bug, and OOMing after that is the correct behavior.

FYI, the name use_secondary_model is a bit misleading. It replaces the calculation of p_mean_variance with a smaller model, and p_mean_variance computes p(x_{t-1} | x_t) as well as a prediction of the initial x, x_0, which is a pretty computationally intensive procedure. So roughly speaking, use_secondary_model means: approximate p_mean_variance with a smaller model. Hence, not turning on use_secondary_model does not mean you save computation; it actually introduces more computation!
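
In other words, the flag selects between two ways of estimating the denoised image x_0 inside the guidance step. A rough sketch of the idea in Python (illustrative names only, not the actual discoart/guided_diffusion code):

def predict_x0(diffusion, model, secondary_model, x, t, use_secondary_model):
    # Illustrative sketch only -- not the real implementation.
    if use_secondary_model:
        # Cheap path: a small secondary model predicts x_0 directly
        # (the real call and its return type differ slightly).
        return secondary_model(x, t)
    # Expensive path: run the full diffusion model's p_mean_variance,
    # which models p(x_{t-1} | x_t) and also yields a prediction of x_0.
    out = diffusion.p_mean_variance(model, x, t)
    return out['pred_xstart']

With use_secondary_model=False, every guidance step takes the expensive branch, which is why memory and compute go up rather than down.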

jwelch1324 commented on May 21, 2024

OK, this is good to know! Also, thanks for the insight into what use_secondary_model actually does :D

Based on what you are saying, though, I think it implies that, in general, 6GB of VRAM is going to be insufficient going forward for 512x512 resolution (using the CLIP models I have shown here; maybe others work, I haven't tried) -- is this correct? I tried changing the setting of use_secondary_model in my parameters (it was True, so I set it to False) and I still get an OOM error, though it occurs in a different code location -- it is now inside the unet.py module:

 Traceback (most recent call last):
   File "/usr/local/lib/python3.8/dist-packages/discoart/create.py", line 176, in create
     do_run(_args, (model, diffusion, clip_models, secondary_model), device=device)
   File "/usr/local/lib/python3.8/dist-packages/discoart/runner.py", line 338, in do_run
     for j, sample in enumerate(samples):
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/gaussian_diffusion.py", line 897, in ddim_sample_loop_progressive
     out = sample_fn(
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/gaussian_diffusion.py", line 674, in ddim_sample
     out = self.condition_score(cond_fn, out_orig, x, t, model_kwargs=model_kwargs)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/respace.py", line 102, in condition_score
     return super().condition_score(self._wrap_model(cond_fn), *args, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/gaussian_diffusion.py", line 399, in condition_score
     eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/respace.py", line 128, in __call__
     return self.model(x, new_ts, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/discoart/runner.py", line 194, in cond_fn
     out = diffusion.p_mean_variance(
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/respace.py", line 91, in p_mean_variance
     return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/gaussian_diffusion.py", line 260, in p_mean_variance
     model_output = model(x, self._scale_timesteps(t), **model_kwargs)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/respace.py", line 128, in __call__
     return self.model(x, new_ts, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
     return forward_call(*input, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 661, in forward
     h = module(h, emb)
   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
     return forward_call(*input, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 77, in forward
     x = layer(x)
   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
     return forward_call(*input, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 297, in forward
     return checkpoint(self._forward, (x,), self.parameters(), self.use_checkpoint)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/nn.py", line 138, in checkpoint
     return func(*inputs)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 303, in _forward
     h = self.attention(qkv)
   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
     return forward_call(*input, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 352, in forward
     weight = th.softmax(weight.float(), dim=-1).type(weight.dtype)
 RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 5.93 GiB total capacity; 5.24 GiB already allocated; 9.31 MiB free; 5.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
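
For reference, the flag being flipped here is just a keyword argument to create(); a minimal sketch of the call, with every other parameter left at whatever the original query used and therefore omitted:

from discoart import create

# Sketch only: flip the secondary-model flag; all other arguments
# (CLIP models, width_height, etc.) stay as in the original query and are omitted here.
da = create(use_secondary_model=False)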

hanxiao commented on May 21, 2024

What I tried to express here is that use_secondary_model=True uses more memory than use_secondary_model=False.

jwelch1324 commented on May 21, 2024

> What I tried to express here is that use_secondary_model=True uses more memory than use_secondary_model=False.

Understood -- however, even with it set to False (on the latest version), a different OOM error is raised. I was not sure, then, whether the secondary-model aspect was related to the new OOM error.

hanxiao commented on May 21, 2024

I just found a point of potential improvement in the p_mean_variance implementation, which can be used to reduce the VRAM footprint; I will work on this tomorrow.

hanxiao commented on May 21, 2024

BTW, my finding is irrelevant to the OP's issue; 6GB of VRAM is nonetheless too small 😅
