Comments (13)
found it!
from discoart.
On Docker Hub the digests are all the same up to 0.6.3 -- so everything up to 0.6.3 presumably works as well --
Can confirm that tag 0.6.4 is where the change occurred that is causing the OOM error with these params
from discoart.
but oddly I cannot find the tag of an image on Docker Hub that has this hash anymore.
latest is the default; there is no need to specify it, it is just the default tag per the Docker specification
Can confirm that tag 0.6.4 is where the change occurred that is causing the OOM error with these params
This is the comparison between 0.6.3 and 0.6.4; I don't see a significant change here
from discoart.
Also, could you show the full error trace of the OOM? I'd like to see where it comes from.
from discoart.
latest is the default; there is no need to specify it, it is just the default tag per the Docker specification
Understood -- what I meant was that the digest listed on Docker Hub going many versions back is identical... the change doesn't occur until the 0.6.3 -> 0.6.4 transition (and the digest I pulled 2 weeks ago isn't around anymore 🤷)
Notice the digest is 10bd4c249f59 for several versions prior to 0.6.4 -- this is all from https://hub.docker.com/r/jinaai/discoart/tags?page=1&ordering=last_updated
Either way, I just confirmed that pulling 0.6.3 directly and running the query there does not result in the OOM error.
This is the comparison between 0.6.3 and 0.6.4; I don't see a significant change here
This is very strange -- because there is definitely a difference in the user experience -- in 0.6.3 and prior, the following header is not visible in the Jupyter notebook while the query is running -- all that is shown in prior versions is the progress bar and the DocArray name.
Also, could you show the full error trace of the OOM? I'd like to see where it comes from.
Ahh yes, sorry, I should have done that at the start -- here is the stack trace:
2022-07-18 18:27:48,370 - discoart - ERROR - CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 5.93 GiB total capacity; 5.17 GiB already allocated; 3.31 MiB free; 5.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/discoart/create.py", line 176, in create
do_run(_args, (model, diffusion, clip_models, secondary_model), device=device)
File "/usr/local/lib/python3.8/dist-packages/discoart/runner.py", line 336, in do_run
for j, sample in enumerate(samples):
File "/root/.cache/discoart/guided_diffusion/guided_diffusion/gaussian_diffusion.py", line 897, in ddim_sample_loop_progressive
out = sample_fn(
File "/root/.cache/discoart/guided_diffusion/guided_diffusion/gaussian_diffusion.py", line 674, in ddim_sample
out = self.condition_score(cond_fn, out_orig, x, t, model_kwargs=model_kwargs)
File "/root/.cache/discoart/guided_diffusion/guided_diffusion/respace.py", line 102, in condition_score
return super().condition_score(self._wrap_model(cond_fn), *args, **kwargs)
File "/root/.cache/discoart/guided_diffusion/guided_diffusion/gaussian_diffusion.py", line 399, in condition_score
eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
File "/root/.cache/discoart/guided_diffusion/guided_diffusion/respace.py", line 128, in __call__
return self.model(x, new_ts, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/discoart/runner.py", line 214, in cond_fn
model_stat['clip_model'].encode_image(clip_in).float()
File "/usr/local/lib/python3.8/dist-packages/open_clip/model.py", line 435, in encode_image
return self.visual(image)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/open_clip/model.py", line 187, in forward
x = self.layer3(x)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/open_clip/model.py", line 58, in forward
out = self.bn3(self.conv3(out))
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 5.93 GiB total capacity; 5.17 GiB already allocated; 3.31 MiB free; 5.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
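(The message itself suggests trying max_split_size_mb; for completeness, a minimal sketch of setting it before anything allocates on the GPU -- the 128 MiB value is just a guess, and this only mitigates fragmentation, it cannot free up more VRAM:)
import os
# must be set before the CUDA caching allocator is initialized; 128 is an arbitrary example value
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'
from discoart import create
da = create()  # same parameters as in the report above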
from discoart.
Besides the OOM, there was a serious bug in the Dockerfile, which means that any Docker image version before 0.6.3 is stuck at a 0.0.x version of discoart.
You can pull any Docker image before 0.6.3, and if inside it you run
import discoart
print(discoart.__version__)
it will print 0.0.x.
from discoart.
Besides the OOM, there was a serious bug in the Dockerfile, which means that any Docker image version before 0.6.3 is stuck at a 0.0.x version of discoart.
That certainly explains why the Docker tags have the same digest, I suspect.
Ok, so the change that led to the OOM here is probably from a much earlier version. While I would like the ability to use my smaller GPU for small image experiments up to 512x512 and leave the larger GPUs for high-res renderings, if whatever change occurred makes the 6GB cards too outdated to use with the system, I am also OK with that answer.
from discoart.
Yesterday I spent four hours digging into this issue, and I found that the default settings of create() but with use_secondary_model=False begin to OOM on the latest version. I finally pinpointed the issue to 0.2.0: the OOM starts to occur on 0.2.2 (0.2.1 is broken).
The result is a big twist: before 0.2.0 there was a bug where setting use_secondary_model=False did not actually disable the secondary model; it still used the secondary model. So not OOMing before 0.2.0 was a bug, and OOMing after that is the correct behavior.
FYI, the name use_secondary_model is a bit misleading: it replaces the calculation of p_mean_variance with a smaller model. p_mean_variance is about getting p(x_{t-1} | x_t), as well as a prediction of the initial image x_0, which is a pretty computationally intensive procedure. So roughly speaking, use_secondary_model means "approximate p_mean_variance with a smaller model".
Hence, not turning on use_secondary_model does not mean you save computation; instead, it introduces more computation!
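Roughly, as pseudocode (a simplified sketch of the guidance step, not the exact runner.py code; the secondary_model(x, t).pred form and the variable names are assumptions):
# simplified sketch of one guidance step inside cond_fn
if use_secondary_model:
    # the small secondary model predicts x_0 directly -- cheap
    x0_pred = secondary_model(x, t).pred
else:
    # an extra full UNet pass via p_mean_variance just to obtain x_0 -- expensive
    out = diffusion.p_mean_variance(model, x, t, model_kwargs=model_kwargs)
    x0_pred = out['pred_xstart']
# x0_pred is then fed to the CLIP models to compute the guidance gradient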
from discoart.
Ok, so this is good to know! -- Also, thanks for the insight into what use_secondary_model actually does :D
Based on what you are saying, though, I think it implies that in general 6GB of VRAM is going to be insufficient moving forward for 512x512 resolution (using the CLIP models I have shown here -- maybe others work, I haven't tried) -- is this correct? I tried changing the setting of use_secondary_model in my parameters (it was True so I set it to False) and I still get an OOM error, though it occurs in a different code location -- it is inside the unet.py module now:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/discoart/create.py", line 176, in create
do_run(_args, (model, diffusion, clip_models, secondary_model), device=device)
File "/usr/local/lib/python3.8/dist-packages/discoart/runner.py", line 338, in do_run
for j, sample in enumerate(samples):
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/gaussian_diffusion.py", line 897, in ddim_sample_loop_progressive
out = sample_fn(
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/gaussian_diffusion.py", line 674, in ddim_sample
out = self.condition_score(cond_fn, out_orig, x, t, model_kwargs=model_kwargs)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/respace.py", line 102, in condition_score
return super().condition_score(self._wrap_model(cond_fn), *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/gaussian_diffusion.py", line 399, in condition_score
eps = eps - (1 - alpha_bar).sqrt() * cond_fn(
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/respace.py", line 128, in __call__
return self.model(x, new_ts, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/discoart/runner.py", line 194, in cond_fn
out = diffusion.p_mean_variance(
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/respace.py", line 91, in p_mean_variance
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/gaussian_diffusion.py", line 260, in p_mean_variance
model_output = model(x, self._scale_timesteps(t), **model_kwargs)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/respace.py", line 128, in __call__
return self.model(x, new_ts, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 661, in forward
h = module(h, emb)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 77, in forward
x = layer(x)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 297, in forward
return checkpoint(self._forward, (x,), self.parameters(), self.use_checkpoint)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/nn.py", line 138, in checkpoint
return func(*inputs)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 303, in _forward
h = self.attention(qkv)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/guided_diffusion/unet.py", line 352, in forward
weight = th.softmax(weight.float(), dim=-1).type(weight.dtype)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 5.93 GiB total capacity; 5.24 GiB already allocated; 9.31 MiB free; 5.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
from discoart.
What I tried to express here is that use_secondary_model=True uses lower memory than use_secondary_model=False.
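If someone wants to verify the two settings side by side, a quick sketch with PyTorch's peak-memory counters (assuming a card with enough headroom for both runs; as shown above, the False run already OOMs on 6GB):
import torch
from discoart import create

torch.cuda.reset_peak_memory_stats()
create(use_secondary_model=True)  # or False; keep the other parameters identical
print(torch.cuda.max_memory_allocated() / 2**30, 'GiB peak')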
from discoart.
What I tried to express here is that use_secondary_model=True uses lower memory than use_secondary_model=False.
Understood -- however, even with it set to False (on the latest version) a different OOM error is raised -- I was not sure whether the secondary model aspect was related to the new OOM error?
from discoart.
I just found a point of potential improvement in the p_mean_variance implementation that can be used to reduce the VRAM footprint; I will work on this tomorrow.
from discoart.
BTW, my finding is irrelevant to the OP's problem: 6GB of VRAM is nonetheless too small 😅
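For anyone who still wants to try on a small card, the obvious knobs are the resolution and the set of CLIP models; a hedged example (parameter names as in the discoart README, the values and the CLIP model name format are assumptions, and there is no guarantee it fits in 6GB):
from discoart import create

da = create(
    text_prompts='a painting of a lighthouse in a storm',  # any prompt
    width_height=[448, 448],           # below the 512x512 used above
    clip_models=['ViT-B-32::openai'],  # a single, smaller CLIP model; name format is an assumption
    use_secondary_model=True,          # keep the secondary model enabled; disabling it adds the expensive p_mean_variance pass
    n_batches=1,
)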
from discoart.