
karlo's Introduction

Karlo-v1.0.alpha on COYO-100M and CC15M

Karlo is a text-conditional image generation model based on OpenAI's unCLIP architecture, with an improved super-resolution module that upscales from 64px to 256px and recovers high-frequency details in only a small number of denoising steps.

"a portrait of an old monk, highly detailed."

"Photo of a business woman, silver hair"

"A teddy bear on a skateboard, children drawing style."

"Goryeo celadon in the shape of bird"

This alpha version of Karlo is trained on 115M image-text pairs, including a high-quality subset of COYO-100M, CC3M, and CC12M. If you are interested in a better version of Karlo trained on larger-scale, high-quality datasets, please visit the landing page of our application B^DISCOVER.

Updates

Model Architecture

Overview

Karlo is a text-conditional diffusion model based on unCLIP, composed of prior, decoder, and super-resolution modules. In this repository, we include an improved version of the standard super-resolution module, which upscales from 64px to 256px in only 7 reverse steps, as illustrated in the figure below:

Specifically, the standard SR module, trained with the DDPM objective, upscales from 64px to 256px in the first 6 denoising steps using the respacing technique. Then, an additional SR module fine-tuned with a VQ-GAN-style loss performs the final reverse step to recover high-frequency details. We observe that this approach is very effective for upscaling low-resolution images in a small number of reverse steps.
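To make the schedule concrete, here is a minimal sketch of the two-stage sampling loop described above. The callables `base_sr_step` and `finetuned_sr_step` are hypothetical stand-ins for the DDPM-trained and fine-tuned denoisers, not the repository's actual API; the bicubic initialization mirrors the resize in karlo/sampler/t2i.py.

```python
import torch
import torch.nn.functional as F

def upscale_64_to_256(images_64, base_sr_step, finetuned_sr_step, num_ddpm_steps=6):
    """Sketch of the 7-step SR schedule: 6 respaced DDPM steps + 1 fine-tuned step."""
    # Start from a bicubic upsampling of the 64px samples, as in karlo/sampler/t2i.py.
    x = F.interpolate(images_64, size=(256, 256), mode="bicubic", antialias=True)
    # Steps 1-6: respaced DDPM denoising with the standard SR module.
    for t in reversed(range(num_ddpm_steps)):
        x = base_sr_step(x, t)
    # Step 7: the VQ-GAN-style fine-tuned module recovers high-frequency details.
    return finetuned_sr_step(x)
```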

Details

We train all components from scratch on 115M image-text pairs, including COYO-100M, CC3M, and CC12M. For the prior and decoder, we use ViT-L/14 provided by OpenAI's CLIP repository. Unlike the original implementation of unCLIP, we replace the trainable transformer in the decoder with the text encoder of ViT-L/14 for efficiency. For the SR module, we first train the model with the DDPM objective for 1M steps, followed by an additional 234K steps to fine-tune the additional component. The table below summarizes the important statistics of our components:

|                      | Prior                           | Decoder                 | SR          |
|----------------------|---------------------------------|-------------------------|-------------|
| CLIP                 | ViT-L/14                        | ViT-L/14                | -           |
| #param               | 1B                              | 900M                    | 700M + 700M |
| #optimization steps  | 1M                              | 1M                      | 1M + 0.2M   |
| #sampling steps      | 25                              | 50 (default), 25 (fast) | 7           |
| Checkpoint links     | ViT-L-14, ViT-L-14 stats, model | model                   | model       |

In the checkpoint links, ViT-L-14 is equivalent to the original OpenAI version, but we include it for convenience. We also note that ViT-L-14 stats is required to normalize the outputs of the prior module.

Evaluation

We quantitatively measure the performance of Karlo-v1.0.alpha on the validation splits of CC3M and MS-COCO. The tables below present CLIP score and FID. To measure FID, we resize the shorter side of each image to 256px and then crop it at the center. We set the classifier-free guidance scales for the prior and decoder to 4 and 8 in all cases. We observe that our model achieves reasonable performance even with only 25 decoder sampling steps.

CC3M

| Sampling steps                     | CLIP-s (ViT-B/16) | FID (13k from val) |
|------------------------------------|-------------------|--------------------|
| Prior (25) + Decoder (25) + SR (7) | 0.3081            | 14.37              |
| Prior (25) + Decoder (50) + SR (7) | 0.3086            | 13.95              |

MS-COCO

| Sampling steps                     | CLIP-s (ViT-B/16) | FID (30k from val) |
|------------------------------------|-------------------|--------------------|
| Prior (25) + Decoder (25) + SR (7) | 0.3192            | 15.24              |
| Prior (25) + Decoder (50) + SR (7) | 0.3192            | 14.43              |
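For reference, the resize-and-crop preprocessing used for FID above might look like the following in torchvision; this is a sketch matching the description, not the exact evaluation code.

```python
from torchvision import transforms

# Resize the shorter side to 256px, then crop at the center, as described above.
fid_preprocess = transforms.Compose([
    transforms.Resize(256),      # scales the shorter side to 256px
    transforms.CenterCrop(256),  # 256x256 center crop
])
```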

For more information, please refer to the upcoming technical report.

🧨 Diffusers integration

Our unCLIP implementation is officially integrated into the 🧨 diffusers library.

```bash
# Prerequisites to run Karlo unCLIP on diffusers
pip install diffusers transformers accelerate safetensors
```

```python
from diffusers import UnCLIPPipeline
import torch

# Load the Karlo unCLIP pipeline in half precision and move it to the GPU.
pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a high-resolution photograph of a big red frog on a green leaf."
image = pipe(prompt).images[0]
image.save("./frog.png")
```

Check out the diffusers docs for the full usage of the UnCLIPPipeline.
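As a small usage variant (continuing the example above, and assuming a recent diffusers version), you can fix the random seed and request several images per prompt with the standard pipeline arguments:

```python
# Reproducible sampling: a seeded generator plus two images per prompt.
generator = torch.Generator(device="cuda").manual_seed(0)
images = pipe(prompt, num_images_per_prompt=2, generator=generator).images
for i, img in enumerate(images):
    img.save(f"./frog_{i}.png")
```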

Environment Setup

We use a single V100 with 32GB VRAM for sampling, under PyTorch >= 1.10 and CUDA >= 11. The following commands install the additional Python packages and download the pretrained model checkpoints. Alternatively, you can simply install the packages and download the weights via setup.sh.

  • Additional Python packages

```bash
pip install -r requirements.txt
```

  • Model checkpoints

```bash
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/096db1af569b284eb76b3881534822d9/ViT-L-14.pt -P $KARLO_ROOT_DIR  # same as the official ViT-L/14 from OpenAI
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/0b62380a75e56f073e2844ab5199153d/ViT-L-14_stats.th -P $KARLO_ROOT_DIR
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/efdf6206d8ed593961593dc029a8affa/decoder-ckpt-step%3D01000000-of-01000000.ckpt -P $KARLO_ROOT_DIR
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/85626483eaca9f581e2a78d31ff905ca/prior-ckpt-step%3D01000000-of-01000000.ckpt -P $KARLO_ROOT_DIR
wget https://arena.kakaocdn.net/brainrepo/models/karlo-public/v1.0.0.alpha/4226b831ae0279020d134281f3c31590/improved-sr-ckpt-step%3D1.2M.ckpt -P $KARLO_ROOT_DIR
```

Sampling

Gradio demo (T2I and Image variation)

The following command launches the Gradio demo for text-to-image generation and image variation. We have noticed that the second run in the Gradio demo is unexpectedly slower than usual under PyTorch >= 1.12; we suspect this happens because launching the CUDA kernels takes some time, usually up to 2 minutes.

```bash
python demo/product_demo.py --host 0.0.0.0 --port $PORT --root-dir $KARLO_ROOT_DIR
```

The samples below are non-cherry-picked T2I and image variation examples with random seed 0. In each case, the first row shows T2I samples and the second row shows image variations of the leftmost image in the first row.

[T2I + Image variation] "A man with a face of avocado, in the drawing style of Rene Magritte."

[T2I + Image variation] "a black porcelain in the shape of pikachu"

T2I command line example

Here, we include a command line example of T2I. For image variation, you can refer to karlo/sampler/i2i.py to see how the prior is replaced with the CLIP image feature (a conceptual sketch follows the command below).

```bash
python example.py --root-dir=$KARLO_ROOT_DIR \
                  --prompt="A man with a face of avocado, in the drawing style of Rene Magritte" \
                  --output-dir=$OUTPUT_DIR \
                  --max-bsz=2 \
                  --sampling-type=fast
```
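As mentioned above, image variation works by replacing the prior's sampled embedding with the CLIP image feature of a source image. Below is a conceptual sketch of that substitution using OpenAI's clip package; it mirrors the idea in karlo/sampler/i2i.py, not its exact code (the file name source.png is illustrative).

```python
import clip  # OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git)
import torch
from PIL import Image

# Encode the source image with the same CLIP ViT-L/14 backbone Karlo uses.
model, preprocess = clip.load("ViT-L/14", device="cuda")
source = preprocess(Image.open("source.png")).unsqueeze(0).to("cuda")
with torch.no_grad():
    img_feat = model.encode_image(source)

# `img_feat` then takes the place of the prior's sampled image embedding when
# conditioning the decoder; see karlo/sampler/i2i.py for the actual plumbing.
```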

License and Disclaimer

This project, including the weights, is distributed under the CreativeML Open RAIL-M license, the same license used for Stable Diffusion v1. You may use this model in commercial applications, but we highly recommend adopting a powerful safety checker as post-processing. We also note that we are not responsible for any use of the generated images.

BibTeX

If you find this repository useful in your research, please cite:

```bibtex
@misc{kakaobrain2022karlo-v1-alpha,
  title         = {Karlo-v1.0.alpha on COYO-100M and CC15M},
  author        = {Donghoon Lee and Jiseob Kim and Jisu Choi and Jongmin Kim and Minwoo Byeon and Woonhyuk Baek and Saehoon Kim},
  year          = {2022},
  howpublished  = {\url{https://github.com/kakaobrain/karlo}},
}
```

Acknowledgement

Contact

If you would like to collaborate with us or share feedback, please e-mail us at [email protected]


karlo's Issues

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). And does anybody know how to generate on multiple GPUs?

When I use
CUDA_VISIBLE_DEVICES=0 python demo/product_demo.py --host 0.0.0.0 --port 9870 --root-dir ./models/
(env: python 3.10.8, torch==1.13.0+cu116)
I can enter the Gradio UI, but when I generate, the error below happens.

But when I switch to another env (python 3.8.13, torch==1.12.1+cu113), it works but goes OOM... it seems 24GB is not enough for it.
I have four NVIDIA 3090s, so I want to know if there is any way to generate on multiple GPUs.

/root/anaconda3/envs/py310/lib/python3.10/site-packages/torch/serialization.py:779: UserWarning: 'torch.load' received a zip file that looks like a TorchScript archive dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to silence this warning)
warnings.warn("'torch.load' received a zip file that looks like a TorchScript archive"
INFO:root:Loading prior: prior-ckpt-step=01000000-of-01000000.ckpt
INFO:root:done.
INFO:root:Loading decoder: decoder-ckpt-step=01000000-of-01000000.ckpt
INFO:root:done.
INFO:root:Loading SR(64->256): improved-sr-ckpt-step=1.2M.ckpt
INFO:root:done.
Running on local URL: http://0.0.0.0:9870

To create a public link, set share=True in launch().

text_input: a dog
prior_sm: 25
prior_cf_scale: 4
decoder_sm: 25
decoder_cf_scale: 8
sr_sm: 7
seed: 0
max_bsz: 4
Exception in thread Thread-2 (_sample):
Traceback (most recent call last):
File "/root/anaconda3/envs/py310/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/root/anaconda3/envs/py310/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/tongange/karlo-main/demo/components.py", line 171, in _sample
for k, out in enumerate(output_generator):
File "/home/tongange/karlo-main/karlo/sampler/t2i.py", line 109, in call
img_feat = self._prior(
File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/tongange/karlo-main/karlo/models/prior_model.py", line 120, in forward
sample = sample_fn(
File "/home/tongange/karlo-main/karlo/modules/diffusion/gaussian_diffusion.py", line 533, in p_sample_loop
for sample in self.p_sample_loop_progressive(
File "/home/tongange/karlo-main/karlo/modules/diffusion/gaussian_diffusion.py", line 584, in p_sample_loop_progressive
out = self.p_sample(
File "/home/tongange/karlo-main/karlo/modules/diffusion/gaussian_diffusion.py", line 483, in p_sample
out = self.p_mean_variance(
File "/home/tongange/karlo-main/karlo/modules/diffusion/respace.py", line 97, in p_mean_variance
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
File "/home/tongange/karlo-main/karlo/modules/diffusion/gaussian_diffusion.py", line 338, in p_mean_variance
model_output = model(x, t, **model_kwargs)
File "/home/tongange/karlo-main/karlo/modules/diffusion/respace.py", line 108, in wrapped
x, self.timestep_map[ts].to(device=ts.device, dtype=ts.dtype), **kwargs
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

(Error) Question about "CUDA out of memory."

I am a beginner, so please bear with me if this question lacks detail.
While running the example, I encountered a "CUDA out of memory." error; I would appreciate any advice on how to resolve it.

(Environment)

- Ubuntu 18.04
- GPU: NVIDIA GeForce RTX 3090 (24GB memory)
- python 3.8
- black 22.6.0
- pytorch 1.10.0 / 1.12.1 (tried both versions; same result)
- torchvision 0.11.0 / 0.13.1 (tried both versions; same result)
- einops 0.6.0
- omegaconf 2.2.3
- matplotlib 3.3.4
- gradio 3.12.0

(Running the example)

> python demo/product_demo.py --host 127.0.0.1 --port 6021 --root-dir /home/jyseo/project/kakaobrain_karlo/karlo

(Example output)
/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/site-packages/torch/serialization.py:707: UserWarning: 'torch.load' received a zip file that looks like a TorchScript archive dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to silence this warning)
warnings.warn("'torch.load' received a zip file that looks like a TorchScript archive"
INFO:root:Loading prior: prior-ckpt-step=01000000-of-01000000.ckpt
INFO:root:done.
INFO:root:Loading decoder: decoder-ckpt-step=01000000-of-01000000.ckpt
INFO:root:done.
INFO:root:Loading SR(64->256): improved-sr-ckpt-step=1.2M.ckpt
INFO:root:done.
Running on local URL: http://127.0.0.1:6021

To create a public link, set share=True in launch().

(Error)

text_input: a black porcelain in the shape of pikachu
prior_sm: 25
prior_cf_scale: 4
decoder_sm: 25
decoder_cf_scale: 8
sr_sm: 7
seed: 0
max_bsz: 4
Exception in thread Thread-2:
Traceback (most recent call last):
File "/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/jyseo/project/kakaobrain_karlo/karlo/demo/components.py", line 171, in _sample
for k, out in enumerate(output_generator):
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/sampler/t2i.py", line 150, in call
for k, out in enumerate(images_256_outputs):
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/models/sr_64_256.py", line 86, in forward
for x in sample_outputs:
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/diffusion/gaussian_diffusion.py", line 631, in p_sample_loop_progressive_for_improved_sr
out = self.p_sample(
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/diffusion/gaussian_diffusion.py", line 483, in p_sample
out = self.p_mean_variance(
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/diffusion/respace.py", line 97, in p_mean_variance
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/diffusion/gaussian_diffusion.py", line 338, in p_mean_variance
model_output = model(x, t, **model_kwargs)
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/diffusion/respace.py", line 107, in wrapped
return model(
File "/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/unet.py", line 691, in forward
return super().forward(x, timesteps, **kwargs)
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/unet.py", line 665, in forward
h = module(h, emb)
File "/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/unet.py", line 44, in forward
x = layer(x, emb)
File "/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/unet.py", line 223, in forward
h = self.out_layers(h)
File "/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jyseo/project/kakaobrain_karlo/karlo/karlo/modules/nn.py", line 18, in forward
y = super().forward(x.float()).to(x.dtype)
RuntimeError: CUDA out of memory. Tried to allocate 640.00 MiB (GPU 0; 23.70 GiB total capacity; 20.26 GiB already allocated; 635.88 MiB free; 21.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

(Error message)

RuntimeError: CUDA out of memory. Tried to allocate 640.00 MiB (GPU 0; 23.70 GiB total capacity; 20.26 GiB already allocated; 635.88 MiB free; 21.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

(Workaround) --> modified the code in karlo/sampler/t2i.py: (original) [256, 256] --> (changed) [128, 128]

        """ Upsample 64x64 to 256x256 """
        images_256 = TVF.resize(
            images_64,
            #(기존)[256, 256],  
            #(변경)
            [128, 128],
            interpolation=InterpolationMode.BICUBIC,
            antialias=True,
        )

(Workaround result) --> no error, but the output is incomplete

python demo/product_demo.py --host 127.0.0.1 --port 6023 --root-dir /home/jyseo/project/kakaobrain_karlo/karlo
/home/jyseo/miniconda2/envs/mapnet_py38/lib/python3.8/site-packages/torch/serialization.py:707: UserWarning: 'torch.load' received a zip file that looks like a TorchScript archive dispatching to 'torch.jit.load' (call 'torch.jit.load' directly to silence this warning)
warnings.warn("'torch.load' received a zip file that looks like a TorchScript archive"
INFO:root:Loading prior: prior-ckpt-step=01000000-of-01000000.ckpt
INFO:root:done.
INFO:root:Loading decoder: decoder-ckpt-step=01000000-of-01000000.ckpt
INFO:root:done.
INFO:root:Loading SR(64->256): improved-sr-ckpt-step=1.2M.ckpt
INFO:root:done.
Running on local URL: http://127.0.0.1:6023

To create a public link, set share=True in launch().


text_input: a black porcelain in the shape of pikachu
prior_sm: 25
prior_cf_scale: 4
decoder_sm: 25
decoder_cf_scale: 8
sr_sm: 7
seed: 0
max_bsz: 4
INFO:root:Generation done. a black porcelain in the shape of pikachu -- 7.740963secs


Low-quality output is produced.

Can anybody help add the I2I function with a text description?

I noticed that in i2i.py, the function can receive a prompt, but in the Gradio UI there is only an image input.
I tried to add the args for that, but I'm not good at Gradio and failed; maybe somebody can help with that?

Also, batch_size is not a good option to tune: even a batch size of one needs a lot of CUDA memory, more than the 23GB of an NVIDIA 3090.
If there were a batch count option, the model would be perfect (a sketch of that idea follows below).
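A minimal sketch of the "batch count" idea from this issue, using the 🧨 diffusers pipeline shown earlier; this is an illustration, not an existing option in the repo:

```python
import torch
from diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16
).to("cuda")

prompt = "a black porcelain in the shape of pikachu"
images = []
# Emulate a batch count: run several single-image batches back-to-back,
# so peak VRAM stays at the one-image level.
for seed in range(4):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    images += pipe(prompt, num_images_per_prompt=1, generator=generator).images
```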

Dynamic Height / Width

Fantastic library - really appreciate all the work you all have done to provide such an amazing tool.

I have been poking around a bit with image interpolation and was curious whether there is a path to using this model to generate images of various sizes (instead of just 256x256).

I thought that I would be able to just hard-code a few parameters (e.g. decoder_latents, super_res_latents), but when I do this, I get something along the lines of:

Internal server error with unclip_images: Unexpected latents shape, got torch.Size([12, 3, 512, 512]), expected (12, 3, 256, 256)

This is because what you pass in is expected to match the UNet2DModels passed into super_res_first and super_res_last. This leads me to believe that I must be misunderstanding something, since it's unclear why these parameters would even be included if they are just going to be checked against the related models anyway.
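For context, here is a minimal sketch of the attempt described above (the prompt, shapes, and dtype are illustrative); the oversized latents trip the pipeline's shape check against its 256px super-resolution UNets:

```python
import torch
from diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16
).to("cuda")

# unCLIP works in pixel space, so the "latents" are image-shaped tensors.
super_res_latents = torch.randn(1, 3, 512, 512, dtype=torch.float16, device="cuda")

# Raises "Unexpected latents shape ... expected (1, 3, 256, 256)" because the
# latents are validated against the UNets in super_res_first/super_res_last.
images = pipe("a photo of a cat", super_res_latents=super_res_latents).images
```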

Any insight here is greatly appreciated.

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

I set os.environ["CUDA_VISIBLE_DEVICES"]="1" in components.py to use my second GPU, and when I try to create an image via the Gradio interface I get:
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Setting CUDA_VISIBLE_DEVICES=1 in the Anaconda prompt terminal before running the script doesn't fix the error.

And torch.cuda.set_device(1) after the torch import causes
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)

I've changed cpu to "cuda" or cuda() wherever "cpu" or cpu() appeared in the repo files, but the error is still thrown when I try to run the model
(Gradio started via python demo/product_demo.py --host 0.0.0.0 --port 8085 --root-dir .)

Any suggestions as to what I should change to get everything on the specified GPU?
Suggested solutions that would fit in under 24GB of VRAM?


edit:
I tried

```
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```

with

```python
# Move the timestep map onto the GPU and register it as a (non-persistent)
# buffer, so indexing it with CUDA timestep tensors no longer mixes devices.
timestep_map_tensor = th.tensor(timestep_map)
cuda_device = th.device("cuda")
timestep_map_tensor = timestep_map_tensor.to(cuda_device)
self.register_buffer("timestep_map", timestep_map_tensor, persistent=False)
```

in respace.py, and was able to load things, and the image started generating :)
but ran out of VRAM.


Full output:

Exception in thread Thread-2 (_sample):
Traceback (most recent call last):
  File "C:\Users\Jason\.conda\envs\karlo\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\Jason\.conda\envs\karlo\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Jason\Documents\machine_learning\image_ML\karlo\demo\components.py", line 174, in _sample
    for k, out in enumerate(output_generator):
  File "C:\Users\Jason\Documents\machine_learning\image_ML\karlo\karlo\sampler\t2i.py", line 116, in __call__
    img_feat = self._prior(
  File "C:\Users\Jason\.conda\envs\karlo\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Jason\Documents\machine_learning\image_ML\karlo\karlo\models\prior_model.py", line 127, in forward
    sample = sample_fn(
  File "C:\Users\Jason\Documents\machine_learning\image_ML\karlo\karlo\modules\diffusion\gaussian_diffusion.py", line 533, in p_sample_loop
    for sample in self.p_sample_loop_progressive(
  File "C:\Users\Jason\Documents\machine_learning\image_ML\karlo\karlo\modules\diffusion\gaussian_diffusion.py", line 584, in p_sample_loop_progressive
    out = self.p_sample(
  File "C:\Users\Jason\Documents\machine_learning\image_ML\karlo\karlo\modules\diffusion\gaussian_diffusion.py", line 483, in p_sample
    out = self.p_mean_variance(
  File "C:\Users\Jason\Documents\machine_learning\image_ML\karlo\karlo\modules\diffusion\respace.py", line 97, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "C:\Users\Jason\Documents\machine_learning\image_ML\karlo\karlo\modules\diffusion\gaussian_diffusion.py", line 338, in p_mean_variance
    model_output = model(x, t, **model_kwargs)
  File "C:\Users\Jason\Documents\machine_learning\image_ML\karlo\karlo\modules\diffusion\respace.py", line 108, in wrapped
    x, self.timestep_map[ts].to(device=ts.device, dtype=ts.dtype), **kwargs
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

When I use the pipeline example, I get this error:

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

NaN outputs

Hi!

Thanks a lot for your work on this, it's great.

However, I'm having trouble running the example.py script. All I get are tensors full of NaNs, returned at line 66 of example.py. Have you seen such an error before, and do you have an idea of how I might be able to fix this?

Thanks!

Licence update request

Could the clause "You shall undertake reasonable efforts to use the latest version of the Model." be removed from the license?
Ideally, people could use the version of the model that they find most useful or best for their purposes.

Perhaps the newer CreativeML Open RAIL++-M license, which does not contain that clause, could be used:
https://huggingface.co/stabilityai/stable-diffusion-2/blob/main/LICENSE-MODEL

And perhaps MIT for the surrounding code, or explicit permission to modify the code.
