afiaka87 / clip-guided-diffusion

A CLI tool/python module for generating images from text using guided diffusion and CLIP from OpenAI.

License: MIT License


clip-guided-diffusion's Introduction

CLIP Guided Diffusion

From @crowsonkb.

Disclaimer: I'm redirecting efforts to pyglide and may be slow to address bugs here.

I also recommend looking at @crowsonkb's v-diffusion-pytorch.

See captions and more generations in the Gallery.

Install

git clone https://github.com/afiaka87/clip-guided-diffusion.git
cd clip-guided-diffusion
git clone https://github.com/crowsonkb/guided-diffusion.git
pip3 install -e guided-diffusion
python3 setup.py install
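
Once installed, the cgd console script should be on your PATH; printing its help is a quick sanity check (the full flag reference is listed under Full Usage - CLI below):

cgd --help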

Run

cgd -txt "Alien friend by Odilon Redo"

A gif of the full run will be saved to ./outputs/caption_{j}.gif by default.

Alien friend by Odilon Redo

  • ./outputs will contain all intermediate outputs.
  • current.png will contain the current generation.
  • (optional) Provide --wandb_project <project_name> to enable logging intermediate outputs to wandb. Requires a free account; a URL to the run will be printed in the CLI.
  • ~/.cache/clip-guided-diffusion/ will contain checkpoints downloaded from OpenAI/Katherine Crowson.

Usage - CLI

Text to image generation

--prompts / -txts --image_size / -size

cgd --image_size 256 --prompts "32K HUHD Mushroom"

32K HUHD Mushroom

Text to image generation (multiple prompts with weights)

  • Multiple prompts can be specified with the | character.
  • You may optionally specify a weight for each prompt using a : character.
  • e.g. cgd --prompts "Noun to visualize:1.0|style:0.1|location:0.1|something you don't want:-0.1"
  • Weights must not sum to 0 (a minimal parsing sketch follows the example below).

cgd -txt "32K HUHD Mushroom|Green grass:-0.1"

CPU

  • Using a CPU will take a very long time compared to using a GPU.

cgd --device cpu --prompt "Some text to be generated"

CUDA GPU

cgd --prompt "Theres no need to specify a device, it will be chosen automatically"

Iterations/Steps (Timestep Respacing)

--timestep_respacing or -respace (default: 1000)

  • Uses fewer timesteps over the same diffusion schedule. Sacrifices accuracy/alignment for quicker runtime.
  • options: 25, 50, 150, 250, 500, 1000, ddim25, ddim50, ddim150, ddim250, ddim500, ddim1000
  • Prepending a number with ddim will use the DDIM scheduler; e.g. ddim25 will use the 25-timestep DDIM scheduler. This method may be better at shorter timestep_respacing values (see the sketch below).
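
As a rough illustration only (a simplification; guided-diffusion's real respacing logic lives in respace.py and also handles the ddimN strings), respacing keeps a subset of the original timesteps, spaced evenly across the schedule:

# Illustrative only: choose N evenly spaced timesteps out of the original schedule.
def respaced_timesteps(original_steps=1000, respacing=250):
    stride = original_steps / respacing
    return [round(i * stride) for i in range(respacing)]

print(len(respaced_timesteps(1000, 25)))  # 25
print(respaced_timesteps(1000, 25)[:4])   # [0, 40, 80, 120]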

Existing image

--init_image/-init

  • Blend an image with the diffusion for a number of steps.

--skip_timesteps/-skip

The number of timesteps to spend blending the image with the guided-diffusion samples. Must be less than --timestep_respacing and greater than 0. Good values using timestep_respacing of 1000 are 250 to 500.

  • -respace 1000 -skip 500
  • -respace 500 -skip 250
  • -respace 250 -skip 125
  • -respace 125 -skip 75

(optional) --init_scale/-is

To enable a VGG perceptual loss after the blending, you must specify an --init_scale value. 1000 seems to work well.

cgd --prompts "A mushroom in the style of Vincent Van Gogh" \
  --timestep_respacing 1000 \
  --init_image "images/32K_HUHD_Mushroom.png" \
  --init_scale 1000 \
  --skip_timesteps 350

Image size

  • options: 64, 128, 256, 512 pixels (square)
  • Note about 64x64: when using the 64x64 checkpoint, the cosine noise schedule is used. For unclear reasons, this noise schedule requires different values for --clip_guidance_scale and --tv_scale; both will require experimentation. I recommend starting with -cgs 5 -tvs 0.00001 and experimenting from there.
  • For all other checkpoints, --clip_guidance_scale seems to work well around 1000-2000 and --tv_scale at 0, 100, 150 or 200.
cgd --init_image=images/32K_HUHD_Mushroom.png \
    --skip_timesteps=500 \
    --image_size 64 \
    --prompt "8K HUHD Mushroom"

resized to 200 pixels for visibility

cgd --image_size 512 --prompt "8K HUHD Mushroom"

New: non-square generations (experimental). Generate portrait or landscape images by specifying a number to offset the width and/or height (a sketch of the resulting dimensions follows the example below).

  • offset should be a multiple of 16 for image sizes 64x64, 128x128
  • offset should be a multiple of 32 for image sizes 256x256, 512x512
  • may cause NaN/Inf errors.
  • a positive offset will require more memory.
  • a negative offset uses less memory and is faster.
my_caption="a photo of beautiful green hills and a sunset, taken with a blackberry in 2004"
cgd --prompts "$my_caption" \
    --image_size 128 \
    --width_offset 32 
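
For illustration, a hypothetical helper (not part of the cgd API) that computes the canvas size implied by --image_size plus the offsets, assuming the offset is simply added to the base size and enforcing the multiple-of-16/32 rule above:

# Hypothetical helper, not part of the cgd API: final (width, height) implied by
# --image_size plus --width_offset/--height_offset, with the multiple-of-16/32 check.
def offset_dimensions(image_size, width_offset=0, height_offset=0):
    multiple = 16 if image_size in (64, 128) else 32
    for name, offset in (("width_offset", width_offset), ("height_offset", height_offset)):
        if offset % multiple != 0:
            raise ValueError(f"{name} should be a multiple of {multiple} for image_size {image_size}")
    return image_size + width_offset, image_size + height_offset

print(offset_dimensions(128, width_offset=32))    # (160, 128)
print(offset_dimensions(256, height_offset=-64))  # (256, 192)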

Full Usage - Python

# Initialize diffusion generator
from pathlib import Path

from tqdm import tqdm
from cgd import clip_guided_diffusion

cgd_generator = clip_guided_diffusion(
    prompts=["an image of a fox in a forest"],
    image_prompts=["image_to_compare_with_clip.png"],
    batch_size=1,
    clip_guidance_scale=1500,
    sat_scale=0,
    tv_scale=150,
    init_scale=1000,
    range_scale=50,
    image_size=256,
    class_cond=False,
    randomize_class=False, # only works with class conditioned checkpoints
    cutout_power=1.0,
    num_cutouts=16,
    timestep_respacing="1000",
    seed=0,
    diffusion_steps=1000, # don't change this
    skip_timesteps=400,
    init_image="image_to_blend_and_compare_with_vgg.png",
    clip_model_name="ViT-B/16",
    dropout=0.0,
    device="cuda",
    prefix_path="store_images/",
    wandb_project=None,
    wandb_entity=None,
    progress=True,
)
prefix_path = Path("store_images/")  # the same directory passed as prefix_path above
prefix_path.mkdir(exist_ok=True)
list(enumerate(tqdm(cgd_generator)))  # iterate over the generator

Full Usage - CLI

usage: cgd [-h] [--prompts PROMPTS] [--image_prompts IMAGE_PROMPTS]
           [--image_size IMAGE_SIZE] [--init_image INIT_IMAGE]
           [--init_scale INIT_SCALE] [--skip_timesteps SKIP_TIMESTEPS]
           [--prefix PREFIX] [--checkpoints_dir CHECKPOINTS_DIR]
           [--batch_size BATCH_SIZE]
           [--clip_guidance_scale CLIP_GUIDANCE_SCALE] [--tv_scale TV_SCALE]
           [--range_scale RANGE_SCALE] [--sat_scale SAT_SCALE] [--seed SEED]
           [--save_frequency SAVE_FREQUENCY]
           [--diffusion_steps DIFFUSION_STEPS]
           [--timestep_respacing TIMESTEP_RESPACING]
           [--num_cutouts NUM_CUTOUTS] [--cutout_power CUTOUT_POWER]
           [--clip_model CLIP_MODEL] [--uncond]
           [--noise_schedule NOISE_SCHEDULE] [--dropout DROPOUT]
           [--device DEVICE] [--wandb_project WANDB_PROJECT]
           [--wandb_entity WANDB_ENTITY] [--height_offset HEIGHT_OFFSET]
           [--width_offset WIDTH_OFFSET] [--use_augs] [--use_magnitude]
           [--quiet]

optional arguments:
  -h, --help            show this help message and exit
  --prompts PROMPTS, -txts PROMPTS
                        the prompt/s to reward paired with weights. e.g. 'My
                        text:0.5|Other text:-0.5' (default: )
  --image_prompts IMAGE_PROMPTS, -imgs IMAGE_PROMPTS
                        the image prompt/s to reward paired with weights. e.g.
                        'img1.png:0.5,img2.png:-0.5' (default: )
  --image_size IMAGE_SIZE, -size IMAGE_SIZE
                        Diffusion image size. Must be one of [64, 128, 256,
                        512]. (default: 128)
  --init_image INIT_IMAGE, -init INIT_IMAGE
                        Blend an image with diffusion for n steps (default: )
  --init_scale INIT_SCALE, -is INIT_SCALE
                        (optional) Perceptual loss scale for init image.
                        (default: 0)
  --skip_timesteps SKIP_TIMESTEPS, -skip SKIP_TIMESTEPS
                        Number of timesteps to blend image for. CLIP guidance
                        occurs after this. (default: 0)
  --prefix PREFIX, -dir PREFIX
                        output directory (default: outputs)
  --checkpoints_dir CHECKPOINTS_DIR, -ckpts CHECKPOINTS_DIR
                        Path subdirectory containing checkpoints. (default:
                        /home/samsepiol/.cache/clip-guided-diffusion)
  --batch_size BATCH_SIZE, -bs BATCH_SIZE
                        the batch size (default: 1)
  --clip_guidance_scale CLIP_GUIDANCE_SCALE, -cgs CLIP_GUIDANCE_SCALE
                        Scale for CLIP spherical distance loss. Values will
                        need tinkering for different settings. (default: 1000)
  --tv_scale TV_SCALE, -tvs TV_SCALE
                        Controls the smoothness of the final output. (default:
                        150.0)
  --range_scale RANGE_SCALE, -rs RANGE_SCALE
                        Controls how far out of RGB range values may get.
                        (default: 50.0)
  --sat_scale SAT_SCALE, -sats SAT_SCALE
                        Controls how much saturation is allowed. Used for
                        ddim. From @nshepperd. (default: 0.0)
  --seed SEED, -seed SEED
                        Random number seed (default: 0)
  --save_frequency SAVE_FREQUENCY, -freq SAVE_FREQUENCY
                        Save frequency (default: 1)
  --diffusion_steps DIFFUSION_STEPS, -steps DIFFUSION_STEPS
                        Diffusion steps (default: 1000)
  --timestep_respacing TIMESTEP_RESPACING, -respace TIMESTEP_RESPACING
                        Timestep respacing (default: 1000)
  --num_cutouts NUM_CUTOUTS, -cutn NUM_CUTOUTS
                        Number of randomly cut patches to distort from
                        diffusion. (default: 16)
  --cutout_power CUTOUT_POWER, -cutpow CUTOUT_POWER
                        Cutout size power (default: 1.0)
  --clip_model CLIP_MODEL, -clip CLIP_MODEL
                        clip model name. Should be one of: ('ViT-B/16',
                        'ViT-B/32', 'RN50', 'RN101', 'RN50x4', 'RN50x16') or a
                        checkpoint filename ending in `.pt` (default:
                        ViT-B/32)
  --uncond, -uncond     Use finetuned unconditional checkpoints from OpenAI
                        (256px) and Katherine Crowson (512px) (default: False)
  --noise_schedule NOISE_SCHEDULE, -sched NOISE_SCHEDULE
                        Specify noise schedule. Either 'linear' or 'cosine'.
                        (default: linear)
  --dropout DROPOUT, -drop DROPOUT
                        Amount of dropout to apply. (default: 0.0)
  --device DEVICE, -dev DEVICE
                        Device to use. Either cpu or cuda. (default: )
  --wandb_project WANDB_PROJECT, -proj WANDB_PROJECT
                        Name W&B will use when saving results. e.g.
                        `--wandb_project "my_project"` (default: None)
  --wandb_entity WANDB_ENTITY, -ent WANDB_ENTITY
                        (optional) Name of W&B team/entity to log to.
                        (default: None)
  --height_offset HEIGHT_OFFSET, -ht HEIGHT_OFFSET
                        Height offset for image (default: 0)
  --width_offset WIDTH_OFFSET, -wd WIDTH_OFFSET
                        Width offset for image (default: 0)
  --use_augs, -augs     Uses augmentations from the `quick` clip guided
                        diffusion notebook (default: False)
  --use_magnitude, -mag
                        Uses magnitude of the gradient (default: False)
  --quiet, -q           Suppress output. (default: False)

Development

git clone https://github.com/afiaka87/clip-guided-diffusion.git
cd clip-guided-diffusion
git clone https://github.com/afiaka87/guided-diffusion.git
python3 -m venv cgd_venv
source cgd_venv/bin/activate
pip install -r requirements.txt
pip install -e guided-diffusion

Run integration tests

  • Some tests require a GPU; you may ignore them if you don't have one.
python -m unittest discover

clip-guided-diffusion's People

Contributors

afiaka87, gitter-badger, hiandrewquinn


clip-guided-diffusion's Issues

Link to original works

The original "cutouts" method goes back to The Big Sleep by advadvnoun.

Katherine Crowson @crowsonkb provided the implementation for handling the guided-diffusion scenario.

Katherine's original works may be found at the following link. They include her other work with VQGAN and the discrete VAE from OpenAI (used to discretize vision tokens in DALL-E).

https://github.com/EleutherAI/vqgan-clip

Issue #20 still not working.

Still does not work. See the context in the original issue.

ResizeRight expects either a NumPy array or a torch tensor, but here it gets a PIL image, which does not have a shape attribute.

pil_img = Image.open(script_util.fetch(image)).convert('RGB')
smallest_side = min(diffusion_size, *pil_img.size)
pil_img = resize_right.resize(pil_img, out_shape=[smallest_side],
                              interp_method=lanczos3, support_sz=None,
                              antialiasing=True, by_convs=False, scale_tolerance=None)
batch = make_cutouts(tvf.to_tensor(pil_img).unsqueeze(0).to(device))

This is what I tried, and at least it runs without an error:

   t_img = tvf.to_tensor(pil_img)
   t_img = resize_right.resize(t_img, out_shape=(smallest_side, smallest_side),
                                 interp_method=lanczos3, support_sz=None,
                                 antialiasing=True, by_convs=False, scale_tolerance=None)
   batch = make_cutouts(t_img.unsqueeze(0).to(device)) 

I am not sure what was intended here as to the output shape. As it was, it produced a 1024x512 image from a 1024x1024 original for image_size 512; now this makes 512x512.

I am not using offsets, BTW.

As to the images produced, I can't see much happening when using image prompts, but I guess that is another story. In my experience, guidance by comparing CLIP-encoded images is not very useful on its own, so I'll probably go my own way and add other kinds of image-based guidance. This might depend on the kind of images I work with and how: more visuality than semantics.

PS. I see now that the init image actually means using perceptual losses as guidance, rather than initialising something (like one can do with VQGAN latents, for instance). So that's more like what I am after.

Originally posted by @htoyryla in #20 (comment)

image_prompts is mistakenly set to text prompts

Hi, thanks for putting this project out there; I am having fun playing with it. I am using it from the command line. I tried to set the --image_prompts argument, but it would fail at the beginning. For example, my command would be:

cgd --image_prompts='images/32K_HUHD_Mushroom.png' --skip_timesteps=500 --image_size 256 --prompt "8K HUHD Mushroom"

And I'd get the output:

Given initial image: 
Using:
===
CLIP guidance scale: 1000 
TV Scale: 100.0
Range scale: 50.0
Dropout: 0.0.
Number of cutouts: 48 number of cutouts.
0it [00:00, ?it/s]
Using device cuda. You can specify a device manually with `--device/-dev`
0it [00:04, ?it/s]
/usr/lib/python3/dist-packages/apport/report.py:13: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import fnmatch, glob, traceback, errno, sys, atexit, locale, imp, stat
Traceback (most recent call last):
  File "/usr/local/bin/cgd", line 33, in <module>
    sys.exit(load_entry_point('cgd-pytorch==0.1.5', 'console_scripts', 'cgd')())
  File "/home/milhouse/.local/lib/python3.9/site-packages/cgd/cgd.py", line 385, in main
    list(enumerate(tqdm(cgd_generator))) # iterate over generator
  File "/home/milhouse/.local/lib/python3.9/site-packages/tqdm/std.py", line 1127, in __iter__
    for obj in iterable:
  File "/home/milhouse/.local/lib/python3.9/site-packages/cgd/cgd.py", line 167, in clip_guided_diffusion
    image_prompt, batched_weight = encode_image_prompt(img, weight, image_size, num_cutouts=num_cutouts, clip_model_name=clip_model_name, device=device)
  File "/home/milhouse/.local/lib/python3.9/site-packages/cgd/cgd.py", line 97, in encode_image_prompt
    pil_img = Image.open(fetch(image)).convert('RGB')
  File "/home/milhouse/.local/lib/python3.9/site-packages/cgd/cgd.py", line 76, in fetch
    return open(url_or_path, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '8K HUHD Mushroom'

So, on this line here:

image_prompts = args.prompts.split('|')

I think you meant to write: image_prompts = args.image_prompts.split('|')

That seemed to fix the problem for me.

TypeError got an unexpected keyword argument 'custom_classes'

Running on Windows, we always get the following crash:

Loading model from: C:\Program Files\Python39\lib\site-packages\lpips-0.1.4-py3.9.egg\lpips\weights\v0.1\vgg.pth
0it [00:07, ?it/s]
Traceback (most recent call last):
  File "C:\Program Files\Python39\Scripts\cgd-script.py", line 33, in <module>
    sys.exit(load_entry_point('cgd-pytorch==0.1.5', 'console_scripts', 'cgd')())
  File "C:\Program Files\Python39\lib\site-packages\cgd_pytorch-0.1.5-py3.9.egg\cgd\cgd.py", line 385, in main
    list(enumerate(tqdm(cgd_generator))) # iterate over generator
  File "C:\Users\david\AppData\Roaming\Python\Python39\site-packages\tqdm\std.py", line 1185, in __iter__
    for obj in iterable:
  File "C:\Program Files\Python39\lib\site-packages\cgd_pytorch-0.1.5-py3.9.egg\cgd\cgd.py", line 243, in clip_guided_diffusion
    cgd_samples = diffusion_sample_loop(
TypeError: ddim_sample_loop_progressive() got an unexpected keyword argument 'custom_classes'

At certain resolutions like 128, 256, and 512, this crash occurs after all of the image generation iterations are completed, but before any files are saved.

Is it possible to either fix this or disable the LPIPS loss? It isn't actually required for image generation, right?

What's the meaning of this equation in cond_fn (from cgd.py)

In cgd.py, in cond_fn(x, t, out, y=None):

fac = diffusion.sqrt_one_minus_alphas_cumprod[current_timestep]
sigmas = 1 - fac
x_in = out["pred_xstart"] * fac + x * sigmas

out["pred_xstart"] is the predicted x0.
x is the current xt.

What is the meaning of x_in?
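
One possible reading (a hedged sketch, not the maintainer's answer): since sigmas = 1 - fac, x_in is a convex combination of the predicted clean image and the current noisy sample. Early in sampling fac is close to 1, so the blend leans on pred_xstart; near the end it leans on x itself, and this blended image is what gets passed to CLIP for the guidance gradient. A minimal restatement, with assumed names rather than the repo's code:

# Hedged restatement of the quoted lines; the function and argument names are assumptions.
def blend_for_clip(pred_xstart, x_t, sqrt_one_minus_alpha_bar):
    fac = sqrt_one_minus_alpha_bar   # ~1 when x_t is very noisy, ~0 near the end of sampling
    sigmas = 1.0 - fac               # fac + sigmas == 1, so the result is a convex combination
    return pred_xstart * fac + x_t * sigmas  # the image CLIP scores when computing guidance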

AMD Pytorch ROCM support

@crowsonkb has informed me this should work but I can't currently test it as I don't have an AMD GPU.

I'd greatly appreciate it if someone with that type of setup could let me know what their experience is using this codebase.

Tensor is not a torch image

During the execution I get the following error:

TypeError: tensor is not a torch image.

MacBook-Pro-3 clip-guided-diffusion % cgd --prompts "A mushroom in the style of Vincent Van Gogh" \ 
  --timestep_respacing 1000 \
  --init_image "images/32K_HUHD_Mushroom.png" \
  --init_scale 1000 \
  --skip_timesteps 350
Using device cpu. You can specify a device manually with `--device/-dev`
--wandb_project not specified. Skipping W&B integration.
Loading clip model	ViT-B/32	on device	cpu.
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Downloading: "https://download.pytorch.org/models/vgg16-397923af.pth" to /Users/.cache/torch/hub/checkpoints/vgg16-397923af.pth
100%|███████████████████████████████████████████████████████████████████████████| 528M/528M [01:19<00:00, 6.96MB/s]
Loading model from: /Library/Python/3.8/site-packages/lpips-0.1.4-py3.8.egg/lpips/weights/v0.1/vgg.pth
  0%|                                                                                      | 0/650 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/cgd", line 33, in <module>
    sys.exit(load_entry_point('cgd-pytorch==0.2.5', 'console_scripts', 'cgd')())
  File "/Library/Python/3.8/site-packages/cgd_pytorch-0.2.5-py3.8.egg/cgd/cgd.py", line 357, in main
  File "/Library/Python/3.8/site-packages/cgd_pytorch-0.2.5-py3.8.egg/cgd/cgd.py", line 223, in clip_guided_diffusion
  File "/Users/Developement/dream-visual/clip-guided-diffusion/guided-diffusion/guided_diffusion/gaussian_diffusion.py", line 637, in p_sample_loop_progressive
    out = sample_fn(
  File "/Users/Developement/dream-visual/clip-guided-diffusion/guided-diffusion/guided_diffusion/gaussian_diffusion.py", line 522, in p_sample_with_grad
    out["mean"] = self.condition_mean_with_grad(
  File "/Users/Developement/dream-visual/clip-guided-diffusion/guided-diffusion/guided_diffusion/gaussian_diffusion.py", line 380, in condition_mean_with_grad
    gradient = cond_fn(x, t, p_mean_var, **model_kwargs)
  File "/Library/Python/3.8/site-packages/cgd_pytorch-0.2.5-py3.8.egg/cgd/cgd.py", line 150, in cond_fn
  File "/Library/Python/3.8/site-packages/torchvision-0.2.2.post3-py3.8.egg/torchvision/transforms/transforms.py", line 163, in __call__
    return F.normalize(tensor, self.mean, self.std, self.inplace)
  File "/Library/Python/3.8/site-packages/torchvision-0.2.2.post3-py3.8.egg/torchvision/transforms/functional.py", line 201, in normalize
    raise TypeError('tensor is not a torch image.')
TypeError: tensor is not a torch image.

GIFs are pixelated

Originally went with GIF as it meant not adding a dependency on ffmpeg. The outputs aren't very good quality for whatever reason. Rather than mess with fixing an outdated format, I'm just going to require ffmpeg to be installed locally, perhaps with a message to the user if the binary isn't on their PATH.
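
A minimal sketch of that PATH check (shutil.which is the standard-library lookup; the message text is just an example):

import shutil

# Warn if the ffmpeg binary is not available on PATH.
if shutil.which("ffmpeg") is None:
    print("ffmpeg was not found on your PATH; install it to enable video output.")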

Noisy outputs

Generations with the Colab notebook are currently quite noisy. I'm looking into this. For now, it's best to just use one of Katherine's official notebooks.

how to use cgd_util.txt_to_dir?

Hi, I'm trying to use cgd_util.txt_to_dir in my colab to clean up the directory names. Do you have any advice on bringing this into River's original colab?

ModuleNotFoundError: No module named 'cgd'

Many thanks
