
miccunifi / ladi-vton

377 stars, 48 forks, 1.57 MB

This is the official repository for the paper "LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On". ACM Multimedia 2023

License: Other

Python 100.00%
acmmm acmmm2023 dresscode fashionai generative-model latent-diffusion-models stable-diffusion textual-inversion virtual-tryon viton-hd

ladi-vton's People

Contributors

abaldrati, giuseppecartella, marcellacornia, omedivad


ladi-vton's Issues

training code

Hi,

Thanks for your great work and congrats on your acceptance.

May I know when you will release the training code?

How to use this on a custom dataset?

Hey, I am trying to use the model on a custom dataset. As you can see, I ran inference on this image, but the result was not what I expected. How can I use this model on custom data? Do I need to fine-tune on a specific individual so the model understands their body type?

(attached images: vton, vton_gen)

[Colab Guide] Drive Link for Quick Inference on custom data with LADI-VTON using Colab

I have made a Colab preprocessing pipeline for Ladi-VTON which can run inference on custom data using the DressCode model.

Here is my Drive link: https://drive.google.com/drive/folders/19XL0kvTw6SoCCAOJY9FgvuQJ9M_JAZHt?usp=sharing
You will first need to copy the folder into your own Google Drive under the same name, and use a GPU runtime on Colab.

I have made the pre-processing usable for the DressCode dataset.
Keep your input images in the /images folder and write the test pairs properly (see the sketch below).
After running ladi-vton_DressCode.ipynb, the input folder for inference will be generated automatically.
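A minimal sketch of writing the pairs file, assuming the DressCode convention of one "<model_image> <garment_image>" pair per line (the filenames below are made up):

# Hedged sketch: write a DressCode-style pairs file; filenames are hypothetical.
pairs = [("000001_0.jpg", "000002_1.jpg")]
with open("test_pairs_paired.txt", "w") as f:
    for model_img, garment_img in pairs:
        f.write(f"{model_img} {garment_img}\n")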

When running inference on custom data, ladi-vton messes up the faces, so I have made a refinement notebook using Google MediaPipe just for this purpose.
The intermediate results after inference are in the results folder, and the final results after refinement will be in the final folder.

I have used this exact Drive to generate some results and it mostly works; there are problems with a few specific garments.
I am thinking of moving the Drive into a GitHub repo after some time. If you have any doubts or suggestions, feel free to post them.

Asking about training GPU specs

Thank you for sharing your work. I'm now trying to train your module on 8 GPUs with 15 GB of memory each, but I get an OOM error. Can you share your training GPU specs? How many GPUs did you use to train the VTO module?
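For reference, the training commands further down this page show memory-saving flags the repo's scripts accept; a hedged example combining them with a small batch (paths and output dir are placeholders):

python src/train_vto.py --dataset vitonhd --vitonhd_dataroot data/viton-hd/ --output_dir checkpoints/vto --gradient_checkpointing --enable_xformers_memory_efficient_attention --allow_tf32 --train_batch_size 1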

Some questions about the training process

Thanks to the authors for such influential work.

  1. I would like to confirm whether the EMASC module is trained with a reconstruction loss, as the article mentions L1 and VGG losses. How is its input constructed: is it the model image I, mapped from I to \tilde{I}?

  2. During training of the enhanced Stable Diffusion pipeline, is the model image I used as input? I ask because a sampling operation on the model image I appears in Equation 3 and Equation 4.

  3. I am curious about how the whole training process of diffusion-based models actually works. Not limited to this work, I have similar confusion about other work and hope to get help from the authors here. My understanding of the diffusion mechanism is that it fits the distribution of the dataset, so what is the underlying principle when it is applied to the try-on task? What is it fitting in a given task? In other words, similar to the second question, how is the ground-truth picture, i.e. the model picture I, used effectively (see the sketch below)?
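A minimal sketch of the standard latent-diffusion training step may help here: the ground-truth model image I enters only as the clean latent that gets noised, and the network is trained to predict that noise (a generic LDM recipe under assumed checkpoints, not necessarily this repo's exact code):

import torch
from diffusers import AutoencoderKL, DDPMScheduler

# Hedged sketch: the ground-truth image I supplies the clean latents z0.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
scheduler = DDPMScheduler(num_train_timesteps=1000)
I = torch.rand(1, 3, 512, 384)                       # placeholder model image in [0, 1]
z0 = vae.encode(I * 2 - 1).latent_dist.sample() * vae.config.scaling_factor
noise = torch.randn_like(z0)
t = torch.randint(0, scheduler.config.num_train_timesteps, (z0.shape[0],))
zt = scheduler.add_noise(z0, noise, t)               # forward (noising) process
# The UNet then predicts the noise from (zt, t, conditioning), and the loss
# pulls the prediction toward the noise actually added to the latents of I:
# loss = F.mse_loss(unet(zt, t, encoder_hidden_states=cond).sample, noise)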

issue with yml file

Thanks for your great work. I'm having a hard time building the environment on Windows, and I noticed your yml file was built for Linux. Could you upload a yml file for Windows? Thanks in advance.

Bad Result on custom image from DressCode Dataset

Hi Folks,

I tried inference on a single image taken from DressCode, with all the preprocessed data coming from the original source data itself with minor tweaks, and I am getting unexpected results.

Even when I do the preprocessing myself, the results are similar. I have attached input and output images for reference (the pose map has 18 channels, so I couldn't visualize it properly here).

(attached images: input and output)

Can anyone help me here?

Poor result on KID score

Hey, great project!
I want to replicate the results on the paired dataset. The other metric scores match the paper, but my KID score is much smaller than the one reported:
(attached screenshot)

Could you possibly assist or provide any guidance to address this issue? Thanks in advance.

Asking

I was just wondering when the training code is going to be released. Thank you in advance.

Real world results

Hi, thank you for such nice work. I was wondering, have you tried your model on real-world data outside of the datasets mentioned in the paper, for both the target model and the garment?

Also, a question regarding training: did you merge those datasets for training, or did you train separate models for each dataset?

Bad Generated Images

Hi,
Appreciate the great work and contribution.
I tested the ladi-vton model on a large number of images from the VITON-HD dataset. I am sharing below some cases that do not work properly:

  1. If the model image I is wearing a full-sleeve garment and I try to replace it with a sleeveless or half-sleeve one, part of the sleeve remains on the body [image 2].
  2. It tries to inpaint the masked region exactly, which doesn't look right in some images; the masked region is visible at the bottom when the style is not an in-shirt type.
  3. Occlusions are not handled well: images with occlusion are not generated properly, and the results are in fact very distorted [image 1].
  4. And yes, the texture of the garment image is not preserved properly [image 4].

What could be the reasons for these problems, and would fine-tuning or training on a larger number of sleeve/sleeveless combinations resolve them?

(attached images 1-4)

How can I run this project on an M1 Mac?

I followed all the instructions provided by this project, and I get this error:

ValueError: torch.cuda.is_available() should be True but is False. xformers' memory efficient attention is only available for GPU

After I removed the --enable_xformers_memory_efficient_attention argument from my command, the error changed to:

...src/inference.py", line 226, in main
    generator = torch.Generator("cuda").manual_seed(args.seed)
                ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device type CUDA is not supported for torch.Generator() api.

I searched for that error and found the PyTorch MPS documentation. I changed the code at src/inference.py:226 to:

generator = torch.Generator("mps").manual_seed(args.seed)

So maybe I've been doing something wrong, because that didn't work either. I was going to try without a GPU, but I haven't yet. Is there a way to disable CUDA/the GPU?
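A minimal device-selection sketch, assuming a recent PyTorch (torch.Generator on "mps" is version-dependent, and a CPU generator is a safe fallback that diffusers pipelines also accept):

import torch

# Hedged sketch: pick a generator device at runtime instead of hard-coding "cuda".
if torch.cuda.is_available():
    gen_device = "cuda"
elif torch.backends.mps.is_available():
    gen_device = "mps"
else:
    gen_device = "cpu"
try:
    generator = torch.Generator(gen_device).manual_seed(42)  # args.seed in the script
except RuntimeError:
    generator = torch.Generator("cpu").manual_seed(42)       # safe fallback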

Thank you

My environment:
Macbook Pro 14-inch, 2021
chip: Apple M1 Pro
os: 13.6 (22G120)
python: 3.11

Issue with training VTO & Inversion Adapter

Hi,

I'm trying to train all the models with 1024x768 images. I managed to train TPS & EMASC at this resolution with some code modifications, and training works well according to the metrics and visual results.

But it doesn't work at all for the inversion adapter and VTO. Neither training shows any loss reduction (close to constant with hard smoothing on wandb, and very oscillating without smoothing).
(attached screenshot)
I also tested with the 512x384 shape and it gives the same results.
Is this an expected result?

I'm using the default parameters except batch_size = 8 for VTO and batch_size = 1 for the inversion adapter, on a single A100 GPU. I assume that a value greater than 1 could prevent this training issue, but my hardware doesn't allow a bigger one 😞
I tried reducing the learning rate, but it results in the same issue.

Commands used to train the inversion adapter and VTO:

  • python src/train_inversion_adapter.py --dataset vitonhd --vitonhd_dataroot data/viton-hd/ --output_dir checkpoints/inverter_1024 --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features --allow_tf32 --pretrained_model_name_or_path pretrained_models/stable-diffusion-2-inpainting/ --height 1024 --width 768 --train_batch_size 1 --test_batch_size 1

  • python src/train_vto.py --dataset vitonhd --vitonhd_dataroot data/viton-hd/ --output_dir checkpoints/vto_1024 --inversion_adapter_dir checkpoints/inverter_1024/ --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features --height 1024 --width 768 --train_batch_size 8 --test_batch_size 8 --allow_tf32
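If the scripts are built on Hugging Face accelerate like the diffusers training examples (an assumption worth checking in the argparser), gradient accumulation can simulate a larger effective batch with batch_size = 1. A generic sketch of the pattern, not the repo's actual loop:

import torch
from accelerate import Accelerator

# Hedged sketch: accumulate gradients over 8 steps to emulate batch size 8.
accelerator = Accelerator(gradient_accumulation_steps=8)
model = torch.nn.Linear(4, 1)                       # toy stand-ins
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 4), torch.randn(64, 1)),
    batch_size=1)
model, optimizer, data = accelerator.prepare(model, optimizer, data)
for x, y in data:
    with accelerator.accumulate(model):             # optimizer steps every 8th batch
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()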

Could you please help me resolve this problem?
Thanks for your clean work, by the way :)

Does not work well on text or letters

By running the command:
python src/inference.py --dataset vitonhd --vitonhd_dataroot zalando-hd-resized --output_dir output --test_order paired --batch_size 1 --mixed_precision fp16
I found that it does not work well on text or letters, as in these bad cases:
(attached images)
This phenomenon is not mentioned in your paper; is there any way to fix it?

Questions about the KID metric

Hi, thank you for your great work!

After running inference on the VITON-HD dataset with your released model, I got KID_p = 0.0015 (1.08 in your paper) and KID_u = 0.0018 (1.60 in your paper). Why such a big difference?
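One hedged possibility: libraries such as torchmetrics return the raw KID, while papers often report it scaled (e.g. x100 or x1000), which would roughly bridge 0.0018 vs 1.60. A minimal sketch with placeholder images:

import torch
from torchmetrics.image.kid import KernelInceptionDistance

# Hedged sketch: compare raw vs scaled KID; replace the random tensors
# with your real/generated image batches.
kid = KernelInceptionDistance(subset_size=50, normalize=True)
real = torch.rand(100, 3, 299, 299)   # float images in [0, 1]
fake = torch.rand(100, 3, 299, 299)
kid.update(real, real=True)
kid.update(fake, real=False)
kid_mean, kid_std = kid.compute()
print(kid_mean, kid_mean * 1000)      # x1000 is one common reporting scale (assumption)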

Thanks for any advice.

Out of memory when training the emasc module

Hello, I'm a beginner in artificial intelligence. Your work is very good and I'm very interested in it, but while trying to reproduce it and train the EMASC module I always get an "out of memory" error. Do you have any suggestions?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 23.69 GiB total capacity; 22.12 GiB already allocated; 103.25 MiB free; 22.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
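Besides the --gradient_checkpointing flag shown in the training commands elsewhere on this page, one hedged option is the allocator hint the error message itself suggests (the 128 MiB value below is an arbitrary example, not a repo default):

import os

# Hedged sketch: set the allocator hint before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch  # import after setting the environment variable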

Could not get good results

Thanks for your great work. I want to test your model trained on DressCode; however, I don't have the DressCode dataset, so I used vitonhd.py to load VITON-HD training data with the model pretrained on DressCode. Unluckily, the sleeve goes over the original boundary. Can you give me some advice?
(attached image)

Another question: when I test your code on VITON-HD (vitonhd.py to load data, model pretrained on VITON-HD), I find that the sleeve does not match my human parsing. Can I add depth/Canny control to your diffusion model?
(attached image)

Thanks for any advice.

Problem with training

Hi, thank you for your great work!

I was trying to write training code and do some training, but I was confused by this passage: "We first train the EMASC modules, the textual-inversion adapter, and the warping component. Then, we freeze all the weights of all modules except for the textual-inversion adapter and train the proposed enhanced Stable Diffusion pipeline" in Section 4.2. Should I first freeze the other weights, including the UNet, and train only the textual-inversion adapter, or should I freeze the other weights and train the textual-inversion adapter and the UNet together?
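Whichever reading is correct, the generic PyTorch pattern for freezing one set of modules while training another is the same; a minimal sketch (which modules go in which set is exactly the open question):

import torch

# Hedged sketch: toggle requires_grad per module.
def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# e.g. freeze the VAE, EMASC, and warping modules, then enable the
# inversion adapter (and possibly the UNet, per the question above).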

Poor result

I have tested on the VITON-HD dataset and am getting very poor results; see the command below:

python3 src/inference.py --dataset vitonhd --vitonhd_dataroot /content/VITON-HD --output_dir ./ --test_order unpaired --category all --batch_size 8 --mixed_precision fp16 --num_workers 8

(attached result images: 01265_00, 08646_00, 11078_00)

Use lower_body, dresses and all

Thanks for your project! How can I use it to try on bottoms (pants, skirts) or dresses? I tested on the VITON-HD dataset and only got tops.

I can't run the VITON dataset.

Error Message:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 76: character maps to <undefined>

Not realistic

Hi, I tried using this model but the generated result is not realistic. Do you know why this could be? The upper-body shirt here was tried on using the model; it is this shirt:

(attached images: garment and result)

How to maintain details when warping?

Hi, I found that a lot of information (such as logos and patterns) is lost in the results after warping. It seems that without refinement, using only TPS, more details would be retained. Have the authors done any ablation experiments training with only TPS?

Time and hardware for inference and training

Hi! Congrats on the excellent paper!

Could you tell me how much time it takes to run an inference and on which hardware?

Also, with which hardware did you train and for how long?

Thanks!!

try on tattoos

Can I use this project to try on tattoos? If yes, what do I need to do?

Training Code

Hello! First, I want to congratulate you on these amazing results. I also wanted to ask when the training code will be available. Additionally, is this similar to fine-tuning the Stable Diffusion model, but on multiple concepts?

Working of ladi-vton with the DressCode dataset

The model trained on VITON-HD is amazing; it is definitely one of the best.
Is it just me, or are the results generated using DressCode nowhere near those of the VITON-HD model?
The model not being able to handle textual details on t-shirts is understandable.
But I have written a pre-processing pipeline in Colab and run inference on custom data, and even a small miss in the parsing generates a totally bad image. I am now getting perfect pre-processing, with the size and positioning of the data also right, but the DressCode model simply does not work with faces: all the faces in the final results are distorted, as attached below. Aren't the EMASC modules supposed to restore the face?

Is there anything that can be done to solve this issue, or is it not possible?

(attached images: res1, res2, res3)

Poor results

Hi, I've been attempting to replicate the results you demonstrated in Figure 7 of your paper, but the outcome is not as presented: specifically, the pattern on the t-shirt is not being reproduced.
(attached image)
Here's the same garment I found in zalando-hd:
(attached garment image: 00579_00)
Here is the result when I run your code:
(attached result: 00654_00_00579_00)
Could you possibly assist or provide any guidance to address this issue? Thanks in advance.

Clothing replaced by same clothing

We're running inference to recreate the paper's results using the VITON-HD dataset (test_pairs.txt in our conda environment); however, the results appear to be only slightly modified versions of the original clothing. It looks as if the image goes through a diffusion pass but the new clothing is not applied.

Running on a Windows 11 PC with an RTX 4090, following the default settings/commands provided.

(attached images: original_image, clothing, final_image)

Could I ask you for some advice?

I want to use a pretrained large model, but its input requirements are generally square, while human-body images are generally rectangular. How do I process the images to meet the needs of the pretrained model? Simply padding with blanks seems to make the whole image sparser.
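A minimal letterboxing sketch, which pads onto a square canvas instead of stretching (white fill is an arbitrary choice; some models prefer the dataset's background color):

from PIL import Image

# Hedged sketch: center a rectangular photo on a square canvas.
def letterbox_square(img: Image.Image, fill=(255, 255, 255)) -> Image.Image:
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas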

Outdated arguments in function

The Encoder calls get_down_block here with the parameter attn_num_head_channels, but get_down_block no longer has such a parameter in newer versions of the diffusers library.
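Pinning diffusers to the version in the repo's requirements is the simplest fix; otherwise, a hedged runtime shim can resolve whichever keyword the installed version accepts (the candidate names and import paths below are assumptions to be checked against your diffusers version):

import inspect

try:
    from diffusers.models.unet_2d_blocks import get_down_block
except ImportError:  # newer releases moved the module (assumption)
    from diffusers.models.unets.unet_2d_blocks import get_down_block

# Hedged sketch: find which attention-head keyword this version accepts.
params = inspect.signature(get_down_block).parameters
head_kwarg = next(name for name in
                  ("attn_num_head_channels", "num_attention_heads", "attention_head_dim")
                  if name in params)
# then: get_down_block(..., **{head_kwarg: n_heads})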

how to prepare text prompt

Hi, thank you for your nice work. But I would like to ask how to obtain text prompt for training. It seems the VITON-HD dataset did not provide text prompt.

Questions about extending the first convolutional layer

Congrats on your work! In the paper, you mentioned that:

we propose to extend the kernel channels of the first convolutional layer by adding zero initialized weights to match the new input channel dimension
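A hedged sketch of what that zero-initialized extension could look like on a bare Conv2d (the shapes are placeholders, not the repo's code):

import torch

# Hedged sketch: widen the first conv to accept extra input channels,
# zero-initializing the new kernel slices so pretrained behaviour is
# unchanged at the start of training.
old = torch.nn.Conv2d(9, 320, kernel_size=3, padding=1)   # placeholder conv_in
extra = 4                                                 # hypothetical extra channels
new = torch.nn.Conv2d(old.in_channels + extra, old.out_channels,
                      kernel_size=3, padding=1)
with torch.no_grad():
    new.weight.zero_()
    new.weight[:, :old.in_channels] = old.weight          # keep pretrained weights
    new.bias.copy_(old.bias)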

Will you also fine-tune the first convolutional layer, or the Stable Diffusion model, during training to accommodate the channel change?

BTW, will the code be released before the end of June?

Question about ablation study

Hi, thanks for sharing your great work!

I'm very interested in exploring the application of LDMs to virtual try-on and was inspired by your work, but I'm confused by the second and third rows of Tab. 4 in your paper.

I notice the performance doesn't drop noticeably with empty strings (row 1) or textual elements (row 2). How can I get the textual elements? Maybe by passing the garment images directly through the VE to the U-Net?

Moreover, why does the performance drop dramatically with f_theta, even much worse than with empty strings?

Looking forward to your reply! Thank you again!

What data will affect inference results of VITON-HD?

I found that only the cloth, image, openpose_json, and image-parse-v3 data are needed for inference on the VITON-HD dataset.
If I don't provide cloth-mask, image-parse-agnostic-v3.2, and so on, will it have any impact on the inference results?
Thank you.

Training Code

First of all, thank you for this amazing work! Do you plan to release the training code as well?

Use own model images

What projects (neural networks) should I use to generate the image-parse-v3 images and openpose_json files? I want to use my own model images, but if I understand the project correctly, this requires those images in these formats.

VAE with intermediate features takes up more GPU memory than original VAE

Hi, your work is so wonderful! Here are some questions.

I noticed that declaring val_pipe in the training code as an instance of StableDiffusionTryOnePipeline occupies a very large amount of GPU memory, and inference.py itself also occupies a large amount when running. Replacing the VAE with intermediate features with the original VAE makes this much better. Have you noticed this? May I ask what GPU you run inference.py on?

Thank you!

How to get "image-parse-v3" images

To run the inference.py script, you must have at least the images in the "cloth", "image", "image-parse-v3", and "openpose_json" folders (I use the VITON-HD dataset). Everything is clear for "image" and "cloth", and I have also learned how to obtain the "openpose_json" files. But I can't find out how to get the images for "image-parse-v3". Help me please.
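For what it's worth, such parse maps are usually produced by a human-parsing network (SCHP and CIHP_PGN are common choices, though which one VITON-HD used is not confirmed here). A hedged sketch of inspecting one, assuming the files are palette PNGs whose pixel values are class indices (the filename is hypothetical):

import numpy as np
from PIL import Image

# Hedged sketch: list the parsing class indices present in one map.
parse = Image.open("image-parse-v3/00001_00.png")  # hypothetical file
print(np.unique(np.asarray(parse)))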
