ali-vilab / videocomposer Goto Github PK

View Code? Open in Web Editor NEW

874.0 25.0 79.0 34.11 MB

Official repo for VideoComposer: Compositional Video Synthesis with Motion Controllability

Home Page: https://videocomposer.github.io

License: MIT License

Python 99.52% Shell 0.48%

diffusion-models video-generation video-synthesiswith

videocomposer's Issues

How to Achieve Video Inpainting？

There are examples of video inpainting on the project page, but I seem to be unable to achieve satisfactory inpainting results. Does the current version of code support video inpainting? Are there any script examples available?

Will this be unreleased because of audits?

metric-related

Hi, I see that there is such a setting in the paper: Given a scretch or image, and then adding text to generate a video, have you tested this setting with some evaluation metrics, I don’t see it in the paper. In addition, for table 2 in the paper, I see that the frame consistency of the video generated by the given depth sequence (motion information) and text is not very high, is it because of jitter? If possible, can you share with me the IDs of the 1000 Webvid videos used for testing in table 2, and we may follow your work to compare our methods.

How to modify the resolution of generated video?

Very interesting job! When I used it, I found that the model could only generate videos with a resolution of 256. How should I modify the parameters to make the model generate videos with a higher resolution?

I tried changing the "resolution" parameter in the config, but it didn't work.

Looking forward to your reply, thank you!

Question about Rescaling `noise` according to `guidance_rescale`

Thank you for releasing your code.
Have you used guidance scaling to your noise while training?
After finetuning your model, the color tone of the generated video turned a bit yellow.
Any suggestion?
Thanks a lot. : )

.

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

I configured the corresponding environment and then ran the code, and I wonder why I get the following error because I didn't modify your code and follow the README.md documentation.

[2023-06-20 10:30:44,328] INFO: Created a model with 1347M parameters
[mpeg4 @ 0xeeaa0c40] Application has requested 128 threads. Using a thread count greater than 16 is not recommended.
/root/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
warnings.warn(
[2023-06-20 10:30:47,417] INFO: GPU Memory used 12.86 GB
Traceback (most recent call last):
File "run_net.py", line 36, in
main()
File "run_net.py", line 31, in main
inference_single(cfg.cfg_dict)
File "/root/autodl-tmp/videocomposer/tools/videocomposer/inference_single.py", line 351, in inference_single
worker(0, cfg)
File "/root/autodl-tmp/videocomposer/tools/videocomposer/inference_single.py", line 695, in worker
video_output = diffusion.ddim_sample_loop(
File "/root/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 236, in ddim_sample_loop
xt, _ = self.ddim_sample(xt, t, model, model_kwargs, clamp, percentile, condition_fn, guide_scale, ddim_timesteps, eta)
File "/root/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 200, in ddim_sample
_, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 166, in p_mean_variance
var = _i(self.posterior_variance, t, xt)
File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 13, in _i
return tensor[t].view(shape).to(x)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

About hand crafted motion guidance

Do you plan to release the code for video generation using hand-crafted motion guidance?
If not, can you share the input data format of the hand-crafted motion and initial object position?

For instance, in the following image, the initial object position is depicted as a red star, and the motion is described as a red arrow.

How are they transformed to be a network input?
Use this red-black image as it is? or split the arrow into multiple pieces and use another image for the initial object position?

Thank you in advance.

Text + SingleImage not right

When I use text = "Smiling woman in cowboy hat with wheat ears" and single image and single image as below :

and the execute python run_net.py\ --cfg configs/exp04_sketch2video_wo_style.yaml\ --seed 144\ --sketch_path "demo_video/hat_woman.png"\ --input_text_desc "Smiling woman in cowboy hat with wheat ears"

Then i get gif as follow, and is not same as paper, How can I get the result as the paper? Thank you

cannnot install/import flash_attn

python 3.8.16
cuda 11.3
torch 1.12.0
gpu geforce rtx 3090

with flash-attn 0.2
note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for flash-attn Running setup.py clean for flash-attn Failed to build flash-attn ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
failed to build when i tried pip install flash-attn==0.2

so i tried with newest version
with flash-attn 2.5.6
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/cvnar1/anaconda3/envs/VideoComposer/lib/python3.8/site-packages/flash_attn/__init__.py", line 3, in <module> from flash_attn.flash_attn_interface import ( File "/home/cvnar1/anaconda3/envs/VideoComposer/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 10, in <module> import flash_attn_2_cuda as flash_attn_cuda ImportError: /home/cvnar1/anaconda3/envs/VideoComposer/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl8GPUTrace13gpuTraceStateE

please let me know why this error is occured.

only text+image generation

Thanks for the excellent work. After configuring the env and run some scripts, I get some pretty results.
And I wonder without the video input, how can I get the result like the project page shows as belowing pictures?

Question about extracting the different conditions

Thank you for the great repo.

When i try to run the code with the following command:
python run_net.py
--cfg configs/exp01_vidcomposer_full.yaml
--input_video "demo_video/blackswan.mp4"
--input_text_desc "A black swan swam in the water"
--seed 9999

some errors are occurred.

It seems that all conditional extractions have failed。

Could you tell me how to fix it ?

The download for the model weights isn't working.

Code update

@Steven-SWZhang Thanks for the code. I'm getting the below error while running the code:

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

On below line:

https://github.com/damo-vilab/videocomposer/blob/884dc223f7717c80d07b367d36b7bf51039ef6af/artist/ops/diffusion.py#L13

May I know if I have to pass any parameter to run on gpu.

How was the pretraining dataset Laion-400M used ? does this actually refer to the use of the ‘open_clip_pytorch_model’ from OPENCLIP ?

您好，请教下，论文里提到的用Laion-400M预训练，是指用Laion-400M对VideoComposer做了额外的预训练？如果是的话，预训练的输入组织方式，和参与训练的算法模块，可以讲解一下吗？谢谢 ~

PS: 看代码里和Laion相关的有2个预训练模型，没有找到Laion-400M相关的，是不是我理解错了？

"v2-1_512-ema-pruned.ckpt" ：预训练是用Laion5B
“open_clip_pytorch_model” ：预训练是Laion2B （OPENCLIP里的“ViT-H-14", pretrained="laion2b_s32b_b79k”）

looking forward to the open source of trainning code

as the theme saying.

Can you share bash command about "A moving golden moon" case?

Do current code support to add hand-crafted motion condition? If so, can you share this configure and demo pictures?

GLIBC_2.32 not found

Hi, has anyone faced this issue and/or knows a fix?
The proper GLIBC-Version can not be found. We have a newer one (2.35) installed on our image.
It occurs when trying to run the basic inference script:

python run_net.py\
    --cfg configs/exp02_motion_transfer.yaml\
    --seed 9999\
    --input_video "demo_video/motion_transfer.mp4"\
    --image_path "demo_video/moon_on_water.jpg"\
    --input_text_desc "A beautiful big moon on the water at night"

Is 2.32 specifically necessary or could there be another issue?

ImportError: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found

I would like to ask this before considering downgrading GLIBC.
Thank you.

plan for releasing the training code?

Hi @Steven-SWZhang , thanks for the awesome work.
Do you have any plans for releasing the training code?

CUDA error: no kernel image is available for execution on the device

We run the scripts for inference a video by the command python run_net.py \ --cfg configs/exp01_vidcomposer_full.yaml \ --input_video "demo_video/blackswan.mp4" \ --input_text_desc "A black swan swam in the water" \ --seed 9999 . We get the error as follows,
File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/root/paddlejob/workspace/lxz/videocomposer/tools/videocomposer/unet_sd.py", line 238, in forward out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op) File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/xformers/ops.py", line 574, in memory_efficient_attention return op.forward_no_grad( File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/xformers/ops.py", line 189, in forward_no_grad return cls.FORWARD_OPERATOR( File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/_ops.py", line 143, in __call__ return self._op(*args, **kwargs or {}) RuntimeError: CUDA error: no kernel image is available for execution on the device

The version of torch is the same as yours. The version of cuda is 11.3, and torch==1.12.0+cu113, torchvision==0.13.0+cu113. We use a V100, and when we execute nvidia-smi the cuda version shown on V100 is 11.4. We think the version of our machine is compatible, and we do not know where the problem is.

torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGSEGV

Traceback (most recent call last):
  File "run_net.py", line 37, in <module>
    main()
  File "run_net.py", line 32, in main
    inference_single(cfg.cfg_dict)
  File "/root/videocomposer/tools/videocomposer/inference_single.py", line 353, in inference_single
    mp.spawn(worker, nprocs=cfg.gpus_per_machine, args=(cfg, ))
  File "/opt/miniconda3/envs/Paint-by-Example/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/miniconda3/envs/Paint-by-Example/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/miniconda3/envs/Paint-by-Example/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGSEGV
free(): invalid pointer
已放弃 (核心已转储)

Python==3.8.5
torch=='1.13.1+cu117'
transformers==4.19.2
xformers==0.0.16

No FlashAttetion

Training code

Hi,
awesome work! I wonder if you're going to release also the training code. I would like to finetune the model on my own data. Is it possible in the future?

Some questions about STC encoder

Hello, do all control conditions share the same STC encoder model weight?

Video inpainting using a sequence of masks

Hi Shiwei, @Steven-SWZhang

Thank you for publically making available the great work you have done.
I have been trying to reproduce the results for the task "video inpainting using a sequence of masks". More specifically, I have a video including 10 frames and 10 masks corresponding to those 10 frames of videos. So, I would like to feed the video alongside the sequence of masks and a text prompt to the model. So, I expect to get a temporally consistent video as an output in a way that the output video adheres to the sequence of masks and the text prompt.

However, I could not see any argument for the input mask. So, I went through the code, and as far as I understood, it seems that the code itself generates a random mask on the input video. The code below (inference_single.py) shows my explanation:

in which the function "make_masked_images" is:

So, as far as I realized, this "mask" variable in line 564 of the first snapshot is initialized with the "batch" variable (which comes from the dataloader) in the picture below:

So, when I went through the "dataset.py" code, I found out the mask is somehow randomly generated as the following:

So, my understanding is that the code only conditions the model on this randomly generated mask. So, if my understanding is correct, does it mean that we cannot feed an external sequence of masks to the model? If the understanding is not correct, I would appreciate it if you could explain how I can feed the sequence of masks to the model as I could not find anything in the code.

Thank you in advance for putting time into this case.

Kind Regards,
Amir

hand-crafted motions and hand-crafted sketch

Can you tell me how to set up this function?

The error when install the flash-attn

Hi, when I install the flash-attn package, there is no reaction in the process of building

the running script for “Text,style,hand-crafted motions and hand-carfted sketch”

Hello, great work. May I ask when will the running script for “Text,style,hand-crafted motions and hand-carfted sketch” be released?
Thanks for your work!

not found for model ViT-H-14.

when I run !python run_net.py
--cfg configs/exp01_vidcomposer_full.yaml
--input_video "demo_video/blackswan.mp4"
--input_text_desc "A black swan swam in the water"
--seed 9999

[2023-08-31 10:58:03,125] INFO: Loading ViT-H-14 model config.
[2023-08-31 10:58:15,720] WARNING: Pretrained weights (/content/vc/model_weights/open_clip_pytorch_model.bin) not found for model ViT-H-14.
Traceback (most recent call last):
File "/content/vc/run_net.py", line 36, in
main()
File "/content/vc/run_net.py", line 28, in main
inference_multi(cfg.cfg_dict)
File "/content/vc/tools/videocomposer/inference_multi.py", line 345, in inference_multi
worker(0, cfg)
File "/content/vc/tools/videocomposer/inference_multi.py", line 421, in worker
clip_encoder = FrozenOpenCLIPEmbedder(layer='penultimate',pretrained = DOWNLOAD_TO_CACHE(cfg.clip_checkpoint))
File "/content/vc/tools/videocomposer/inference_multi.py", line 108, in init
model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=pretrained)
File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 151, in create_model_and_transforms
model = create_model(
File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 122, in create_model
raise RuntimeError(f'Pretrained weights ({pretrained}) not found for model {model_name}.')
RuntimeError: Pretrained weights (/content/vc/model_weights/open_clip_pytorch_model.bin) not found for model ViT-H-14.

cannot install motion-vector-extractor

Hello, I cannot install motion-vector-extractor by pip or the source code since the arch of my computer may has some incompatible issues with ffmpeg, e.g., libavformat.
Is there any way that I can replace the motion-vector-extractor? maybe simply opencv default optical flow extractor?
By the way, great work!

killed

My GPU is V100 32G;
When using the pre-trained model to inference, an error will often be killed, but sometimes it can run successfully, has anyone encountered this situation
[2023-08-11 13:09:34,289] INFO: Loading ViT-H-14 model config.
[2023-08-11 13:09:47,112] INFO: Loading pretrained ViT-H-14 weights (/root/autodl-tmp/videocomposer/model_weights/open_clip_pytorch_model.bin).
[2023-08-11 13:09:52,142] INFO: Loading ViT-H-14 model config.
[2023-08-11 13:10:04,689] INFO: Loading pretrained ViT-H-14 weights (/root/autodl-tmp/videocomposer/model_weights/open_clip_pytorch_model.bin).
run_bash.sh: line 49: 2445 Killed python run_net.py --cfg configs/exp04_sketch2video_wo_style.yaml --seed 144 --sketch_path "demo_video/src_single_sketch.png" --input_text_desc "A little bird is standing on a branch"

How to train the model

Hi, guys, thx your wonderful work, but I'm intense curiosity about when will open your train script.

ResolvePackageNotFound

配置环境
D:\videocomposer>conda env create -f environment.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

pip==23.0.1=py38h06a4308_0
tk==8.6.12=h1ccaba5_0
zlib==1.2.13=h5eee18b_0
libstdcxx-ng==11.2.0=h1234567_1
sqlite==3.41.1=h5eee18b_0
python==3.8.16=h7a1cb2a_3
ca-certificates==2023.01.10=h06a4308_0
ncurses==6.4=h6a678d5_0
wheel==0.38.4=py38h06a4308_0
_openmp_mutex==5.1=1_gnu
xz==5.2.10=h5eee18b_1
libgomp==11.2.0=h1234567_1
libffi==3.4.2=h6a678d5_6
libgcc-ng==11.2.0=h1234567_1
setuptools==65.6.3=py38h06a4308_0
ld_impl_linux-64==2.38=h1181459_1
readline==8.2=h5eee18b_0
certifi==2022.12.7=py38h06a4308_0
openssl==1.1.1t=h7f8727e_0

Keypoin or skelton video generation!

Is there any plan to add keypoint or skeleton sequence based human video generation?

How to adjust parameters to ensure consistent output video multiple times?

support for other resolutions

So it seems outputs are always 256x256 - is this a limitation of the model, or is that just a choice based on vram requirements? Would it be easy to enable other resolutions?

计算代价询问

作者你好，看到你们的工作之后感觉非常有意思。请问本工作的训练计算资源要求是什么样的？例如说训练过程需要什么级别的显卡，多少张，训练多久？

Error while processing rearrange-reduction pattern "(i j) c f h w -> c f (i h) (j w)". Input tensor shape: torch.Size([3, 16, 256, 256]). Additional info: {'i': 1}.

about the script of I2V generation

Hi, thanks for sharing the great work. I woder if you could provide the inference code for image to video generation? Thanks!

fvd performance on MSR-VTT

Thank you for presenting such an exciting work. Congratulations!

I have a question regarding Table A3. Could you please provide more details on how the FVD is calculated? As this metric can be very sensitive to certain settings, I would like to know more about the resolution (256?), number of frames, and how you processed the captions. Additionally, I noticed that there are multiple correspondences to the same video. Could you please explain how you handled this?

Thank you!

根据conda env create -f environment.yaml 安装环境报错，请问这个该如何解决呢？

ali-vilab / videocomposer Goto Github PK

videocomposer's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs