ali-vilab / videocomposer

Official repo for VideoComposer: Compositional Video Synthesis with Motion Controllability

Home Page: https://videocomposer.github.io

License: MIT License

Languages: Python 99.52%, Shell 0.48%
Topics: diffusion-models, video-generation, video-synthesis

videocomposer's Issues

How to Achieve Video Inpainting?

There are examples of video inpainting on the project page, but I have been unable to achieve satisfactory inpainting results. Does the current version of the code support video inpainting? Are there any example scripts available?

metric-related

Hi, I see the following setting in the paper: given a sketch or image, text is added to generate a video. Have you tested this setting with any evaluation metrics? I don't see it in the paper. In addition, for Table 2 in the paper, the frame consistency of videos generated from a given depth sequence (motion information) and text is not very high; is this because of jitter? If possible, could you share the IDs of the 1000 WebVid videos used for testing in Table 2, so that we can follow your work and compare our methods?

How to modify the resolution of generated video?

Very interesting work! When I used it, I found that the model can only generate videos at a resolution of 256. How should I modify the parameters so that the model generates videos at a higher resolution?

I tried changing the "resolution" parameter in the config, but it didn't work.

Looking forward to your reply, thank you!

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

I configured the corresponding environment and then ran the code following the README.md, without modifying anything, so I wonder why I get the following error.

[2023-06-20 10:30:44,328] INFO: Created a model with 1347M parameters
[mpeg4 @ 0xeeaa0c40] Application has requested 128 threads. Using a thread count greater than 16 is not recommended.
/root/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
warnings.warn(
[2023-06-20 10:30:47,417] INFO: GPU Memory used 12.86 GB
Traceback (most recent call last):
  File "run_net.py", line 36, in <module>
    main()
  File "run_net.py", line 31, in main
    inference_single(cfg.cfg_dict)
  File "/root/autodl-tmp/videocomposer/tools/videocomposer/inference_single.py", line 351, in inference_single
    worker(0, cfg)
  File "/root/autodl-tmp/videocomposer/tools/videocomposer/inference_single.py", line 695, in worker
    video_output = diffusion.ddim_sample_loop(
  File "/root/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 236, in ddim_sample_loop
    xt, _ = self.ddim_sample(xt, t, model, model_kwargs, clamp, percentile, condition_fn, guide_scale, ddim_timesteps, eta)
  File "/root/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 200, in ddim_sample
    _, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
  File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 166, in p_mean_variance
    var = _i(self.posterior_variance, t, xt)
  File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 13, in _i
    return tensor[t].view(shape).to(x)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
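
In case it helps: a minimal workaround sketch (my own guess, not an official fix) is to keep the timestep indices on the same device as the indexed buffer inside artist/ops/diffusion.py, assuming _i has the usual (tensor, t, x) signature and the posterior buffers live on the CPU while t is a CUDA tensor:

    # hypothetical patch of the _i helper in artist/ops/diffusion.py
    def _i(tensor, t, x):
        """Index a 1-D buffer with timesteps t and reshape to broadcast against x."""
        shape = (x.size(0),) + (1,) * (x.ndim - 1)
        # move the indices onto the buffer's device before indexing,
        # then cast the result to x's device/dtype as before
        return tensor[t.to(tensor.device)].view(shape).to(x)

Alternatively, moving the buffers onto the GPU once at setup (e.g. self.posterior_variance = self.posterior_variance.cuda()) avoids the per-step transfer.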

About hand crafted motion guidance

Do you plan to release the code for video generation using hand-crafted motion guidance?
If not, can you share the input data format of the hand-crafted motion and initial object position?

For instance, in the following image, the initial object position is depicted as a red star, and the motion is described as a red arrow.
[image: red star marking the initial object position, red arrow describing the motion]

How are they transformed to be a network input?
Do you use this red-and-black image as-is, or do you split the arrow into multiple pieces and use a separate image for the initial object position?

Thank you in advance.
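
To make the question concrete, here is the kind of input I would guess at (purely my own assumption, not the authors' pipeline): rasterize the hand-drawn star/arrow into dense per-frame (u, v) motion fields, spreading the displacement evenly over the clip so it resembles the extracted motion vectors used elsewhere in the paper:

    import numpy as np

    def arrow_to_motion_fields(start_xy, end_xy, num_frames=16, height=256, width=256, radius=40):
        """Turn one hand-drawn arrow into per-frame dense (u, v) motion maps."""
        sx, sy = start_xy
        ex, ey = end_xy
        step = np.array([ex - sx, ey - sy], dtype=np.float32) / num_frames  # per-frame shift
        ys, xs = np.mgrid[0:height, 0:width]
        fields = []
        for i in range(num_frames):
            # object centre along the straight-line trajectory at frame i
            cx, cy = sx + step[0] * i, sy + step[1] * i
            inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
            field = np.zeros((2, height, width), dtype=np.float32)
            field[0][inside] = step[0]  # u component
            field[1][inside] = step[1]  # v component
            fields.append(field)
        return np.stack(fields)  # [num_frames, 2, H, W]

Is this anywhere close to the format you used, or is the star/arrow image fed in directly?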

Text + SingleImage not right

When I use the text "Smiling woman in cowboy hat with wheat ears" and the single image below:
[image: hat_woman.png]
and then execute:

python run_net.py\
    --cfg configs/exp04_sketch2video_wo_style.yaml\
    --seed 144\
    --sketch_path "demo_video/hat_woman.png"\
    --input_text_desc "Smiling woman in cowboy hat with wheat ears"

I get the GIF below, which is not the same as in the paper. How can I reproduce the result from the paper? Thank you.
[generated GIF: S144]

cannot install/import flash_attn

python 3.8.16
cuda 11.3
torch 1.12.0
gpu geforce rtx 3090

with flash-attn 0.2
The build failed when I tried pip install flash-attn==0.2:

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

So I tried the newest version.
with flash-attn 2.5.6

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cvnar1/anaconda3/envs/VideoComposer/lib/python3.8/site-packages/flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "/home/cvnar1/anaconda3/envs/VideoComposer/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
    import flash_attn_2_cuda as flash_attn_cuda
ImportError: /home/cvnar1/anaconda3/envs/VideoComposer/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl8GPUTrace13gpuTraceStateE

Please let me know why this error occurs.
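
For what it's worth, the undefined symbol (c10::impl::GPUTrace::gpuTraceState) suggests the prebuilt flash-attn 2.5.6 wheel was compiled against a newer torch ABI than the torch 1.12.0 in this environment, so the import fails. A quick check before reinstalling (just my own diagnostic); if the versions below don't match what the chosen flash-attn release was built for, building flash-attn from source inside this exact environment may be the safer route:

    import torch

    # flash-attn binaries must match this torch/CUDA pair;
    # the repo's environment targets torch 1.12.0 with CUDA 11.3
    print(torch.__version__, torch.version.cuda)
    print(torch.cuda.get_device_name(0))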

only text+image generation

Thanks for the excellent work. After configuring the environment and running some scripts, I got some nice results.
I wonder: without a video input, how can I get results like the ones shown on the project page, as in the pictures below?
[images from the project page]

Question about extracting the different conditions

Thank you for the great repo.

When I try to run the code with the following command:
python run_net.py\
    --cfg configs/exp01_vidcomposer_full.yaml\
    --input_video "demo_video/blackswan.mp4"\
    --input_text_desc "A black swan swam in the water"\
    --seed 9999

some errors occur.
[screenshot of the error messages]

It seems that all of the condition extractions have failed.
[output image: rank_1-0]

Could you tell me how to fix it?

How was the pretraining dataset LAION-400M used? Does this actually refer to the use of the 'open_clip_pytorch_model' from OpenCLIP?

Hello, a question: the paper mentions pretraining on LAION-400M. Does this mean that VideoComposer itself was additionally pretrained on LAION-400M? If so, could you explain how the pretraining inputs are organized and which modules take part in that training? Thanks!

PS: In the code I see two pretrained models related to LAION, but neither seems to be LAION-400M related. Am I misunderstanding something?

  • "v2-1_512-ema-pruned.ckpt": pretrained on LAION-5B
  • "open_clip_pytorch_model": pretrained on LAION-2B (OpenCLIP's "ViT-H-14", pretrained="laion2b_s32b_b79k")

GLIBC_2.32 not found

Hi, has anyone faced this issue and/or knows a fix?
The required GLIBC version cannot be found, even though a newer one (2.35) is installed on our image.
It occurs when trying to run the basic inference script:

python run_net.py\
    --cfg configs/exp02_motion_transfer.yaml\
    --seed 9999\
    --input_video "demo_video/motion_transfer.mp4"\
    --image_path "demo_video/moon_on_water.jpg"\
    --input_text_desc "A beautiful big moon on the water at night"

Is 2.32 specifically necessary or could there be another issue?

ImportError: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found

I would like to ask this before considering downgrading GLIBC.
Thank you.
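
As an addendum: a quick way to check which glibc the Python process is actually linked against (just a diagnostic on my side). If it reports something older than 2.32, the interpreter is not running against the 2.35 you installed (e.g. it lives in a different container or chroot):

    import platform

    # reports the C library the running interpreter uses, e.g. ('glibc', '2.31')
    print(platform.libc_ver())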

CUDA error: no kernel image is available for execution on the device

We run the inference script with the command:

python run_net.py\
    --cfg configs/exp01_vidcomposer_full.yaml\
    --input_video "demo_video/blackswan.mp4"\
    --input_text_desc "A black swan swam in the water"\
    --seed 9999

and get the following error:

  File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/paddlejob/workspace/lxz/videocomposer/tools/videocomposer/unet_sd.py", line 238, in forward
    out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op)
  File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/xformers/ops.py", line 574, in memory_efficient_attention
    return op.forward_no_grad(
  File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/xformers/ops.py", line 189, in forward_no_grad
    return cls.FORWARD_OPERATOR(
  File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: no kernel image is available for execution on the device

Our torch version is the same as yours: CUDA 11.3, torch==1.12.0+cu113, torchvision==0.13.0+cu113. We use a V100, and nvidia-smi reports CUDA version 11.4. We think our machine's versions are compatible, and we do not know where the problem is.
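
"No kernel image is available" usually means some compiled extension was built for a different GPU architecture than the one in use; a V100 is compute capability 7.0 (sm_70). A quick check (my own diagnostic, not from the repo) is below; if 'sm_70' is missing from the arch list, or if the xformers wheel was built for other architectures, rebuilding xformers from source in this environment may help:

    import torch

    print(torch.__version__, torch.version.cuda)
    print("device:", torch.cuda.get_device_name(0))
    print("capability:", torch.cuda.get_device_capability(0))  # V100 -> (7, 0)
    print("torch built for:", torch.cuda.get_arch_list())      # look for 'sm_70'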

torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGSEGV

Traceback (most recent call last):
  File "run_net.py", line 37, in <module>
    main()
  File "run_net.py", line 32, in main
    inference_single(cfg.cfg_dict)
  File "/root/videocomposer/tools/videocomposer/inference_single.py", line 353, in inference_single
    mp.spawn(worker, nprocs=cfg.gpus_per_machine, args=(cfg, ))
  File "/opt/miniconda3/envs/Paint-by-Example/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/miniconda3/envs/Paint-by-Example/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/miniconda3/envs/Paint-by-Example/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGSEGV
free(): invalid pointer
Aborted (core dumped)

Python==3.8.5
torch=='1.13.1+cu117'
transformers==4.19.2
xformers==0.0.16

No FlashAttention

Training code

Hi,
awesome work! I wonder if you are also going to release the training code. I would like to fine-tune the model on my own data. Will that be possible in the future?

Video inpainting using a sequence of masks

Hi Shiwei, @Steven-SWZhang

Thank you for making this great work publicly available.
I have been trying to reproduce the results for the task "video inpainting using a sequence of masks". More specifically, I have a video of 10 frames and 10 masks corresponding to those frames. I would like to feed the video, the sequence of masks, and a text prompt to the model, and get a temporally consistent output video that adheres to both the mask sequence and the text prompt.

However, I could not see any argument for an input mask. So I went through the code, and as far as I understand, the code itself generates a random mask over the input video. The snippet below (from inference_single.py) shows what I mean:

[code screenshot from inference_single.py]

in which the function "make_masked_images" is:

[code screenshot of make_masked_images]

As far as I can tell, the "mask" variable in line 564 of the first snippet is initialized from the "batch" variable (which comes from the dataloader), shown below:

[code screenshot]

When I went through dataset.py, I found that the mask is generated somewhat randomly, as follows:

[code screenshot from dataset.py]

So my understanding is that the code only conditions the model on this randomly generated mask. If that is correct, does it mean we cannot feed an external sequence of masks to the model? If my understanding is not correct, I would appreciate an explanation of how to feed a sequence of masks to the model, as I could not find anything in the code; a minimal sketch of what I am trying to do follows below.
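
For clarity, this is the kind of helper I have in mind, a minimal sketch under my own assumptions (per-frame PNG masks, white = region to inpaint, and a mask tensor of shape [frames, 1, H, W] that could stand in for the randomly generated one from the dataloader):

    from pathlib import Path

    import torch
    import torchvision.transforms as T
    from PIL import Image

    def load_mask_sequence(mask_dir, size=(256, 256)):
        """Load an ordered sequence of binary masks as a [frames, 1, H, W] tensor."""
        to_tensor = T.Compose([T.Resize(size), T.ToTensor()])
        frames = []
        for path in sorted(Path(mask_dir).glob("*.png")):
            mask = to_tensor(Image.open(path).convert("L"))  # [1, H, W], values in [0, 1]
            frames.append((mask > 0.5).float())              # binarize
        return torch.stack(frames)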

Thank you in advance for putting time into this case.

Kind Regards,
Amir

Pretrained weights not found for model ViT-H-14

When I run:

!python run_net.py\
    --cfg configs/exp01_vidcomposer_full.yaml\
    --input_video "demo_video/blackswan.mp4"\
    --input_text_desc "A black swan swam in the water"\
    --seed 9999

[2023-08-31 10:58:03,125] INFO: Loading ViT-H-14 model config.
[2023-08-31 10:58:15,720] WARNING: Pretrained weights (/content/vc/model_weights/open_clip_pytorch_model.bin) not found for model ViT-H-14.
Traceback (most recent call last):
  File "/content/vc/run_net.py", line 36, in <module>
    main()
  File "/content/vc/run_net.py", line 28, in main
    inference_multi(cfg.cfg_dict)
  File "/content/vc/tools/videocomposer/inference_multi.py", line 345, in inference_multi
    worker(0, cfg)
  File "/content/vc/tools/videocomposer/inference_multi.py", line 421, in worker
    clip_encoder = FrozenOpenCLIPEmbedder(layer='penultimate',pretrained = DOWNLOAD_TO_CACHE(cfg.clip_checkpoint))
  File "/content/vc/tools/videocomposer/inference_multi.py", line 108, in __init__
    model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=pretrained)
  File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 151, in create_model_and_transforms
    model = create_model(
  File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 122, in create_model
    raise RuntimeError(f'Pretrained weights ({pretrained}) not found for model {model_name}.')
RuntimeError: Pretrained weights (/content/vc/model_weights/open_clip_pytorch_model.bin) not found for model ViT-H-14.
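
In case someone else hits this on Colab: the error only means that model_weights/open_clip_pytorch_model.bin is missing. The repo's own download links are the safest source; assuming the checkpoint is the standard laion2b_s32b_b79k ViT-H-14 weights mirrored on the Hugging Face Hub (my assumption, not verified against the repo's file), something like this should fetch it:

    from huggingface_hub import hf_hub_download

    # assumed mirror of the OpenCLIP ViT-H-14 (laion2b_s32b_b79k) checkpoint
    path = hf_hub_download(
        repo_id="laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
        filename="open_clip_pytorch_model.bin",
        local_dir="model_weights",  # the path the config points at
    )
    print(path)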

cannot install motion-vector-extractor

Hello, I cannot install motion-vector-extractor via pip or from source, since my machine's architecture seems to have some incompatibility with ffmpeg, e.g., libavformat.
Is there any way to replace motion-vector-extractor, perhaps simply with OpenCV's default optical flow extractor (a rough sketch of what I mean is below)?
By the way, great work!
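
To make the question concrete, this is the rough substitute I have in mind: dense Farneback optical flow from OpenCV, downsampled to a coarse grid so it loosely resembles block motion vectors (clearly not a drop-in replacement for the MPEG-4 motion vectors the repo extracts):

    import cv2
    import numpy as np

    def farneback_motion_vectors(video_path, block=16):
        """Approximate block-level motion vectors with dense Farneback optical flow."""
        cap = cv2.VideoCapture(video_path)
        ok, prev = cap.read()
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        fields = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)  # [H, W, 2]
            fields.append(flow[::block, ::block])  # keep one (u, v) vector per block
            prev_gray = gray
        cap.release()
        return np.stack(fields)  # [frames - 1, H/block, W/block, 2]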

killed

My GPU is a V100 with 32 GB.
When running inference with the pre-trained model, the process often gets killed, but sometimes it runs successfully. Has anyone encountered this situation?
[2023-08-11 13:09:34,289] INFO: Loading ViT-H-14 model config.
[2023-08-11 13:09:47,112] INFO: Loading pretrained ViT-H-14 weights (/root/autodl-tmp/videocomposer/model_weights/open_clip_pytorch_model.bin).
[2023-08-11 13:09:52,142] INFO: Loading ViT-H-14 model config.
[2023-08-11 13:10:04,689] INFO: Loading pretrained ViT-H-14 weights (/root/autodl-tmp/videocomposer/model_weights/open_clip_pytorch_model.bin).
run_bash.sh: line 49: 2445 Killed python run_net.py --cfg configs/exp04_sketch2video_wo_style.yaml --seed 144 --sketch_path "demo_video/src_single_sketch.png" --input_text_desc "A little bird is standing on a branch"

How to train the model

Hi, thanks for your wonderful work! I'm very curious: when will you release the training script?

ResolvePackageNotFound

Setting up the environment:
D:\videocomposer>conda env create -f environment.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

  • pip==23.0.1=py38h06a4308_0
  • tk==8.6.12=h1ccaba5_0
  • zlib==1.2.13=h5eee18b_0
  • libstdcxx-ng==11.2.0=h1234567_1
  • sqlite==3.41.1=h5eee18b_0
  • python==3.8.16=h7a1cb2a_3
  • ca-certificates==2023.01.10=h06a4308_0
  • ncurses==6.4=h6a678d5_0
  • wheel==0.38.4=py38h06a4308_0
  • _openmp_mutex==5.1=1_gnu
  • xz==5.2.10=h5eee18b_1
  • libgomp==11.2.0=h1234567_1
  • libffi==3.4.2=h6a678d5_6
  • libgcc-ng==11.2.0=h1234567_1
  • setuptools==65.6.3=py38h06a4308_0
  • ld_impl_linux-64==2.38=h1181459_1
  • readline==8.2=h5eee18b_0
  • certifi==2022.12.7=py38h06a4308_0
  • openssl==1.1.1t=h7f8727e_0
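
For what it's worth, the failing specs all carry Linux build hashes (h06a4308, h5eee18b, ...), so this environment.yaml was exported on Linux and will not resolve as-is on Windows. A workaround I would try (my own sketch; some pinned versions may still be Linux-only, so results on Windows are not guaranteed) is to strip the build strings and recreate the environment from the relaxed file:

    import re

    # turn "- pip==23.0.1=py38h06a4308_0" into "- pip==23.0.1"
    with open("environment.yaml", encoding="utf-8") as f:
        text = f.read()
    relaxed = re.sub(r"(==[^=\s]+)=\S+", r"\1", text)

    with open("environment_relaxed.yaml", "w", encoding="utf-8") as f:
        f.write(relaxed)
    # then: conda env create -f environment_relaxed.yaml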

support for other resolutions

So it seems the outputs are always 256x256. Is this a limitation of the model, or just a choice based on VRAM requirements? Would it be easy to enable other resolutions?

Question about computational cost

Hello authors, your work looks very interesting. What computational resources does training require? For example, what class of GPU is needed, how many GPUs, and how long does training take?

fvd performance on MSR-VTT

Thank you for presenting such an exciting work. Congratulations!

I have a question regarding Table A3. Could you please provide more details on how the FVD is calculated? As this metric can be very sensitive to certain settings, I would like to know more about the resolution (256?), the number of frames, and how you processed the captions. Additionally, I noticed that multiple captions correspond to the same video; could you please explain how you handled this?

Thank you!

the running script for “sequence-to-video generation”

Hello, great work! May I ask when the running scripts for "sequence-to-video generation", such as sketch sequence-to-video generation and compositional depth sequence-to-video generation, will be released?
Thanks for your work!

Is there any reference material on damo-vilab/MS-Vid2Vid-XL?

It seems this network could serve as a post-processing refinement module for many video synthesis pipelines.
Is there any introduction to how it works and details on how to use it?
Thanks.

Also, do you have plans to add an outpainting feature later?
