ali-vilab / videocomposer
Official repo for VideoComposer: Compositional Video Synthesis with Motion Controllability
Home Page: https://videocomposer.github.io
License: MIT License
There are examples of video inpainting on the project page, but I seem to be unable to achieve satisfactory inpainting results. Does the current version of code support video inpainting? Are there any script examples available?
Hi, I see that there is such a setting in the paper: given a sketch or image, plus text, generate a video. Have you evaluated this setting with any metrics? I don't see it in the paper. In addition, for Table 2 in the paper, I see that the frame consistency of videos generated from a given depth sequence (motion information) and text is not very high; is that because of jitter? If possible, could you share the IDs of the 1000 WebVid videos used for testing in Table 2, so we can follow your work and compare our methods?
Very interesting work! When I used it, I found that the model can only generate videos at a resolution of 256. How should I modify the parameters to make the model generate higher-resolution videos?
I tried changing the "resolution" parameter in the config, but it didn't work.
Looking forward to your reply, thank you!
Thank you for releasing your code.
Did you apply guidance scaling to the noise while training?
After fine-tuning your model, the color tone of the generated videos turned a bit yellow.
Any suggestions?
Thanks a lot. : )
I set up the environment and ran the code following the README.md, without modifying anything, so I wonder why I get the following error.
[2023-06-20 10:30:44,328] INFO: Created a model with 1347M parameters
[mpeg4 @ 0xeeaa0c40] Application has requested 128 threads. Using a thread count greater than 16 is not recommended.
/root/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
warnings.warn(
[2023-06-20 10:30:47,417] INFO: GPU Memory used 12.86 GB
Traceback (most recent call last):
File "run_net.py", line 36, in
main()
File "run_net.py", line 31, in main
inference_single(cfg.cfg_dict)
File "/root/autodl-tmp/videocomposer/tools/videocomposer/inference_single.py", line 351, in inference_single
worker(0, cfg)
File "/root/autodl-tmp/videocomposer/tools/videocomposer/inference_single.py", line 695, in worker
video_output = diffusion.ddim_sample_loop(
File "/root/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 236, in ddim_sample_loop
xt, _ = self.ddim_sample(xt, t, model, model_kwargs, clamp, percentile, condition_fn, guide_scale, ddim_timesteps, eta)
File "/root/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 200, in ddim_sample
_, _, _, x0 = self.p_mean_variance(xt, t, model, model_kwargs, clamp, percentile, guide_scale)
File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 166, in p_mean_variance
var = _i(self.posterior_variance, t, xt)
File "/root/autodl-tmp/videocomposer/artist/ops/diffusion.py", line 13, in _i
return tensor[t].view(shape).to(x)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
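For anyone hitting the same thing, the error itself is a device mismatch: a CPU tensor is being indexed with a CUDA index. Below is a minimal, self-contained reproduction plus the workaround I used locally; it is my own sketch, not an official fix.

import torch

# Minimal reproduction (my own sketch, not code from the repo): the diffusion
# schedule tables (e.g. posterior_variance) live on the CPU, while the timestep
# indices t are on the GPU, so tensor[t] raises
# "indices should be either on cpu or on the same device as the indexed tensor".
posterior_variance = torch.rand(1000)              # CPU tensor, like the schedule table
t = torch.tensor([10, 20], device="cuda")          # timesteps on the GPU

# posterior_variance[t]                            # -> RuntimeError
values = posterior_variance[t.cpu()].to(t.device)  # workaround: index on CPU, then move the result
# (equivalently, move the whole table to the GPU once with posterior_variance.cuda())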
Do you plan to release the code for video generation using hand-crafted motion guidance?
If not, can you share the input data format of the hand-crafted motion and initial object position?
For instance, in the following image, the initial object position is depicted as a red star, and the motion is described as a red arrow.
How are they transformed to be a network input?
Should we use this red/black image as-is, or split the arrow into multiple pieces and use a separate image for the initial object position?
Thank you in advance.
When I use text = "Smiling woman in cowboy hat with wheat ears" and a single image as below:
and then execute
python run_net.py\
--cfg configs/exp04_sketch2video_wo_style.yaml\
--seed 144\
--sketch_path "demo_video/hat_woman.png"\
--input_text_desc "Smiling woman in cowboy hat with wheat ears"
I get the GIF below, which is not the same as in the paper. How can I get the result shown in the paper? Thank you.
python 3.8.16
cuda 11.3
torch 1.12.0
gpu geforce rtx 3090
with flash-attn 0.2
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
It failed to build when I tried pip install flash-attn==0.2, so I tried the newest version instead:
with flash-attn 2.5.6
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/cvnar1/anaconda3/envs/VideoComposer/lib/python3.8/site-packages/flash_attn/__init__.py", line 3, in <module>
from flash_attn.flash_attn_interface import (
File "/home/cvnar1/anaconda3/envs/VideoComposer/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /home/cvnar1/anaconda3/envs/VideoComposer/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl8GPUTrace13gpuTraceStateE
Please let me know why this error occurs.
Thank you for the great repo.
When I try to run the code with the following command:
python run_net.py
--cfg configs/exp01_vidcomposer_full.yaml
--input_video "demo_video/blackswan.mp4"
--input_text_desc "A black swan swam in the water"
--seed 9999
It seems that all condition extractions have failed.
Could you tell me how to fix it?
@Steven-SWZhang Thanks for the code. I'm getting the below error while running the code:
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
On the line below:
May I know if I have to pass any parameter to run it on the GPU?
Hello, a question: the paper mentions pre-training with Laion-400M. Does that mean VideoComposer was additionally pre-trained on Laion-400M? If so, could you explain how the pre-training inputs were organized and which modules of the model participated in that training? Thanks!
PS: In the code I see two Laion-related pretrained models, but I couldn't find anything specific to Laion-400M. Did I misunderstand something?
As the title says:
Does the current code support adding a hand-crafted motion condition? If so, can you share the configuration and demo pictures?
Hi, has anyone faced this issue and/or knows a fix?
The required GLIBC version cannot be found; we have a newer one (2.35) installed on our image.
It occurs when trying to run the basic inference script:
python run_net.py\
--cfg configs/exp02_motion_transfer.yaml\
--seed 9999\
--input_video "demo_video/motion_transfer.mp4"\
--image_path "demo_video/moon_on_water.jpg"\
--input_text_desc "A beautiful big moon on the water at night"
Is 2.32 specifically necessary or could there be another issue?
ImportError: /usr/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found
I would like to ask this before considering downgrading GLIBC.
Thank you.
Hi @Steven-SWZhang , thanks for the awesome work.
Do you have any plans for releasing the training code?
We run the inference script with the command
python run_net.py \
--cfg configs/exp01_vidcomposer_full.yaml \
--input_video "demo_video/blackswan.mp4" \
--input_text_desc "A black swan swam in the water" \
--seed 9999
and get the following error:
File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/root/paddlejob/workspace/lxz/videocomposer/tools/videocomposer/unet_sd.py", line 238, in forward out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op) File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/xformers/ops.py", line 574, in memory_efficient_attention return op.forward_no_grad( File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/xformers/ops.py", line 189, in forward_no_grad return cls.FORWARD_OPERATOR( File "/root/paddlejob/workspace/lxz/miniconda3/envs/VideoComposer/lib/python3.8/site-packages/torch/_ops.py", line 143, in __call__ return self._op(*args, **kwargs or {}) RuntimeError: CUDA error: no kernel image is available for execution on the device
Our torch version is the same as yours: CUDA 11.3, torch==1.12.0+cu113, torchvision==0.13.0+cu113. We use a V100, and when we execute nvidia-smi the CUDA version shown is 11.4. We believe our machine's versions are compatible, so we do not know where the problem is.
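A small sanity check we ran ourselves to narrow it down (our own diagnostic snippet, not from the repo): the "no kernel image" error usually means the installed CUDA kernels were not compiled for this GPU's compute capability (sm_70 on the V100).

import torch

# Print the installed build versus the GPU actually present.
print("torch:", torch.__version__, "built with CUDA", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0), "capability:", torch.cuda.get_device_capability(0))
print("architectures in this torch build:", torch.cuda.get_arch_list())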
Traceback (most recent call last):
File "run_net.py", line 37, in <module>
main()
File "run_net.py", line 32, in main
inference_single(cfg.cfg_dict)
File "/root/videocomposer/tools/videocomposer/inference_single.py", line 353, in inference_single
mp.spawn(worker, nprocs=cfg.gpus_per_machine, args=(cfg, ))
File "/opt/miniconda3/envs/Paint-by-Example/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/miniconda3/envs/Paint-by-Example/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/opt/miniconda3/envs/Paint-by-Example/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGSEGV
free(): invalid pointer
Aborted (core dumped)
Python==3.8.5
torch=='1.13.1+cu117'
transformers==4.19.2
xformers==0.0.16
No FlashAttention
Hi,
awesome work! I wonder if you're also going to release the training code. I would like to fine-tune the model on my own data. Will that be possible in the future?
Hello, do all control conditions share the same STC encoder model weight?
Hi Shiwei, @Steven-SWZhang
Thank you for publicly making available the great work you have done.
I have been trying to reproduce the results for the task "video inpainting using a sequence of masks". More specifically, I have a video of 10 frames and 10 masks corresponding to those frames. I would like to feed the video, the sequence of masks, and a text prompt to the model, and get a temporally consistent output video that adheres to both the mask sequence and the text prompt.
However, I could not find any argument for an input mask. Going through the code, as far as I understand, the code itself generates a random mask over the input video. The code below (inference_single.py) shows what I mean:
in which the function "make_masked_images" is:
As far as I can tell, the "mask" variable in line 564 of the first snapshot is initialized from the "batch" variable (which comes from the dataloader) in the picture below:
When I went through "dataset.py", I found that the mask is generated randomly, as follows:
So my understanding is that the code only conditions the model on this randomly generated mask. If that is correct, does it mean we cannot feed an external sequence of masks to the model? If I am mistaken, I would appreciate an explanation of how to feed a mask sequence to the model, as I could not find anything in the code.
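For context, here is a minimal sketch of what I would like to do: load my own mask sequence and use it in place of the randomly generated one. The directory layout, the binarisation threshold, and the [1, 1, F, H, W] shape are my own assumptions, not something I found in the repo.

import os
import torch
import torchvision.transforms.functional as TF
from PIL import Image

def load_mask_sequence(mask_dir, num_frames=10, size=(256, 256)):
    # Load an external mask sequence as a [1, 1, F, H, W] tensor with values in {0, 1}.
    # Assumes mask_dir contains one grayscale PNG per frame, sorted by file name.
    files = sorted(os.listdir(mask_dir))[:num_frames]
    masks = []
    for name in files:
        m = Image.open(os.path.join(mask_dir, name)).convert("L")
        m = TF.to_tensor(TF.resize(m, size))       # [1, H, W], values in [0, 1]
        masks.append((m > 0.5).float())            # binarise
    return torch.stack(masks, dim=1).unsqueeze(0)  # [1, 1, F, H, W]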
Thank you in advance for putting time into this case.
Kind Regards,
Amir
Can you tell me how to set up this function?
Hello, great work. May I ask when the running script for "text, style, hand-crafted motions and hand-crafted sketch" will be released?
Thanks for your work!
When I run !python run_net.py
--cfg configs/exp01_vidcomposer_full.yaml
--input_video "demo_video/blackswan.mp4"
--input_text_desc "A black swan swam in the water"
--seed 9999
[2023-08-31 10:58:03,125] INFO: Loading ViT-H-14 model config.
[2023-08-31 10:58:15,720] WARNING: Pretrained weights (/content/vc/model_weights/open_clip_pytorch_model.bin) not found for model ViT-H-14.
Traceback (most recent call last):
File "/content/vc/run_net.py", line 36, in
main()
File "/content/vc/run_net.py", line 28, in main
inference_multi(cfg.cfg_dict)
File "/content/vc/tools/videocomposer/inference_multi.py", line 345, in inference_multi
worker(0, cfg)
File "/content/vc/tools/videocomposer/inference_multi.py", line 421, in worker
clip_encoder = FrozenOpenCLIPEmbedder(layer='penultimate',pretrained = DOWNLOAD_TO_CACHE(cfg.clip_checkpoint))
File "/content/vc/tools/videocomposer/inference_multi.py", line 108, in init
model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=pretrained)
File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 151, in create_model_and_transforms
model = create_model(
File "/usr/local/lib/python3.10/dist-packages/open_clip/factory.py", line 122, in create_model
raise RuntimeError(f'Pretrained weights ({pretrained}) not found for model {model_name}.')
RuntimeError: Pretrained weights (/content/vc/model_weights/open_clip_pytorch_model.bin) not found for model ViT-H-14.
Hello, I cannot install motion-vector-extractor either via pip or from source, since my machine's architecture seems to have some incompatibility issues with ffmpeg, e.g., libavformat.
Is there any way to replace motion-vector-extractor, perhaps simply with OpenCV's default optical flow extractor (a rough sketch of what I mean follows below)?
By the way, great work!
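For reference, this is the kind of OpenCV-based substitute I have in mind, a rough sketch using Farneback dense optical flow; whether the model actually accepts flow in this range and layout in place of the mpeg4 motion vectors is exactly my question.

import cv2
import numpy as np

def optical_flow_from_frames(frames):
    # frames: list of HxWx3 uint8 RGB arrays.
    # Returns one HxWx2 float32 flow field (dx, dy) per consecutive frame pair,
    # computed with Farneback dense optical flow as a stand-in for mpeg4 motion vectors.
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    for frame in frames[1:]:
        cur = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow.astype(np.float32))
        prev = cur
    return flows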
My GPU is a V100 32G.
When using the pre-trained model for inference, the process is often killed, but sometimes it runs successfully. Has anyone encountered this situation?
[2023-08-11 13:09:34,289] INFO: Loading ViT-H-14 model config.
[2023-08-11 13:09:47,112] INFO: Loading pretrained ViT-H-14 weights (/root/autodl-tmp/videocomposer/model_weights/open_clip_pytorch_model.bin).
[2023-08-11 13:09:52,142] INFO: Loading ViT-H-14 model config.
[2023-08-11 13:10:04,689] INFO: Loading pretrained ViT-H-14 weights (/root/autodl-tmp/videocomposer/model_weights/open_clip_pytorch_model.bin).
run_bash.sh: line 49: 2445 Killed python run_net.py --cfg configs/exp04_sketch2video_wo_style.yaml --seed 144 --sketch_path "demo_video/src_single_sketch.png" --input_text_desc "A little bird is standing on a branch"
Hi guys, thanks for your wonderful work. I'm very curious about when you will open-source your training script.
Environment setup
D:\videocomposer>conda env create -f environment.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
Is there any plan to add keypoint or skeleton sequence based human video generation?
So it seems outputs are always 256x256. Is this a limitation of the model, or just a choice based on VRAM requirements? Would it be easy to enable other resolutions?
Hello, your work looks very interesting. What computing resources does training require? For example, what class of GPU, how many cards, and how long does training take?
Hi, thanks for sharing the great work. I wonder if you could provide the inference code for image-to-video generation? Thanks!
Thank you for presenting such an exciting work. Congratulations!
I have a question regarding Table A3. Could you please provide more details on how the FVD is calculated? As this metric can be very sensitive to certain settings, I would like to know more about the resolution (256?), number of frames, and how you processed the captions. Additionally, I noticed that there are multiple correspondences to the same video. Could you please explain how you handled this?
Thank you!
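To be concrete about which knobs I mean: I compute the distance itself with the standard Fréchet formula below (my own reference code, not yours); my questions are about the feature extractor, resolution, frame count, and caption handling that feed into it.

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    # feats_*: [N, D] arrays of per-video features (typically I3D embeddings for FVD).
    # Fréchet distance between Gaussians fitted to the two feature sets.
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))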
Hello, great work. May I ask when will the running script for “sequence-to-video generation” such as "sketch sequence-to-video generation" and "Compositional depth sequence-to-video generation" be released?
Thanks for your work!
It feels like your network could serve as a post-processing refinement module for many video synthesis pipelines.
Is there any documentation on the principle and usage details for that?
Thanks.
Also, do you have plans to add an outpainting feature later?