Hi there, I'm Yuyang Zhao 👋
Contact Me:
✉️ Email: [email protected]
🔗 Website: https://yuyangzhao.com
🔎 Google Scholar: https://scholar.google.com/citations?user=u5M6XPAAAAAJ
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts
Home Page: https://make-a-protagonist.github.io/
License: Apache License 2.0
The generated video is not temporally consistent; it has flickering colours and a changing background. I need suggestions on how to ensure temporal consistency and style consistency across all frames of the video...
Hi @HeliosZhao
I see that using 28 frames takes 7-8 secs/iteration for training, but increasing it to 29 frames takes 44 secs/iteration, which is strange. May I know what could have caused this? I see 100% GPU utilisation in both cases, and there is still memory left. I'm training on an A100 at 768*768 resolution; I only changed 'n_sample_frames' in the config.
With 28 frames:
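Not an answer from the authors, but one way to narrow this down is to time the bare training step at both frame counts with CUDA synchronisation; if the cliff reproduces there, it points to an attention-kernel or allocator threshold at the larger size rather than data loading. A minimal probe, where `step_fn` is a hypothetical wrapper around one training step (not a function in this repo):

```python
import time
import torch

def time_step(step_fn, n_warmup=3, n_iters=5):
    """Average wall-clock time of one training step, with CUDA sync."""
    for _ in range(n_warmup):
        step_fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        step_fn()
    torch.cuda.synchronize()
    return (time.time() - start) / n_iters

# Compare time_step(step_28) vs time_step(step_29); also watch
# torch.cuda.memory_reserved() for a jump between the two sizes, which
# would suggest allocator thrashing or a kernel falling off a fast path.
```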
When I try the test with `python experts/grounded_sam_inference.py -d data/ikun/images/0000.jpg -t "a man with a basketball"`, I get:

```
/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/GroundedSAM/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py:31: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
  warnings.warn("Failed to load custom C++ ops. Running on CPU mode Only!")
/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
final text_encoder_type: bert-base-uncased
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
`device` argument is deprecated and will be removed in v5 of Transformers.
```

As the title describes, how do I derive the math formula of DDIM inversion with respect to v-parameterization in the original paper?
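Not an answer from the authors, but one consistent derivation, using the notation of Salimans & Ho's v-parameterization ($\alpha_t = \sqrt{\bar\alpha_t}$, $\sigma_t = \sqrt{1-\bar\alpha_t}$, so $\alpha_t^2 + \sigma_t^2 = 1$); the symbols below follow that paper, not this repo:

```latex
% With z_t = \alpha_t x_0 + \sigma_t \epsilon and the network predicting
%   v_\theta(z_t, t) \approx \alpha_t \epsilon - \sigma_t x_0,
% solving the 2x2 linear system for x_0 and \epsilon gives
\begin{align}
  \hat{x}_0 &= \alpha_t z_t - \sigma_t v_\theta(z_t, t), &
  \hat{\epsilon} &= \sigma_t z_t + \alpha_t v_\theta(z_t, t).
\end{align}
% The deterministic DDIM step (\eta = 0) from t to t-1 is
\begin{equation}
  z_{t-1} = \alpha_{t-1}\,\hat{x}_0 + \sigma_{t-1}\,\hat{\epsilon},
\end{equation}
% and DDIM inversion runs the same update with time reversed (t \to t+1):
\begin{equation}
  z_{t+1} = \alpha_{t+1}\,\hat{x}_0 + \sigma_{t+1}\,\hat{\epsilon}.
\end{equation}
```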
This repository contains code from XMem, which is licensed under GPL-3.0.
Does it constitute a covered work of XMem? If so, I think this whole work can only be distributed under GPL-3.0 rather than Apache-2.0, unless there is additional permission from XMem.
The official BLIP-2 extracts descriptions from images, but your paper mentions description extraction for videos. Can you explain how to do that exactly?
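Not the authors' confirmed pipeline, but since BLIP-2 is image-only, one plausible approach is to caption a few representative frames and keep the most frequent caption. A sketch with the Hugging Face `transformers` BLIP-2 classes; the frame paths and checkpoint choice are assumptions, and a CUDA device is assumed:

```python
from collections import Counter

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_frame(path: str) -> str:
    """Caption a single video frame with image-only BLIP-2."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True).strip()

# Caption a few evenly spaced frames and take the majority answer
# (hypothetical frame paths).
frames = ["data/ikun/images/0000.jpg", "data/ikun/images/0016.jpg"]
captions = [caption_frame(f) for f in frames]
print(Counter(captions).most_common(1)[0][0])
```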
Hi,
Thanks for this repo! I want to ask whether you've encountered the following error before. Thank you!
```
Traceback (most recent call last): | 0/50 [00:00<?, ?it/s]
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/pdb.py", line 1726, in main
    pdb._runscript(mainpyfile)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/pdb.py", line 1586, in _runscript
    self.run(statement)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/bdb.py", line 580, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/data/home/xindiw/Make-A-Protagonist/train.py", line 1, in <module>
    import argparse
  File "/data/home/xindiw/Make-A-Protagonist/train.py", line 470, in main
    sample = validation_pipeline(image=_ref_image, prompt=prompt, control_image=conditions, generator=generator, latents=ddim_inv_latent, image_embeds=image_embed, masks=masks, prior_latents=prior_embeds, prior_denoised_embeds=prior_denoised_embeds, **validation_data).videos
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/home/xindiw/Make-A-Protagonist/makeaprotagonist/pipelines/pipeline_stable_unclip_controlavideo.py", line 1433, in __call__
    down_block_res_samples = [rearrange(sample, "(b f) c h w -> b c f h w", f=video_length) for sample in down_block_res_samples]
  File "/data/home/xindiw/Make-A-Protagonist/makeaprotagonist/pipelines/pipeline_stable_unclip_controlavideo.py", line 1433, in <listcomp>
    down_block_res_samples = [rearrange(sample, "(b f) c h w -> b c f h w", f=video_length) for sample in down_block_res_samples]
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/einops.py", line 483, in rearrange
    return reduce(cast(Tensor, tensor), pattern, reduction='rearrange', **axes_lengths)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/einops.py", line 412, in reduce
    return _apply_recipe(recipe, tensor, reduction_type=reduction)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/einops.py", line 233, in _apply_recipe
    backend = get_backend(tensor)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/_backends.py", line 52, in get_backend
    raise RuntimeError('Tensor type unknown to einops {}'.format(type(tensor)))
RuntimeError: Tensor type unknown to einops <class 'str'>
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/_backends.py(52)get_backend()
-> raise RuntimeError('Tensor type unknown to einops {}'.format(type(tensor)))
```
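For what it's worth, the error means one element of `down_block_res_samples` is a `str` rather than a tensor. A frequent cause is a diffusers version mismatch, where a forward call returns a dict-like output object and iterating over it yields field names instead of tensors. A minimal probe, inserted locally just before the failing `rearrange` (paths and line numbers from the traceback above; this is a debugging sketch, not a fix from the authors):

```python
# Hypothetical probe for pipeline_stable_unclip_controlavideo.py:
# print what the upstream call actually returned before rearranging.
for i, sample in enumerate(down_block_res_samples):
    print(i, type(sample), getattr(sample, "shape", None))
# If an element prints as <class 'str'>, the upstream call likely returned a
# dict-like output object and the code is iterating over its keys; unpack the
# tensors explicitly, or pin diffusers to the version the repo specifies.
```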
Should we fine-tune the text-to-image diffusion models with visual and textual clues for each video?
Hi, your Hugging Face demo is displaying a "Runtime error".
What happens if I don't use the masks and depth during inference? What is the impact of masks/depth on the generated video?
Thanks for the code. I have edited a few videos and noticed that the frame-to-frame consistency is not as smooth as in the source video, and the framing is not stable (as if the camera angle is shaking). Do you have any ideas to improve this?
The project demo page https://60a373ddc34b235c06.gradio.live/ shows 'No interface is running right now'.
Training parameters: image height/width 480, frames = 39
Inference: source_protagonist: false & source_background: false
I get a module import error when I attempt to run `python experts/grounded_sam_inference.py -d data/car-turn/images/0000.jpg -t suzuki jimny`.
I see a few issues when running the code on multiple GPUs. Is multi-GPU supported?
Hi @HeliosZhao,
How can I use motion vectors (extracted with FFmpeg) as guidance for generation during inference, the way the segmentation map is used in the current pipeline? And will it affect video generation in a positive way?
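As far as I can tell the repo doesn't expose a motion-vector condition, so this is only a sketch of how one might experiment: rasterise per-frame motion into an image-like map and treat it as an extra ControlNet-style condition (which would require a ControlNet trained on that modality, not shipped here). Below, dense optical flow from OpenCV stands in for FFmpeg codec motion vectors:

```python
import cv2
import numpy as np

# Hypothetical preprocessing sketch: turn dense optical flow (a stand-in for
# FFmpeg codec motion vectors) into an HSV-coded image that could serve as a
# ControlNet-style condition map.
def flow_to_condition(prev_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)  # direction -> hue
    hsv[..., 1] = 255                                       # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255,
                                cv2.NORM_MINMAX).astype(np.uint8)  # speed -> value
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```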
Will there be a Windows version of this? I see that some packages the project depends on, such as triton, only support Linux.
Hi @HeliosZhao
After going through the complete code and experiments, I see the following issues.
Plans to improve:
Can you help answer these? Thanks in advance.
```
(Text-video) E:\git\Make-A-Protagonist>python experts/xmem_inference.py -d data/bird-forest/images -v bird-forest --mask_dir bird.mask
Output path not provided. By default saving to the mask dir
save_all is forced to be true in generic evaluation mode.
Hyperparameters read from the model weights: C^k=64, C^v=512, C^h=64
Single object mode: False
Traceback (most recent call last):
  File "E:\git\Make-A-Protagonist\experts\xmem_inference.py", line 103, in <module>
    for vid_reader in progressbar(meta_loader, max_value=len(meta_dataset), redirect_stdout=True):
TypeError: 'module' object is not callable

(Text-video) E:\git\Make-A-Protagonist>pip list | grep progressbar
progressbar 2.5
```
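Not a confirmed fix, but the failing call, `progressbar(meta_loader, max_value=..., redirect_stdout=True)`, matches the API of the `progressbar2` package, where `progressbar` is a callable shortcut; in the legacy `progressbar` 2.5 package that name is just the module, which yields exactly this `TypeError`. Swapping packages may resolve it:

```python
# Assumption: xmem_inference.py targets the progressbar2 API.
# In a shell:
#   pip uninstall progressbar
#   pip install progressbar2
from progressbar import progressbar  # callable in progressbar2; absent in progressbar 2.5

# progressbar() wraps any iterable and renders a progress bar, and accepts
# the same kwargs seen in the traceback.
for _ in progressbar(range(100), max_value=100, redirect_stdout=True):
    pass
```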
@HeliosZhao Why is the reference image used in training, and does it make any significant difference if I use a different masked reference image during inference? If it doesn't make a difference, then what is the use of the reference image in training?