Hi there, I'm Yuyang Zhao 👋
Contact Me:
✉️ Email: [email protected]
🔗 Website: https://yuyangzhao.com
🔎 Google Scholar: https://scholar.google.com/citations?user=u5M6XPAAAAAJ
Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts
Home Page: https://make-a-protagonist.github.io/
License: Apache License 2.0
The generated video is not temporally consistent; it has flickering colours and a changing background. I need suggestions on how to ensure temporal consistency and style consistency across all frames of the video...
Hi @HeliosZhao
I see that using 28 frames takes 7-8 secs/iteration for training, but increasing it to 29 frames takes 44 secs/iteration, which is strange. May I know what could have caused this? I see 100% GPU utilisation in both cases, and there is still memory left. I'm training on an A100 at 768*768 resolution; I only changed 'n_sample_frames' in the config.
With 28 frames:
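Not an answer from the authors, but one way to narrow this down is to time the bare training step at both frame counts with CUDA synchronisation; if the cliff reproduces there, it points to an attention-kernel or allocator threshold at the larger size rather than data loading. A minimal probe, where `step_fn` is a hypothetical wrapper around one training step (not a function in this repo):

```python
import time
import torch

def time_step(step_fn, n_warmup=3, n_iters=5):
    """Average wall-clock time of one training step, with CUDA sync."""
    for _ in range(n_warmup):
        step_fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        step_fn()
    torch.cuda.synchronize()
    return (time.time() - start) / n_iters

# Compare time_step(step_28) vs time_step(step_29); also watch
# torch.cuda.memory_reserved() for a jump between the two sizes, which
# would suggest allocator thrashing or a kernel falling off a fast path.
```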
When I try the test with `python experts/grounded_sam_inference.py -d data/ikun/images/0000.jpg -t "a man with a basketball"`, I get:

```
/data2/home/srchen/project/github/in_work/Make-A-Protagonist/experts/GroundedSAM/GroundingDINO/groundingdino/models/GroundingDINO/ms_deform_attn.py:31: UserWarning: Failed to load custom C++ ops. Running on CPU mode Only!
  warnings.warn("Failed to load custom C++ ops. Running on CPU mode Only!")
/root/miniconda3/envs/make_a_ptotagonist/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
final text_encoder_type: bert-base-uncased
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
`device` argument is deprecated and will be removed in v5 of Transformers.
```

As the title describes, how do I derive the math formula of DDIM inversion with respect to v-parameterization in the original paper?
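Not an answer from the authors, but one consistent derivation, using the notation of Salimans & Ho's v-parameterization ($\alpha_t = \sqrt{\bar\alpha_t}$, $\sigma_t = \sqrt{1-\bar\alpha_t}$, so $\alpha_t^2 + \sigma_t^2 = 1$); the symbols below follow that paper, not this repo:

```latex
% With z_t = \alpha_t x_0 + \sigma_t \epsilon and the network predicting
%   v_\theta(z_t, t) \approx \alpha_t \epsilon - \sigma_t x_0,
% solving the 2x2 linear system for x_0 and \epsilon gives
\begin{align}
  \hat{x}_0 &= \alpha_t z_t - \sigma_t v_\theta(z_t, t), &
  \hat{\epsilon} &= \sigma_t z_t + \alpha_t v_\theta(z_t, t).
\end{align}
% The deterministic DDIM step (\eta = 0) from t to t-1 is
\begin{equation}
  z_{t-1} = \alpha_{t-1}\,\hat{x}_0 + \sigma_{t-1}\,\hat{\epsilon},
\end{equation}
% and DDIM inversion runs the same update with time reversed (t \to t+1):
\begin{equation}
  z_{t+1} = \alpha_{t+1}\,\hat{x}_0 + \sigma_{t+1}\,\hat{\epsilon}.
\end{equation}
```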
This repository contains code from XMem, which is licensed under GPL-3.0.
Does it constitute a covered work of XMem? If so, I think this whole work can only be distributed under GPL-3.0 rather than Apache-2.0, unless there is additional permission from XMem.
The official BLIP-2 extracts descriptions from images, but your paper mentions description extraction for videos. Can you explain how to do that exactly?
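Not the authors' confirmed pipeline, but since BLIP-2 is image-only, one plausible approach is to caption a few representative frames and keep the most frequent caption. A sketch with the Hugging Face `transformers` BLIP-2 classes; the frame paths and checkpoint choice are assumptions, and a CUDA device is assumed:

```python
from collections import Counter

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_frame(path: str) -> str:
    """Caption a single video frame with image-only BLIP-2."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True).strip()

# Caption a few evenly spaced frames and take the majority answer
# (hypothetical frame paths).
frames = ["data/ikun/images/0000.jpg", "data/ikun/images/0016.jpg"]
captions = [caption_frame(f) for f in frames]
print(Counter(captions).most_common(1)[0][0])
```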
Hi,
Thanks for this repo! I want to ask whether you've encountered the following error before. Thank you!
```
Traceback (most recent call last): | 0/50 [00:00<?, ?it/s]
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/pdb.py", line 1726, in main
    pdb._runscript(mainpyfile)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/pdb.py", line 1586, in _runscript
    self.run(statement)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/bdb.py", line 580, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/data/home/xindiw/Make-A-Protagonist/train.py", line 1, in <module>
    import argparse
  File "/data/home/xindiw/Make-A-Protagonist/train.py", line 470, in main
    sample = validation_pipeline(image=_ref_image, prompt=prompt, control_image=conditions, generator=generator, latents=ddim_inv_latent, image_embeds=image_embed, masks=masks, prior_latents=prior_embeds, prior_denoised_embeds=prior_denoised_embeds, **validation_data).videos
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/home/xindiw/Make-A-Protagonist/makeaprotagonist/pipelines/pipeline_stable_unclip_controlavideo.py", line 1433, in __call__
    down_block_res_samples = [rearrange(sample, "(b f) c h w -> b c f h w", f=video_length) for sample in down_block_res_samples]
  File "/data/home/xindiw/Make-A-Protagonist/makeaprotagonist/pipelines/pipeline_stable_unclip_controlavideo.py", line 1433, in <listcomp>
    down_block_res_samples = [rearrange(sample, "(b f) c h w -> b c f h w", f=video_length) for sample in down_block_res_samples]
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/einops.py", line 483, in rearrange
    return reduce(cast(Tensor, tensor), pattern, reduction='rearrange', **axes_lengths)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/einops.py", line 412, in reduce
    return _apply_recipe(recipe, tensor, reduction_type=reduction)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/einops.py", line 233, in _apply_recipe
    backend = get_backend(tensor)
  File "/data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/_backends.py", line 52, in get_backend
    raise RuntimeError('Tensor type unknown to einops {}'.format(type(tensor)))
RuntimeError: Tensor type unknown to einops <class 'str'>
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /data/home/xindiw/miniconda3/envs/tune/lib/python3.9/site-packages/einops/_backends.py(52)get_backend()
-> raise RuntimeError('Tensor type unknown to einops {}'.format(type(tensor)))
```
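For what it's worth, the error means one element of `down_block_res_samples` is a `str` rather than a tensor. A frequent cause is a diffusers version mismatch, where a forward call returns a dict-like output object and iterating over it yields field names instead of tensors. A minimal probe, inserted locally just before the failing `rearrange` (paths and line numbers from the traceback above; this is a debugging sketch, not a fix from the authors):

```python
# Hypothetical probe for pipeline_stable_unclip_controlavideo.py:
# print what the upstream call actually returned before rearranging.
for i, sample in enumerate(down_block_res_samples):
    print(i, type(sample), getattr(sample, "shape", None))
# If an element prints as <class 'str'>, the upstream call likely returned a
# dict-like output object and the code is iterating over its keys; unpack the
# tensors explicitly, or pin diffusers to the version the repo specifies.
```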
Should we fine-tune the text-to-image diffusion models with visual and textual clues for each video?
Hi, your Hugging Face demo is displaying a "Runtime error".
What happens if I don't use the masks and depth during inference? What is the impact of masks/depth on the generated video?
Thanks for the code. I have edited a few videos and noticed that the frame-to-frame consistency is not as smooth as in the source video, and the framing is not stable (as if the camera angle is shaking). Do you have any ideas to improve this?
The project demo page https://60a373ddc34b235c06.gradio.live/ shows 'No interface is running right now'.
Training parameters: image height/width 480, frames = 39
Inference: source_protagonist: false & source_background: false
I get a module import error when I attempt to run `python experts/grounded_sam_inference.py -d data/car-turn/images/0000.jpg -t suzuki jimny`.
I see a few issues when running the code on multiple GPUs. Is multi-GPU supported?
Hi @HeliosZhao,
How can I use motion vectors (extracted with FFmpeg) as guidance for generation during inference, the way the segmentation map is used in the current pipeline? And will it affect video generation in a positive way?
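As far as I can tell the repo doesn't expose a motion-vector condition, so this is only a sketch of how one might experiment: rasterise per-frame motion into an image-like map and treat it as an extra ControlNet-style condition (which would require a ControlNet trained on that modality, not shipped here). Below, dense optical flow from OpenCV stands in for FFmpeg codec motion vectors:

```python
import cv2
import numpy as np

# Hypothetical preprocessing sketch: turn dense optical flow (a stand-in for
# FFmpeg codec motion vectors) into an HSV-coded image that could serve as a
# ControlNet-style condition map.
def flow_to_condition(prev_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)  # direction -> hue
    hsv[..., 1] = 255                                       # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255,
                                cv2.NORM_MINMAX).astype(np.uint8)  # speed -> value
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```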
Will there be a Windows version of this? I see that some packages the project depends on, such as triton, only support Linux.
Hi @HeliosZhao
After going through the complete code and experiments, I see the following issues.
Plans to improve:
Can you help answer these? Thanks in advance.
```
(Text-video) E:\git\Make-A-Protagonist>python experts/xmem_inference.py -d data/bird-forest/images -v bird-forest --mask_dir bird.mask
Output path not provided. By default saving to the mask dir
save_all is forced to be true in generic evaluation mode.
Hyperparameters read from the model weights: C^k=64, C^v=512, C^h=64
Single object mode: False
Traceback (most recent call last):
  File "E:\git\Make-A-Protagonist\experts\xmem_inference.py", line 103, in <module>
    for vid_reader in progressbar(meta_loader, max_value=len(meta_dataset), redirect_stdout=True):
TypeError: 'module' object is not callable

(Text-video) E:\git\Make-A-Protagonist>pip list | grep progressbar
progressbar 2.5
```
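Not a confirmed fix, but the failing call, `progressbar(meta_loader, max_value=..., redirect_stdout=True)`, matches the API of the `progressbar2` package, where `progressbar` is a callable shortcut; in the legacy `progressbar` 2.5 package that name is just the module, which yields exactly this `TypeError`. Swapping packages may resolve it:

```python
# Assumption: xmem_inference.py targets the progressbar2 API.
# In a shell:
#   pip uninstall progressbar
#   pip install progressbar2
from progressbar import progressbar  # callable in progressbar2; absent in progressbar 2.5

# progressbar() wraps any iterable and renders a progress bar, and accepts
# the same kwargs seen in the traceback.
for _ in progressbar(range(100), max_value=100, redirect_stdout=True):
    pass
```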
@HeliosZhao Why is the reference image used in training, and does it make any significant difference if I use a different masked reference image during inference? If it doesn't make a difference, then what is the use of the reference image in training?