Comments (3)
Incorporating elements similar to Sora into this architecture should be feasible: https://openai.com/research/video-generation-models-as-world-simulators. This would involve adding a time dimension to the patches. Incorporating text prompts would probably also be a big step.
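As a rough illustration of what adding a time dimension to the patches does to the sequence length, here is a minimal sketch. The function name and the video shapes are hypothetical; only the image case (a 32x32 latent cut into 2x2 patches, as for DiT-XL/2 at 256x256) comes from DiT itself.

```python
def spacetime_token_count(frames, height, width, patch_t, patch_h, patch_w):
    """Number of tokens when a (frames, height, width) latent is cut into
    non-overlapping patches of size (patch_t, patch_h, patch_w)."""
    for size, patch in ((frames, patch_t), (height, patch_h), (width, patch_w)):
        if size % patch != 0:
            raise ValueError(f"{size} is not divisible by patch size {patch}")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


# Image case: one 32x32 latent with 2x2 patches.
image_tokens = spacetime_token_count(1, 32, 32, 1, 2, 2)      # 256

# Hypothetical video case: 16 latent frames, each patched like an image.
video_tokens = spacetime_token_count(16, 32, 32, 1, 2, 2)     # 4096

# Hypothetical tubelet case: each patch also spans 2 frames.
tubelet_tokens = spacetime_token_count(16, 32, 32, 2, 2, 2)   # 2048
```

The token count grows linearly with the number of frames, so the choice of temporal patch size is one of the main levers on sequence length.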
I'm currently working on implementing these ideas in my free time and would welcome collaboration. If you're interested in building an open-source Sora with me, please join the discussion on Discord: https://discord.gg/2WZMXfnq
from dit.
Hi, the link has expired. Can you share the Discord invite again?
Yes, I'd be interested in collaborating too. I've already set up a ViViT (Video Vision Transformer) architecture using this DiT as a reference.
The Sora report also references the ViViT paper.
To incorporate text, you would add a cross-attention layer to both the spatial and the temporal DiT blocks.
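For anyone unfamiliar with cross-attention, here is a minimal sketch of the core operation in plain Python. It is a simplification, not how DiT or Sora implement it: it uses a single head, skips the learned query/key/value projections, and the token values are made up for illustration. Queries come from the video tokens, keys and values from the text tokens.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention, no learned projections.

    queries: list of video-token vectors.
    keys, values: lists of text-token vectors (same length as each other).
    Returns one output vector per query: a weighted average of `values`.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this video token to every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        # Numerically stable softmax over the text tokens.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Mix the text-token values according to the attention weights.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


video_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hypothetical patch tokens
text_tokens = [[1.0, 0.0], [0.0, 1.0]]               # 2 hypothetical prompt tokens
conditioned = cross_attention(video_tokens, text_tokens, text_tokens)
```

In a real block the output would be added back to the video tokens through a residual connection, once after the spatial attention and once after the temporal attention.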
ViViT uses factorized attention (a spatial block followed by a temporal block) together with 3D convolutional embeddings. That is pretty standard for ViViT, and I don't believe Sora is any different.
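To make the factorization concrete, here is a small sketch of which tokens attend to each other in each block, plus a count of query-key pairs per layer. The helper names and the 16-frame, 256-tokens-per-frame shape are assumptions for illustration, and the count ignores heads, projections, and MLPs.

```python
def spatial_groups(frames, tokens_per_frame):
    """Spatial block: tokens attend only within their own frame."""
    return [[f * tokens_per_frame + n for n in range(tokens_per_frame)]
            for f in range(frames)]


def temporal_groups(frames, tokens_per_frame):
    """Temporal block: tokens attend only to the same position in other frames."""
    return [[f * tokens_per_frame + n for f in range(frames)]
            for n in range(tokens_per_frame)]


def attention_pairs(frames, tokens_per_frame):
    """Query-key pairs per layer for joint vs. factorized attention."""
    total = frames * tokens_per_frame
    spatial = frames * tokens_per_frame ** 2    # one N x N map per frame
    temporal = tokens_per_frame * frames ** 2   # one T x T map per position
    return {
        "joint": total * total,
        "spatial": spatial,
        "temporal": temporal,
        "factorized": spatial + temporal,
    }


# Hypothetical clip: 16 frames, 256 tokens per frame.
costs = attention_pairs(frames=16, tokens_per_frame=256)
# costs["joint"] == 16_777_216, costs["factorized"] == 1_114_112
```

For this shape the factorized variant needs roughly 15x fewer query-key pairs than joint attention over all spacetime tokens, at the price of never attending across both space and time in a single step.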
They use CLIP for text embeddings and prompts.
@FeSens I recommend you look at this
https://arxiv.org/abs/2103.15691
and Google's JAX implementation.
However, the compute required for this, even in latent space, is pretty large. Also, I think you will need to reuse the positional encoding of a pre-trained ViT, because it's very hard to find annotated video data. So you will need to train on less data, starting from some pre-trained weights.