Würstchen is a new framework for training text-conditional models by moving the computationally expensive text-conditional stage into a highly compressed latent space. Common approaches make use of a single stage compression, while Würstchen introduces another Stage that introduces even more compression. In total we have Stage A & B that are responsible for compressing images and Stage C that learns the text-conditional part in the low dimensional latent space. With that Würstchen achieves a 42x compression factor, while still reconstructing images faithfully. This enables training of Stage C to be fast and computationally cheap. We refer to the paper for details.
You can use the model simply through the notebooks here. The Stage B notebook only for reconstruction and the Stage C notebook is for the text-conditional generation. You can also try the text-to-image generation on Google Colab.
Training Würstchen is considerably faster and cheaper than other text-to-image as it trains in a much smaller latent space of 12x12. We provide training scripts for both Stage B and Stage C.
Model | Download | Parameters | Conditioning |
---|---|---|---|
Würstchen v1 | Huggingface | 1B (Stage C) + 600M (Stage B) + 19M (Stage A) | CLIP-H-Text |
Special thanks to Stability AI for providing compute for our research.