Comments (5)
i'm not sure which model you're training, but it looks like you're running into the classic limitation of DDP (Distributed Data Parallel) training.
this style of multi-GPU training runs one instance of the trainer per GPU and loads a full copy of the model on each one. that means with 2x 16G GPUs you don't get a single 32G pool, just two separate 16G budgets.
what you're looking for to split a model across two GPUs is FSDP (Fully Sharded Data Parallel), which shards the parameters, gradients and optimizer states across the GPUs and has a high communication overhead between them. it benefits a lot more from NVLink, and it isn't supported in the Diffusers example trainers, or really in any publicly accessible diffusion training toolkit that i'm aware of.
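to put rough numbers on the DDP vs. sharded difference, here is a back-of-the-envelope sketch. it only counts model state (fp32 weights, gradients and the two Adam moments) and ignores activations, the VAE and the text encoder. the parameter count is an assumed approximate figure for the UNet, and the helper names are made up for illustration.

```python
# Rough per-GPU memory for model state only: weights + gradients + Adam moments.
# Activations, the VAE and the text encoder are deliberately left out.

GIB = 1024 ** 3


def model_state_bytes(num_params, bytes_per_weight=4, bytes_per_grad=4,
                      bytes_per_optimizer_state=8):
    """Bytes needed for fp32 weights, fp32 grads and two fp32 Adam moments."""
    per_param = bytes_per_weight + bytes_per_grad + bytes_per_optimizer_state
    return num_params * per_param


def per_gpu_bytes(num_params, num_gpus, sharded):
    """Replicated (DDP-style) keeps everything on every GPU; fully sharded splits it evenly."""
    total = model_state_bytes(num_params)
    return total / num_gpus if sharded else total


# ASSUMPTION: approximate UNet size, not a measured value.
ASSUMED_UNET_PARAMS = 870_000_000

ddp = per_gpu_bytes(ASSUMED_UNET_PARAMS, num_gpus=2, sharded=False)
fsdp = per_gpu_bytes(ASSUMED_UNET_PARAMS, num_gpus=2, sharded=True)

print(f"replicated (DDP): {ddp / GIB:.1f} GiB per GPU")
print(f"fully sharded   : {fsdp / GIB:.1f} GiB per GPU")
```

under these assumptions the replicated case already eats most of a 16G card before any activations are stored, and adding a second GPU does nothing to lower that per-GPU figure.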
from diffusers.
Hello, I am using Stable Diffusion 2.1 as the model. Is FSDP not supported for Stable Diffusion? Is there any alternative way to train the model?
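FSDP isn't something you get for free with pytorch models in general: PyTorch ships it as `torch.distributed.fsdp`, but the training script has to be written to wrap the model with it.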
you need GPUs with more VRAM, and in my experience GCP is one of the most expensive ways to get them.
We have to use GCP in the office as there's no access to physical GPUs. Even with `accelerate` or `--multi_gpu`, can't we run the PyTorch models on GCP?
what i meant is that a 16 GB GPU through GCP is not as cost-effective as other platforms like Vast or RunPod, where you can likely rent a single 48 GB GPU for less than a dual 16 GB instance on GCP.
you can possibly get away with low-rank (LoRA) training on the two 16 GB devices, but as they lack native bf16 support (iirc) they are limited in utility.
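as a rough illustration of why low-rank training is lighter: for a single linear layer, LoRA freezes the original weight and trains two small matrices instead. the layer size and rank below are hypothetical values picked for the arithmetic, not taken from any particular model.

```python
# Trainable-parameter count for one linear layer: full fine-tuning vs. LoRA.
# LoRA keeps the base weight W (d_out x d_in) frozen and trains
# A (rank x d_in) and B (d_out x rank), so the update is B @ A.


def full_params(d_in, d_out):
    """Parameters trained when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Parameters trained by LoRA: the two low-rank factors only."""
    return rank * (d_in + d_out)


# ASSUMPTION: hypothetical layer size and rank, for illustration only.
d_in = d_out = 1024
rank = 8

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tune: {full:,} trainable params")
print(f"LoRA (r={rank}) : {lora:,} trainable params")
print(f"trainable fraction: {lora / full:.2%}")
```

gradients and optimizer states are only kept for the trainable parameters, so for this example layer that part of the memory bill shrinks to about 1.6% of the full fine-tuning cost. the frozen base weights and the activations still have to fit on each GPU.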