mikubill / naifu
Train generative models with pytorch lightning
License: MIT License
The following error occurred after enabling cache_latents.
diffusers==0.17.0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:612: UserWarning: Checkpoint directory /home/ubuntu/checkpoint exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
Loading captions: 967it [00:00, 6004.40it/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Loading resolutions: 967it [00:09, 102.76it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using scaled LR: 2e-06
| Name | Type | Params
------------------------------------------------------
0 | unet | UNet2DConditionModel | 859 M
1 | vae | AutoencoderKL | 83.7 M
2 | text_encoder | CLIPTextModel | 123 M
------------------------------------------------------
859 M Trainable params
206 M Non-trainable params
1.1 B Total params
2,132.471 Total estimated model params size (MB)
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/fabric/utilities/data.py:63: UserWarning: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
rank_zero_warn(
Training: 0it [00:00, ?it/s]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ubuntu/naifu-diffusion/trainer.py:116 in <module> │
│ │
│ 113 │
│ 114 if __name__ == "__main__": │
│ 115 │ args = parse_args() │
│ ❱ 116 │ main(args) │
│ 117 │
│ │
│ /home/ubuntu/naifu-diffusion/trainer.py:112 in main │
│ │
│ 109 │ │
│ 110 │ config, callbacks = pl_compat_fix(config, callbacks) │
│ 111 │ trainer = pl.Trainer(logger=logger, callbacks=callbacks, strategy=strategy, plugins= │
│ ❱ 112 │ trainer.fit(model=model, ckpt_path=args.resume if args.resume else None) │
│ 113 │
│ 114 if __name__ == "__main__": │
│ 115 │ args = parse_args() │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:608 in fit │
│ │
│ 605 │ │ if not isinstance(model, pl.LightningModule): │
│ 606 │ │ │ raise TypeError(f"`Trainer.fit()` requires a `LightningModule`, got: {model. │
│ 607 │ │ self.strategy._lightning_module = model │
│ ❱ 608 │ │ call._call_and_handle_interrupt( │
│ 609 │ │ │ self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, │
│ 610 │ │ ) │
│ 611 │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:3 │
│ 8 in _call_and_handle_interrupt │
│ │
│ 35 │ │ if trainer.strategy.launcher is not None: │
│ 36 │ │ │ return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, │
│ 37 │ │ else: │
│ ❱ 38 │ │ │ return trainer_fn(*args, **kwargs) │
│ 39 │ │
│ 40 │ except _TunerExitException: │
│ 41 │ │ trainer._call_teardown_hook() │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:650 in _fit_impl │
│ │
│ 647 │ │ │ model_provided=True, │
│ 648 │ │ │ model_connected=self.lightning_module is not None, │
│ 649 │ │ ) │
│ ❱ 650 │ │ self._run(model, ckpt_path=self.ckpt_path) │
│ 651 │ │ │
│ 652 │ │ assert self.state.stopped │
│ 653 │ │ self.training = False │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:1103 in _run │
│ │
│ 1100 │ │ │
│ 1101 │ │ self._checkpoint_connector.resume_end() │
│ 1102 │ │ │
│ ❱ 1103 │ │ results = self._run_stage() │
│ 1104 │ │ │
│ 1105 │ │ log.detail(f"{self.__class__.__name__}: trainer tearing down") │
│ 1106 │ │ self._teardown() │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:1182 in _run_stage │
│ │
│ 1179 │ │ │ return self._run_evaluate() │
│ 1180 │ │ if self.predicting: │
│ 1181 │ │ │ return self._run_predict() │
│ ❱ 1182 │ │ self._run_train() │
│ 1183 │ │
│ 1184 │ def _pre_training_routine(self) -> None: │
│ 1185 │ │ # wait for all to join if on distributed │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:1205 in _run_train │
│ │
│ 1202 │ │ self.fit_loop.trainer = self │
│ 1203 │ │ │
│ 1204 │ │ with torch.autograd.set_detect_anomaly(self._detect_anomaly): │
│ ❱ 1205 │ │ │ self.fit_loop.run() │
│ 1206 │ │
│ 1207 │ def _run_evaluate(self) -> _EVALUATE_OUTPUT: │
│ 1208 │ │ assert self.evaluating │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/loops/loop.py:194 │
│ in run │
│ │
│ 191 │ │ │
│ 192 │ │ self.reset() │
│ 193 │ │ │
│ ❱ 194 │ │ self.on_run_start(*args, **kwargs) │
│ 195 │ │ │
│ 196 │ │ while not self.done: │
│ 197 │ │ │ try: │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py │
│ :218 in on_run_start │
│ │
│ 215 │ │ self._results.to(device=self.trainer.lightning_module.device) │
│ 216 │ │ │
│ 217 │ │ self.trainer._call_callback_hooks("on_train_start") │
│ ❱ 218 │ │ self.trainer._call_lightning_module_hook("on_train_start") │
│ 219 │ │ self.trainer._call_strategy_hook("on_train_start") │
│ 220 │ │
│ 221 │ def on_advance_start(self) -> None: │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:1347 in _call_lightning_module_hook │
│ │
│ 1344 │ │ pl_module._current_fx_name = hook_name │
│ 1345 │ │ │
│ 1346 │ │ with self.profiler.profile(f"[LightningModule]{pl_module.__class__.__name__}.{ho │
│ ❱ 1347 │ │ │ output = fn(*args, **kwargs) │
│ 1348 │ │ │
│ 1349 │ │ # restore current_fx when nested context │
│ 1350 │ │ pl_module._current_fx_name = prev_fx_name │
│ │
│ /home/ubuntu/naifu-diffusion/lib/model.py:288 in on_train_start │
│ │
│ 285 │ │ │ self.ema.to(self.device, dtype=self.unet.dtype) │
│ 286 │ │ │
│ 287 │ │ if self.use_latent_cache: │
│ ❱ 288 │ │ │ self.dataset.cache_latents(self.vae, self.data_sampler.buckets if self.confi │
│ 289 │ │
│ 290 │ def on_train_epoch_start(self) -> None: │
│ 291 │ │ if self.use_latent_cache: │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/torch/nn/modules/module.py:1614 in │
│ __getattr__ │
│ │
│ 1611 │ │ │ modules = self.__dict__['_modules'] │
│ 1612 │ │ │ if name in modules: │
│ 1613 │ │ │ │ return modules[name] │
│ ❱ 1614 │ │ raise AttributeError("'{}' object has no attribute '{}'".format( │
│ 1615 │ │ │ type(self).__name__, name)) │
│ 1616 │ │
│ 1617 │ def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'StableDiffusionModel' object has no attribute 'data_sampler'
python convert_to_sd.py --src pruned/dreamshaper_332BakedVaeClipFix.ckpt --dst dream_sim.ckpt
It just shows errors:
Traceback (most recent call last):
File "K:\3\convert_to_sd.py", line 345, in <module>
unet_state_dict = convert_unet_state_dict(unet_state_dict, is_v2)
File "K:\3\convert_to_sd.py", line 107, in convert_unet_state_dict
new_state_dict = {v: unet_state_dict[k] for k, v in mapping.items()}
File "K:\3\convert_to_sd.py", line 107, in <dictcomp>
new_state_dict = {v: unet_state_dict[k] for k, v in mapping.items()}
KeyError: 'time_embedding.linear_1.weight'
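The `KeyError` is raised by the dict comprehension on line 107 when a key named in `mapping` is absent from the loaded state dict. One possible cause, which is an assumption and not confirmed in this thread, is that the source checkpoint is not in the diffusers layout the script expects. A minimal sketch for listing the missing keys before converting; `find_missing_keys` is a hypothetical helper, not part of convert_to_sd.py, and the toy dictionaries only stand in for the real data:

```python
def find_missing_keys(mapping, state_dict):
    """Return the source keys named in `mapping` that `state_dict` lacks."""
    return sorted(k for k in mapping if k not in state_dict)


# Toy data standing in for the real mapping and checkpoint (illustrative only).
mapping = {
    "time_embedding.linear_1.weight": "time_embed.0.weight",
    "conv_in.weight": "input_blocks.0.0.weight",
}
state_dict = {"conv_in.weight": [0.0]}

missing = find_missing_keys(mapping, state_dict)
if missing:
    print(f"{len(missing)} key(s) missing, e.g. {missing[0]}")
```

Running a check like this on the real state dict would show whether only a few keys are absent or the whole naming scheme differs.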
Hello, I don't see a discussion panel, so I apologize for asking this on the issues tab.
I have 2 questions:
I successfully used your script to output a 2GB ckpt. It works well. If I use this model as a base to train in dreambooth, does the resulting model keep the benefits from your script?
How do I link the nsfw model to this script? I can upload my nai-full-pruned to the colab folders through my gdrive, but I am having a hard time running it through the script. Could you help? @Mikubill
When trying to continue training a model started with use_ema: True, the trainer will crash.
Error:
AttributeError: 'StableDiffusionModel' object has no attribute 'ema'
I get an error when running this. I tried three times and only wasted time.
859 M Trainable params
206 M Non-trainable params
1.1 B Total params
4,264.941 Total estimated model params size (MB)
Epoch 0: 100% 575/575 [31:47<00:00, 3.32s/it, loss=0.263]tcmalloc: large alloc 1134575616 bytes == 0x17a8de000 @ 0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 1418223616 bytes == 0x2be0a000 @ 0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 1772781568 bytes == 0x7ef66a558000 @ 0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 2215976960 bytes == 0x7ef5e6406000 @ 0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 2769977344 bytes == 0x7ef54125e000 @ 0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 3462471680 bytes == 0x7ef5e6406000 @ 0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 4328095744 bytes == 0x7ef43f2c6000 @ 0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 5410119680 bytes == 0x7ef54125e000 @ 0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
It seems that there is a RAM error when saving the checkpoint.
I am running with 'python trainer.py train_sd15.yaml --resume', but training still begins from 0.
Is there any way to continue training from the point where it stopped?
Thanks!
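A traceback elsewhere on this page shows trainer.py calling `trainer.fit(model=model, ckpt_path=args.resume if args.resume else None)`, which suggests that `--resume` is meant to carry a checkpoint path. The sketch below illustrates that reading. The parser definition is an assumption (the real trainer.py may declare its options differently), and `checkpoint/last.ckpt` is a hypothetical path:

```python
import argparse


def parse_args(argv=None):
    # Hypothetical parser: the real trainer.py may define its options differently.
    parser = argparse.ArgumentParser()
    parser.add_argument("config", type=str)
    parser.add_argument("--resume", type=str, default=None,
                        help="path to a checkpoint to resume from")
    return parser.parse_args(argv)


def resolve_ckpt_path(args):
    # Mirrors the expression in the traceback: an empty value means a fresh run.
    return args.resume if args.resume else None


args = parse_args(["train_sd15.yaml", "--resume", "checkpoint/last.ckpt"])
print(resolve_ckpt_path(args))
```

Under this assumption, a run only resumes when the value passed to `ckpt_path` is an actual checkpoint file.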
I wanted to give DPO training a try, but there seem to be multiple issues with the PairedDataset.
reso and interp are not assigned to self in __init__, but self.reso is used later in the file:
Line 54 in af5c34b
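The missing assignment described above can be illustrated with a small stand-in class. This is a sketch only: `PairedDatasetSketch`, its default values, and the `describe` method are hypothetical and are not the repository's code.

```python
class PairedDatasetSketch:
    """Illustrative stand-in, not the repository's PairedDataset."""

    def __init__(self, samples, reso=512, interp="bicubic"):
        self.samples = samples
        # Store constructor arguments so later methods can read
        # self.reso and self.interp without an AttributeError.
        self.reso = reso
        self.interp = interp

    def describe(self, idx):
        return f"{self.samples[idx]} -> {self.reso}px ({self.interp})"


ds = PairedDatasetSketch(["a.jpg"], reso=1024)
print(ds.describe(0))
```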
Getting this error when using the default train_dpo.yaml config:
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/workspace/naifu/data/paired_wds.py", line 63, in __getitem__
example = self.preprocess_train(example)
File "/workspace/naifu/data/paired_wds.py", line 33, in preprocess_train
images = [
File "/workspace/naifu/data/paired_wds.py", line 34, in <listcomp>
Image.open(io.BytesIO(im_bytes)).convert("RGB")
TypeError: a bytes-like object is required, not 'int'
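`io.BytesIO` raises this `TypeError` when it is handed an int. One way that can happen, offered here as an assumption and not verified against the dataset code, is iterating over a single `bytes` object (which yields ints) where a list of `bytes` payloads was expected. A minimal sketch; `as_bytes_list` is a hypothetical helper:

```python
import io


def as_bytes_list(value):
    """Wrap a single bytes payload in a list; pass lists/tuples through."""
    if isinstance(value, (bytes, bytearray)):
        return [bytes(value)]
    return list(value)


payload = b"\x89PNG"  # placeholder bytes, not a real image

# Iterating over bytes yields ints, which io.BytesIO rejects.
try:
    [io.BytesIO(b) for b in payload]
except TypeError as exc:
    print("TypeError:", exc)

# Normalizing first gives one buffer per image payload.
buffers = [io.BytesIO(b) for b in as_bytes_list(payload)]
print(len(buffers))
```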
I manually set the batch size in buckets.py because the config setting has no effect; it then crashes with:
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/bunny/miniconda3/envs/nai/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/bunny/miniconda3/envs/nai/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/media/bunny/D612FBE112FBC511/FinetuneV6/data/store.py", line 189, in collate_fn
z.append(torch.asarray([[self.tokenizer.bos_token_id] + x[:75] + [self.tokenizer.eos_token_id] for x in tokens]))
ValueError: expected sequence of length 61 at dim 1 (got 34)
It used to work before this commit cc5b063
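The `ValueError` says the rows handed to `torch.asarray` have different lengths (61 and 34), while a tensor needs rectangular input. A minimal sketch of padding token id lists to a common length first, written in plain Python so it runs without torch; `pad_token_ids` and the pad id `0` are hypothetical and are not the repository's collate logic:

```python
def pad_token_ids(sequences, pad_id, max_len=None):
    """Right-pad (and truncate) lists of token ids to one common length."""
    target = max_len if max_len is not None else max(len(s) for s in sequences)
    padded = []
    for seq in sequences:
        clipped = list(seq[:target])
        padded.append(clipped + [pad_id] * (target - len(clipped)))
    return padded


# Toy ids; 0 is a hypothetical pad id, not necessarily the tokenizer's.
batch = [[5, 6, 7], [8]]
padded = pad_token_ids(batch, pad_id=0)
print(padded)
```

Once every row has the same length, the nested list can be passed to a tensor constructor without the shape error.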
I tested with a modified Colab notebook based on your example. The notebook is here. It installs requirement_tpu and runs with the default TPU config.
Got error:
/content/naifu-diffusion
Downloading: "https://pub-2fdef7a2969f43289c42ac5ae3412fd4.r2.dev/animesfw.tgz" to /tmp/model
100% 3.58G/3.58G [02:16<00:00, 28.2MB/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Downloading: "https://pub-2fdef7a2969f43289c42ac5ae3412fd4.r2.dev/mmk.tgz" to /tmp/dataset-0
100% 52.7M/52.7M [00:02<00:00, 19.2MB/s]
Loading resolutions: 34it [00:00, 992.53it/s]
Loading captions: 68it [00:00, 26123.16it/s]
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:452: LightningDeprecationWarning: Setting `Trainer(tpu_cores=8)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='tpu', devices=8)` instead.
f"Setting `Trainer(tpu_cores={tpu_cores!r})` is deprecated in v1.7 and will be removed"
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:467: UserWarning: The flag `devices=-1` will be ignored, instead the device specific number 8 will be used
f"The flag `devices={devices}` will be ignored, "
/usr/local/lib/python3.7/dist-packages/lightning_lite/accelerators/cuda.py:159: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
GPU available: False, used: False
TPU available: True, using: 8 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 296) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
)(*cmd_args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
trainer.py FAILED
----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-11-19_18:29:43
host : 21021965512a
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 296)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 296
====================================================
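The report above shows `exitcode : -9` together with `Signal 9 (SIGKILL)`. A negative exit code means the process was ended by the signal with that number. On Linux a SIGKILL often comes from the kernel's out-of-memory killer, although this log alone does not confirm that. A small sketch of decoding such codes, assuming a POSIX system; `describe_exit_code` is a hypothetical helper:

```python
import signal


def describe_exit_code(code):
    """Translate a subprocess-style exit code into a readable string."""
    if code < 0:
        # Negative codes mean the process was terminated by signal number -code.
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


print(describe_exit_code(-9))
```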
I'm not sure if this feature is already planned but I thought I'd bring it up.
Since the default model downloaded is one that I assume was trained with a certain custom VAE, it might be good to have a way to use it for training as well.
Just thinking this might possibly fix issues I've been having in my tests with hand details being lost compared to the pre-finetuned model.
I started the training on one machine (works perfectly fine), but when I start the other machine, it gets stuck in downloading parameters.
Here's the output I get:
/home/ai/.virtualenvs/DMshareGenesis/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py:151: FutureWarning: The configuration file of the unet has set the default `sample_size` to smaller than 64 which seems highly unlikely. If your checkpoint is a fine-tuned version of any of the following:
- CompVis/stable-diffusion-v1-4
- CompVis/stable-diffusion-v1-3
- CompVis/stable-diffusion-v1-2
- CompVis/stable-diffusion-v1-1
- runwayml/stable-diffusion-v1-5
- runwayml/stable-diffusion-inpainting
you should change 'sample_size' to 64 in the configuration file. Please make sure to update the config accordingly as leaving `sample_size=32` in the config might lead to incorrect results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for the `unet/config.json` file
deprecate("sample_size<64", "1.0.0", deprecation_message, standard_warn=False)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Downloading: "https://pub-2fdef7a2969f43289c42ac5ae3412fd4.r2.dev/mmk.tgz" to /tmp/dataset-0
100%|██████████| 52.7M/52.7M [00:10<00:00, 5.39MB/s]
You are using a CUDA device ('NVIDIA GeForce RTX 3090 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Loading captions: 34it [00:00, 11746.82it/s]
Loading resolutions: 0it [00:00, ?it/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Loading resolutions: 34it [00:00, 1118.48it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using scaled LR: 5e-06
| Name | Type | Params
------------------------------------------------------
0 | text_encoder | CLIPTextModel | 123 M
1 | vae | AutoencoderKL | 83.7 M
2 | unet | UNet2DConditionModel | 859 M
------------------------------------------------------
859 M Trainable params
206 M Non-trainable params
1.1 B Total params
4,264.941 Total estimated model params size (MB)
Epoch 0: 0%| | 0/34 [00:00<?, ?it/s] Found per machine batch size automatically from the batch: 1
Jan 05 07:38:13.862 [INFO] Found no active peers: None
Jan 05 07:38:14.991 [INFO] Initializing optimizer manually since it has no tensors in state dict. To override this, provide initialize_optimizer=False
Jan 05 07:38:16.819 [INFO] Downloading parameters from peer QmVSfugP26MS1qUBWxXG1RKhZpGkodQuFoM4HFhqTc4mwj
This is potentially a Hivemind issue, but I wanted to check whether anyone has encountered this issue before.
Training with the train_sdxl configs outputs checkpoints that are the size of the SDXL UNet (10.3 GB) in diffusers format.
How do I use these files for anything?
you should change 'sample_size' to 64 in the configuration file. Please make sure to update the config accordingly as leaving `sample_size=32` in the config might lead to incorrect results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for the `unet/config.json` file
deprecate("sample_size<64", "1.0.0", deprecation_message, standard_warn=False)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /content/naifu-diffusion/lightning_logs
Loading images: 34it [00:00, 1461.33it/s]
BucketManager initialized with base_res = (512, 512), max_size = (768, 512)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using scaled LR: 7.071067811865476e-06
859 M Trainable params
206 M Non-trainable params
1.1 B Total params
4,264.941 Total estimated model params size (MB)
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=1` in the `DataLoader` to improve performance.
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/data.py:121: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
Epoch 0: 0% 0/17 [00:00<?, ?it/s] Traceback (most recent call last):
File "/content/naifu-diffusion/trainer.py", line 125, in <module>
main(args)
File "/content/naifu-diffusion/trainer.py", line 121, in main
trainer.fit(model=model, ckpt_path=args.resume if args.resume else None)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
results = self._run_stage()
Can anyone tell me what is going wrong? :(
pytorch==2.0.1+cu118, diffusers==0.17.0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
╭─────────────────── Traceback (most recent call last) ────────────────────╮
│ /home/ubuntu/naifu-diffusion/trainer.py:119 in <module> │
│ │
│ 116 │
│ 117 if __name__ == "__main__": │
│ 118 │ args = parse_args() │
│ ❱ 119 │ main(args) │
│ 120 │
│ │
│ /home/ubuntu/naifu-diffusion/trainer.py:115 in main │
│ │
│ 112 │ │
│ 113 │ config, callbacks = pl_compat_fix(config, callbacks) │
│ 114 │ trainer = pl.Trainer(logger=logger, callbacks=callbacks, strat │
│ ❱ 115 │ trainer.fit(model=model, ckpt_path=args.resume if args.resume │
│ 116 │
│ 117 if __name__ == "__main__": │
│ 118 │ args = parse_args() │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/trainer.py:608 in fit │
│ │
│ 605 │ │ if not isinstance(model, pl.LightningModule): │
│ 606 │ │ │ raise TypeError(f"`Trainer.fit()` requires a `Lightni │
│ 607 │ │ self.strategy._lightning_module = model │
│ ❱ 608 │ │ call._call_and_handle_interrupt( │
│ 609 │ │ │ self, self._fit_impl, model, train_dataloaders, val_d │
│ 610 │ │ ) │
│ 611 │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/call.py:38 in _call_and_handle_interrupt │
│ │
│ 35 │ │ if trainer.strategy.launcher is not None: │
│ 36 │ │ │ return trainer.strategy.launcher.launch(trainer_fn, *ar │
│ 37 │ │ else: │
│ ❱ 38 │ │ │ return trainer_fn(*args, **kwargs) │
│ 39 │ │
│ 40 │ except _TunerExitException: │
│ 41 │ │ trainer._call_teardown_hook() │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/trainer.py:638 in _fit_impl │
│ │
│ 635 │ │ │ ) │
│ 636 │ │ │
│ 637 │ │ # links data to the trainer │
│ ❱ 638 │ │ self._data_connector.attach_data( │
│ 639 │ │ │ model, train_dataloaders=train_dataloaders, val_datal │
│ 640 │ │ ) │
│ 641 │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/connectors/data_connector.py:148 in attach_data │
│ │
│ 145 │ │ │ _check_dataloader_none(predict_dataloaders, self._pred │
│ 146 │ │ │
│ 147 │ │ # set local properties on the model │
│ ❱ 148 │ │ self._copy_trainer_model_properties(model) │
│ 149 │ │
│ 150 │ def _copy_trainer_model_properties(self, model: "pl.LightningM │
│ 151 │ │ model.trainer = proxy(self.trainer) │
│ │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/connectors/data_connector.py:153 in │
│ _copy_trainer_model_properties │
│ │
│ 150 │ def _copy_trainer_model_properties(self, model: "pl.LightningM │
│ 151 │ │ model.trainer = proxy(self.trainer) │
│ 152 │ │ # for backward compatibility │
│ ❱ 153 │ │ model.precision = int(self.trainer.precision) if self.trai │
│ 154 │ │
│ 155 │ def attach_dataloaders( │
│ 156 │ │ self, │
╰──────────────────────────────────────────────────────────────────────────╯
ValueError: invalid literal for int() with base 10: '16-true'
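The crash comes from `int(self.trainer.precision)` receiving the string `'16-true'`, which `int()` cannot parse. The sketch below shows one tolerant way to read the bit width from values of that form. `precision_bits` is a hypothetical helper and not part of Lightning's API:

```python
def precision_bits(precision):
    """Extract the leading bit width from values such as 16, "32" or "16-true"."""
    # Keep only the part before the first dash: "16-true" -> "16".
    text = str(precision).split("-", 1)[0]
    # Drop non-digit prefixes such as the "bf" in "bf16".
    digits = "".join(ch for ch in text if ch.isdigit())
    if not digits:
        raise ValueError(f"unrecognized precision: {precision!r}")
    return int(digits)


for value in (16, "32", "16-true", "bf16-mixed"):
    print(value, "->", precision_bits(value))
```

In practice a mismatch like this usually points to library code and configuration written for different Lightning versions, so aligning the installed versions may be the simpler route.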
Looking at this project a few questions come to my mind that are unanswered in the readme.
I used naifu-diffusion to train the model, and the trained model was converted with convert_to_sd.py, and this error occurred.
xformers==0.0.20, diffusers==0.17.0, torch==2.0.1
diffusers==0.10.2 cannot be used.
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Loading captions: 3it [00:00, 2874.78it/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Loading resolutions: 3it [00:00, 72.47it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using scaled LR: 3e-06
| Name | Type | Params
------------------------------------------------------
0 | unet | UNet2DConditionModel | 859 M
1 | vae | AutoencoderKL | 83.7 M
2 | text_encoder | CLIPTextModel | 123 M
------------------------------------------------------
859 M Trainable params
206 M Non-trainable params
1.1 B Total params
2,132.471 Total estimated model params size (MB)
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning_lite/utilities/data.py:63: UserWarning: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
rank_zero_warn(
Epoch 0: 0%| | 0/3 [00:00<?, ?it/s]/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:339: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
Epoch 0: 100%|██████████████████████████████████████| 3/3 [00:04<00:00, 1.38s/it, loss=0.114]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████████████████████████████████| 3/3 [00:27<00:00, 9.29s/it, loss=0.114]
Traceback (most recent call last):
File "/home/ubuntu/naifu-diffusion/scripts/convert_to_sd.py", line 345, in <module>
unet_state_dict = convert_unet_state_dict(unet_state_dict, is_v2)
File "/home/ubuntu/naifu-diffusion/scripts/convert_to_sd.py", line 107, in convert_unet_state_dict
new_state_dict = {v: unet_state_dict[k] for k, v in mapping.items()}
File "/home/ubuntu/naifu-diffusion/scripts/convert_to_sd.py", line 107, in <dictcomp>
new_state_dict = {v: unet_state_dict[k] for k, v in mapping.items()}
KeyError: 'time_embedding.linear_1.weight'
I ran the conda environment, and with both the multi-gpu and test configs, using the supplied commands in a fresh environment, it always crashes with this error. There's also a fairly lengthy traceback I can send if you can't replicate the issue on your end.