mikubill / naifu

273.0 273.0 37.0 8.54 MB

Train generative models with pytorch lightning

License: MIT License

Python 98.89% Jupyter Notebook 1.11%

naifu's Issues

AttributeError: 'StableDiffusionModel' object has no attribute 'data_sampler'

The following error occurred after enabling cache_latents.
diffusers==0.17.0

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:612: UserWarning: Checkpoint directory /home/ubuntu/checkpoint exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
Loading captions: 967it [00:00, 6004.40it/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Loading resolutions: 967it [00:09, 102.76it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using scaled LR: 2e-06

  | Name         | Type                 | Params
------------------------------------------------------
0 | unet         | UNet2DConditionModel | 859 M 
1 | vae          | AutoencoderKL        | 83.7 M
2 | text_encoder | CLIPTextModel        | 123 M 
------------------------------------------------------
859 M     Trainable params
206 M     Non-trainable params
1.1 B     Total params
2,132.471 Total estimated model params size (MB)
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/fabric/utilities/data.py:63: UserWarning: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
  rank_zero_warn(
Training: 0it [00:00, ?it/s]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ubuntu/naifu-diffusion/trainer.py:116 in <module>                                          │
│                                                                                                  │
│   113                                                                                            │
│   114 if __name__ == "__main__":                                                                 │
│   115 │   args = parse_args()                                                                    │
│ ❱ 116 │   main(args)                                                                             │
│   117                                                                                            │
│                                                                                                  │
│ /home/ubuntu/naifu-diffusion/trainer.py:112 in main                                              │
│                                                                                                  │
│   109 │                                                                                          │
│   110 │   config, callbacks = pl_compat_fix(config, callbacks)                                   │
│   111 │   trainer = pl.Trainer(logger=logger, callbacks=callbacks, strategy=strategy, plugins=   │
│ ❱ 112 │   trainer.fit(model=model, ckpt_path=args.resume if args.resume else None)               │
│   113                                                                                            │
│   114 if __name__ == "__main__":                                                                 │
│   115 │   args = parse_args()                                                                    │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:608 in fit                                                                                     │
│                                                                                                  │
│    605 │   │   if not isinstance(model, pl.LightningModule):                                     │
│    606 │   │   │   raise TypeError(f"`Trainer.fit()` requires a `LightningModule`, got: {model.  │
│    607 │   │   self.strategy._lightning_module = model                                           │
│ ❱  608 │   │   call._call_and_handle_interrupt(                                                  │
│    609 │   │   │   self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,  │
│    610 │   │   )                                                                                 │
│    611                                                                                           │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:3 │
│ 8 in _call_and_handle_interrupt                                                                  │
│                                                                                                  │
│   35 │   │   if trainer.strategy.launcher is not None:                                           │
│   36 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,     │
│   37 │   │   else:                                                                               │
│ ❱ 38 │   │   │   return trainer_fn(*args, **kwargs)                                              │
│   39 │                                                                                           │
│   40 │   except _TunerExitException:                                                             │
│   41 │   │   trainer._call_teardown_hook()                                                       │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:650 in _fit_impl                                                                               │
│                                                                                                  │
│    647 │   │   │   model_provided=True,                                                          │
│    648 │   │   │   model_connected=self.lightning_module is not None,                            │
│    649 │   │   )                                                                                 │
│ ❱  650 │   │   self._run(model, ckpt_path=self.ckpt_path)                                        │
│    651 │   │                                                                                     │
│    652 │   │   assert self.state.stopped                                                         │
│    653 │   │   self.training = False                                                             │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:1103 in _run                                                                                   │
│                                                                                                  │
│   1100 │   │                                                                                     │
│   1101 │   │   self._checkpoint_connector.resume_end()                                           │
│   1102 │   │                                                                                     │
│ ❱ 1103 │   │   results = self._run_stage()                                                       │
│   1104 │   │                                                                                     │
│   1105 │   │   log.detail(f"{self.__class__.__name__}: trainer tearing down")                    │
│   1106 │   │   self._teardown()                                                                  │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:1182 in _run_stage                                                                             │
│                                                                                                  │
│   1179 │   │   │   return self._run_evaluate()                                                   │
│   1180 │   │   if self.predicting:                                                               │
│   1181 │   │   │   return self._run_predict()                                                    │
│ ❱ 1182 │   │   self._run_train()                                                                 │
│   1183 │                                                                                         │
│   1184 │   def _pre_training_routine(self) -> None:                                              │
│   1185 │   │   # wait for all to join if on distributed                                          │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:1205 in _run_train                                                                             │
│                                                                                                  │
│   1202 │   │   self.fit_loop.trainer = self                                                      │
│   1203 │   │                                                                                     │
│   1204 │   │   with torch.autograd.set_detect_anomaly(self._detect_anomaly):                     │
│ ❱ 1205 │   │   │   self.fit_loop.run()                                                           │
│   1206 │                                                                                         │
│   1207 │   def _run_evaluate(self) -> _EVALUATE_OUTPUT:                                          │
│   1208 │   │   assert self.evaluating                                                            │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/loops/loop.py:194 │
│ in run                                                                                           │
│                                                                                                  │
│   191 │   │                                                                                      │
│   192 │   │   self.reset()                                                                       │
│   193 │   │                                                                                      │
│ ❱ 194 │   │   self.on_run_start(*args, **kwargs)                                                 │
│   195 │   │                                                                                      │
│   196 │   │   while not self.done:                                                               │
│   197 │   │   │   try:                                                                           │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py │
│ :218 in on_run_start                                                                             │
│                                                                                                  │
│   215 │   │   self._results.to(device=self.trainer.lightning_module.device)                      │
│   216 │   │                                                                                      │
│   217 │   │   self.trainer._call_callback_hooks("on_train_start")                                │
│ ❱ 218 │   │   self.trainer._call_lightning_module_hook("on_train_start")                         │
│   219 │   │   self.trainer._call_strategy_hook("on_train_start")                                 │
│   220 │                                                                                          │
│   221 │   def on_advance_start(self) -> None:                                                    │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.p │
│ y:1347 in _call_lightning_module_hook                                                            │
│                                                                                                  │
│   1344 │   │   pl_module._current_fx_name = hook_name                                            │
│   1345 │   │                                                                                     │
│   1346 │   │   with self.profiler.profile(f"[LightningModule]{pl_module.__class__.__name__}.{ho  │
│ ❱ 1347 │   │   │   output = fn(*args, **kwargs)                                                  │
│   1348 │   │                                                                                     │
│   1349 │   │   # restore current_fx when nested context                                          │
│   1350 │   │   pl_module._current_fx_name = prev_fx_name                                         │
│                                                                                                  │
│ /home/ubuntu/naifu-diffusion/lib/model.py:288 in on_train_start                                  │
│                                                                                                  │
│   285 │   │   │   self.ema.to(self.device, dtype=self.unet.dtype)                                │
│   286 │   │                                                                                      │
│   287 │   │   if self.use_latent_cache:                                                          │
│ ❱ 288 │   │   │   self.dataset.cache_latents(self.vae, self.data_sampler.buckets if self.confi   │
│   289 │                                                                                          │
│   290 │   def on_train_epoch_start(self) -> None:                                                │
│   291 │   │   if self.use_latent_cache:                                                          │
│                                                                                                  │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/torch/nn/modules/module.py:1614 in  │
│ __getattr__                                                                                      │
│                                                                                                  │
│   1611 │   │   │   modules = self.__dict__['_modules']                                           │
│   1612 │   │   │   if name in modules:                                                           │
│   1613 │   │   │   │   return modules[name]                                                      │
│ ❱ 1614 │   │   raise AttributeError("'{}' object has no attribute '{}'".format(                  │
│   1615 │   │   │   type(self).__name__, name))                                                   │
│   1616 │                                                                                         │
│   1617 │   def __setattr__(self, name: str, value: Union[Tensor, 'Module']) -> None:             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'StableDiffusionModel' object has no attribute 'data_sampler'
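
The traceback shows `on_train_start` reading `self.data_sampler.buckets`, and `torch.nn.Module.__getattr__` raising because that attribute was never set on the model. Below is a minimal sketch of a defensive lookup, assuming (the source does not confirm this) that the sampler is only attached in some configurations. `resolve_buckets` and the stand-in objects are hypothetical and are not part of naifu.

```python
from types import SimpleNamespace


def resolve_buckets(model):
    """Return model.data_sampler.buckets, or None if no sampler was attached.

    Hypothetical helper: the actual fix in lib/model.py may look different.
    """
    sampler = getattr(model, "data_sampler", None)
    return getattr(sampler, "buckets", None)


# Stand-ins for the LightningModule, with and without a sampler attached.
with_sampler = SimpleNamespace(
    data_sampler=SimpleNamespace(buckets=[(512, 512), (768, 512)])
)
without_sampler = SimpleNamespace()

print(resolve_buckets(with_sampler))
print(resolve_buckets(without_sampler))
```

`getattr` with a default also works on an `nn.Module`, because its `__getattr__` raises `AttributeError` for unknown names.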

Conversion fails

python convert_to_sd.py --src pruned/dreamshaper_332BakedVaeClipFix.ckpt --dst dream_sim.ckpt

It just shows errors:

Traceback (most recent call last):
  File "K:\3\convert_to_sd.py", line 345, in <module>
    unet_state_dict = convert_unet_state_dict(unet_state_dict, is_v2)
  File "K:\3\convert_to_sd.py", line 107, in convert_unet_state_dict
    new_state_dict = {v: unet_state_dict[k] for k, v in mapping.items()}
  File "K:\3\convert_to_sd.py", line 107, in <dictcomp>
    new_state_dict = {v: unet_state_dict[k] for k, v in mapping.items()}
KeyError: 'time_embedding.linear_1.weight'
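
The `KeyError` is raised by the dict comprehension at line 107, which looks up every source key of `mapping` in the loaded state dict. One possible reading, which is an assumption and not confirmed here, is that the input checkpoint does not use the key layout the script expects. A small check like the sketch below could report the absent keys before the conversion runs. `missing_keys` is a hypothetical helper and the toy keys are only illustrative.

```python
def missing_keys(state_dict, mapping):
    """List the source keys named in `mapping` that `state_dict` lacks."""
    return sorted(k for k in mapping if k not in state_dict)


# Toy data: illustrative keys, not the full mapping from convert_to_sd.py.
mapping = {
    "time_embedding.linear_1.weight": "time_embed.0.weight",
    "time_embedding.linear_1.bias": "time_embed.0.bias",
}
state_dict = {"time_embed.0.weight": 0.0, "time_embed.0.bias": 0.0}

absent = missing_keys(state_dict, mapping)
if absent:
    print(f"{len(absent)} expected key(s) not found, e.g. {absent[0]!r}")
```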

Linking the nsfw model

Hello, I don't see a discussion panel, so I apologize for asking this on the issues tab.

I have 2 questions:

1. I successfully used your script to output a 2GB ckpt, and it works well. If I use this model as a base to train in dreambooth, does the resulting model keep the benefits from your script?
2. How do I link the nsfw model to this script? I can upload my nai-full-pruned to the colab folders through my gdrive, but I am having a hard time running it through the script. Could you help? @Mikubill

resume does not work with use_ema: True

Resuming training of a model that was started with use_ema: True crashes the trainer.
Error:
AttributeError: 'StableDiffusionModel' object has no attribute 'ema'

FileNotFoundError: [Errno 2] No such file or directory: '/content/naifu-diffusion/checkpoint/last.ckpt'

I get this error when running it. I tried 3 times and only wasted time.

Colab Issue

859 M     Trainable params
206 M     Non-trainable params
1.1 B     Total params
4,264.941 Total estimated model params size (MB)
Epoch 0: 100% 575/575 [31:47<00:00,  3.32s/it, loss=0.263]tcmalloc: large alloc 1134575616 bytes == 0x17a8de000 @  0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 1418223616 bytes == 0x2be0a000 @  0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 1772781568 bytes == 0x7ef66a558000 @  0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 2215976960 bytes == 0x7ef5e6406000 @  0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 2769977344 bytes == 0x7ef54125e000 @  0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 3462471680 bytes == 0x7ef5e6406000 @  0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 4328095744 bytes == 0x7ef43f2c6000 @  0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91
tcmalloc: large alloc 5410119680 bytes == 0x7ef54125e000 @  0x7efc50b2d615 0x5d6f4c 0x51edd1 0x51ef5b 0x5aac95 0x5d8506 0x7efc2ce305fe 0x7efc0645fc85 0x7efc0645a1e7 0x7efc06461309 0x7efc2ce430bb 0x7efc2ca436af 0x5d80be 0x5d8d8c 0x4fedd4 0x4997c7 0x55d078 0x5d8941 0x4990ca 0x55cd91 0x5d8941 0x4997c7 0x5d8868 0x4990ca 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91

It seems that there is a RAM error when saving the checkpoint.

WARNING:root:Only 108 Image will be uploaded.

cannot resume latest checkpoint when training with train_sd15.yaml --resume

I am running 'python trainer.py train_sd15.yaml --resume', but training still begins from 0.
Is there any way to continue training from where it stopped?
Thanks!
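
The tracebacks elsewhere on this page show `trainer.py` calling `trainer.fit(model=model, ckpt_path=args.resume if args.resume else None)`, which suggests (an assumption for this version of the script) that `--resume` needs a checkpoint path and is not a bare flag. A minimal sketch for picking the newest checkpoint to pass along follows. `latest_checkpoint` is a hypothetical helper and the `checkpoint` directory name is an assumption.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(dirpath: str) -> Optional[str]:
    """Return the most recently modified *.ckpt file in dirpath, or None."""
    candidates = sorted(
        Path(dirpath).glob("*.ckpt"), key=lambda p: p.stat().st_mtime
    )
    return str(candidates[-1]) if candidates else None


# Hypothetical usage: the result would be handed to --resume, or directly to
# trainer.fit(model=model, ckpt_path=...).
print(latest_checkpoint("checkpoint"))
```

If the directory is missing or holds no `.ckpt` files, the helper returns `None`, which matches the `None` fallback already used for `ckpt_path`.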

DPO training issues

I wanted to give DPO training a try, but there seem to be multiple issues with the PairedDataset.

`reso` and `interp` are not assigned to `self` in `__init__`, but `self.reso` is used later in the file:

naifu/data/paired_wds.py

Line 54 in af5c34b

"target_size_as_tuple": [(self.reso, self.reso)] * len(original_sizes),

Getting this error when using the default train_dpo.yaml config:

  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/workspace/naifu/data/paired_wds.py", line 63, in __getitem__
    example = self.preprocess_train(example)
  File "/workspace/naifu/data/paired_wds.py", line 33, in preprocess_train
    images = [
  File "/workspace/naifu/data/paired_wds.py", line 34, in <listcomp>
    Image.open(io.BytesIO(im_bytes)).convert("RGB")
TypeError: a bytes-like object is required, not 'int'
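
Both reports can be illustrated with a small sketch. It stores the constructor arguments on `self`, and it wraps a single bytes payload in a list, since iterating over a `bytes` object yields ints, which is one way to get the `TypeError` above. That this is the actual cause is an assumption. `PairedPreprocessor`, its default values, and `as_bytes_list` are hypothetical and do not come from `paired_wds.py`.

```python
class PairedPreprocessor:
    """Hypothetical stand-in for the relevant parts of PairedDataset."""

    def __init__(self, reso=1024, interp="bilinear"):
        # Keep the arguments on self so later methods can read them.
        # The default values here are assumptions for illustration.
        self.reso = reso
        self.interp = interp

    def target_sizes(self, original_sizes):
        # Mirrors the expression quoted from line 54 of the issue.
        return [(self.reso, self.reso)] * len(original_sizes)

    @staticmethod
    def as_bytes_list(value):
        # A lone bytes payload becomes a one-element list; an iterable of
        # payloads is passed through as a list.
        if isinstance(value, (bytes, bytearray)):
            return [bytes(value)]
        return list(value)


prep = PairedPreprocessor(reso=512)
print(prep.target_sizes([(640, 480), (800, 600)]))
print(prep.as_bytes_list(b"\x89PNG"))
```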

Batchsize >1 is currently broken.

I manually set the batch size in buckets.py, because the config setting has no effect. It then crashes with:

ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/bunny/miniconda3/envs/nai/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/bunny/miniconda3/envs/nai/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/media/bunny/D612FBE112FBC511/FinetuneV6/data/store.py", line 189, in collate_fn
    z.append(torch.asarray([[self.tokenizer.bos_token_id] + x[:75] + [self.tokenizer.eos_token_id] for x in tokens]))
ValueError: expected sequence of length 61 at dim 1 (got 34)

It used to work before this commit cc5b063
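
The `ValueError` says the rows handed to `torch.asarray` have different lengths (61 and 34), which happens when captions in one batch tokenize to different sizes and are not padded. A minimal sketch of padding each row to a fixed length before building the tensor is shown below. `pad_chunk` is a hypothetical helper, and the token ids are illustrative values that real code would read from the tokenizer.

```python
def pad_chunk(ids, bos_id, eos_id, pad_id, max_len=77):
    """Wrap up to max_len - 2 token ids in BOS/EOS and pad to max_len."""
    row = [bos_id] + list(ids)[: max_len - 2] + [eos_id]
    return row + [pad_id] * (max_len - len(row))


# Illustrative ids; in real code these come from the tokenizer.
BOS, EOS, PAD = 49406, 49407, 49407

# Two captions that would give rows of length 61 and 34 without padding.
tokens = [list(range(59)), list(range(32))]
rows = [pad_chunk(x, BOS, EOS, PAD) for x in tokens]
print({len(r) for r in rows})
```

With every row the same length, the nested list is rectangular and can be converted to a tensor in one call.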

Colab TPU Failed

I tested with a modified Colab notebook based on your example. The notebook is here. It installs requirement_tpu and runs using the default TPU config.

Got error:

/content/naifu-diffusion
Downloading: "https://pub-2fdef7a2969f43289c42ac5ae3412fd4.r2.dev/animesfw.tgz" to /tmp/model

100% 3.58G/3.58G [02:16<00:00, 28.2MB/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Downloading: "https://pub-2fdef7a2969f43289c42ac5ae3412fd4.r2.dev/mmk.tgz" to /tmp/dataset-0

100% 52.7M/52.7M [00:02<00:00, 19.2MB/s]
Loading resolutions: 34it [00:00, 992.53it/s]
Loading captions: 68it [00:00, 26123.16it/s]
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:452: LightningDeprecationWarning: Setting `Trainer(tpu_cores=8)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='tpu', devices=8)` instead.
  f"Setting `Trainer(tpu_cores={tpu_cores!r})` is deprecated in v1.7 and will be removed"
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:467: UserWarning: The flag `devices=-1` will be ignored, instead the device specific number 8 will be used
  f"The flag `devices={devices}` will be ignored, "
/usr/local/lib/python3.7/dist-packages/lightning_lite/accelerators/cuda.py:159: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
GPU available: False, used: False
TPU available: True, using: 8 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 296) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
====================================================
trainer.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-19_18:29:43
  host      : 21021965512a
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 296)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 296
====================================================

[Feature Request] Converting/Using custom non-diffusers SD VAE .ckpt

I'm not sure if this feature is already planned but I thought I'd bring it up.
Since the default model downloaded is one that I assume was trained with a certain custom VAE, it might be good to have a way to use it for training as well.
Just thinking this might possibly fix issues I've been having in my tests with hand details being lost compared to the pre-finetuned model.

Distributed training stuck at: Downloading parameters from peer

I started the training on one machine (works perfectly fine), but when I start the other machine, it gets stuck in downloading parameters.
Here's the output I get:

/home/ai/.virtualenvs/DMshareGenesis/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py:151: FutureWarning: The configuration file of the unet has set the default `sample_size` to smaller than 64 which seems highly unlikely. If your checkpoint is a fine-tuned version of any of the following: 
- CompVis/stable-diffusion-v1-4 
- CompVis/stable-diffusion-v1-3 
- CompVis/stable-diffusion-v1-2 
- CompVis/stable-diffusion-v1-1 
- runwayml/stable-diffusion-v1-5 
- runwayml/stable-diffusion-inpainting 
 you should change 'sample_size' to 64 in the configuration file. Please make sure to update the config accordingly as leaving `sample_size=32` in the config might lead to incorrect results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for the `unet/config.json` file
  deprecate("sample_size<64", "1.0.0", deprecation_message, standard_warn=False)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Downloading: "https://pub-2fdef7a2969f43289c42ac5ae3412fd4.r2.dev/mmk.tgz" to /tmp/dataset-0

100%|██████████| 52.7M/52.7M [00:10<00:00, 5.39MB/s]
You are using a CUDA device ('NVIDIA GeForce RTX 3090 Ti') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Loading captions: 34it [00:00, 11746.82it/s]
Loading resolutions: 0it [00:00, ?it/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Loading resolutions: 34it [00:00, 1118.48it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using scaled LR: 5e-06

  | Name         | Type                 | Params
------------------------------------------------------
0 | text_encoder | CLIPTextModel        | 123 M 
1 | vae          | AutoencoderKL        | 83.7 M
2 | unet         | UNet2DConditionModel | 859 M 
------------------------------------------------------
859 M     Trainable params
206 M     Non-trainable params
1.1 B     Total params
4,264.941 Total estimated model params size (MB)
Epoch 0:   0%|          | 0/34 [00:00<?, ?it/s] Found per machine batch size automatically from the batch: 1
Jan 05 07:38:13.862 [INFO] Found no active peers: None
Jan 05 07:38:14.991 [INFO] Initializing optimizer manually since it has no tensors in state dict. To override this, provide initialize_optimizer=False
Jan 05 07:38:16.819 [INFO] Downloading parameters from peer QmVSfugP26MS1qUBWxXG1RKhZpGkodQuFoM4HFhqTc4mwj

This is potentially a Hivemind issue, but I wanted to check whether anyone has encountered this issue before.

SDXL checkpoint format/compatibility

Training with the train_sdxl configs outputs checkpoints that are the size of the SDXL UNet (10.3 GB), in diffusers format.

How do I use these files for anything?

colab issue DKW

you should change 'sample_size' to 64 in the configuration file. Please make sure to update the config accordingly as leaving sample_size=32 in the config might lead to incorrect results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for the unet/config.json file
deprecate("sample_size<64", "1.0.0", deprecation_message, standard_warn=False)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /content/naifu-diffusion/lightning_logs
Loading images: 34it [00:00, 1461.33it/s]
BucketManager initialized with base_res = (512, 512), max_size = (768, 512)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using scaled LR: 7.071067811865476e-06

  | Name         | Type                 | Params
------------------------------------------------------
0 | unet         | UNet2DConditionModel | 859 M
1 | vae          | AutoencoderKL        | 83.7 M
2 | text_encoder | CLIPTextModel        | 123 M
------------------------------------------------------
859 M     Trainable params
206 M     Non-trainable params
1.1 B     Total params
4,264.941 Total estimated model params size (MB)
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=1` in the `DataLoader` to improve performance.
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/data.py:121: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
Epoch 0: 0% 0/17 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/naifu-diffusion/trainer.py", line 125, in <module>
    main(args)
  File "/content/naifu-diffusion/trainer.py", line 121, in main
    trainer.fit(model=model, ckpt_path=args.resume if args.resume else None)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()

can anyone tell me :(

ValueError: invalid literal for int() with base 10: '16-true'

pytorch==2.0.1+cu118, diffusers==0.17.0

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
╭─────────────────── Traceback (most recent call last) ────────────────────╮
│ /home/ubuntu/naifu-diffusion/trainer.py:119 in <module>                  │
│                                                                          │
│   116                                                                    │
│   117 if __name__ == "__main__":                                         │
│   118 │   args = parse_args()                                            │
│ ❱ 119 │   main(args)                                                     │
│   120                                                                    │
│                                                                          │
│ /home/ubuntu/naifu-diffusion/trainer.py:115 in main                      │
│                                                                          │
│   112 │                                                                  │
│   113 │   config, callbacks = pl_compat_fix(config, callbacks)           │
│   114 │   trainer = pl.Trainer(logger=logger, callbacks=callbacks, strat │
│ ❱ 115 │   trainer.fit(model=model, ckpt_path=args.resume if args.resume  │
│   116                                                                    │
│   117 if __name__ == "__main__":                                         │
│   118 │   args = parse_args()                                            │
│                                                                          │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/trainer.py:608 in fit                                     │
│                                                                          │
│    605 │   │   if not isinstance(model, pl.LightningModule):             │
│    606 │   │   │   raise TypeError(f"`Trainer.fit()` requires a `Lightni │
│    607 │   │   self.strategy._lightning_module = model                   │
│ ❱  608 │   │   call._call_and_handle_interrupt(                          │
│    609 │   │   │   self, self._fit_impl, model, train_dataloaders, val_d │
│    610 │   │   )                                                         │
│    611                                                                   │
│                                                                          │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/call.py:38 in _call_and_handle_interrupt                  │
│                                                                          │
│   35 │   │   if trainer.strategy.launcher is not None:                   │
│   36 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *ar │
│   37 │   │   else:                                                       │
│ ❱ 38 │   │   │   return trainer_fn(*args, **kwargs)                      │
│   39 │                                                                   │
│   40 │   except _TunerExitException:                                     │
│   41 │   │   trainer._call_teardown_hook()                               │
│                                                                          │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/trainer.py:638 in _fit_impl                               │
│                                                                          │
│    635 │   │   │   )                                                     │
│    636 │   │                                                             │
│    637 │   │   # links data to the trainer                               │
│ ❱  638 │   │   self._data_connector.attach_data(                         │
│    639 │   │   │   model, train_dataloaders=train_dataloaders, val_datal │
│    640 │   │   )                                                         │
│    641                                                                   │
│                                                                          │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/connectors/data_connector.py:148 in attach_data           │
│                                                                          │
│   145 │   │   │   _check_dataloader_none(predict_dataloaders, self._pred │
│   146 │   │                                                              │
│   147 │   │   # set local properties on the model                        │
│ ❱ 148 │   │   self._copy_trainer_model_properties(model)                 │
│   149 │                                                                  │
│   150 │   def _copy_trainer_model_properties(self, model: "pl.LightningM │
│   151 │   │   model.trainer = proxy(self.trainer)                        │
│                                                                          │
│ /home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning/p │
│ ytorch/trainer/connectors/data_connector.py:153 in                       │
│ _copy_trainer_model_properties                                           │
│                                                                          │
│   150 │   def _copy_trainer_model_properties(self, model: "pl.LightningM │
│   151 │   │   model.trainer = proxy(self.trainer)                        │
│   152 │   │   # for backward compatibility                               │
│ ❱ 153 │   │   model.precision = int(self.trainer.precision) if self.trai │
│   154 │                                                                  │
│   155 │   def attach_dataloaders(                                        │
│   156 │   │   self,                                                      │
╰──────────────────────────────────────────────────────────────────────────╯
ValueError: invalid literal for int() with base 10: '16-true'
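
The failing line calls `int()` on the trainer's precision value, which in this run is the string '16-true'. Below is a minimal sketch of a more tolerant parse. It assumes precision settings look like the one in this traceback: a bit width, optionally prefixed with `bf`, optionally followed by `-true` or `-mixed`. `precision_bits` is a hypothetical helper written for illustration, not part of Lightning or naifu.

```python
import re


def precision_bits(precision):
    """Return the bit width from a precision setting such as 16, "16-true" or "bf16-mixed".

    Hypothetical helper for illustration; not a Lightning or naifu API.
    """
    if isinstance(precision, int):
        return precision
    # Optional "bf" prefix, the digits we want, optional "-true"/"-mixed" suffix.
    match = re.fullmatch(r"(?:bf)?(\d+)(?:-(?:true|mixed))?", str(precision))
    if match is None:
        raise ValueError(f"unrecognized precision setting: {precision!r}")
    return int(match.group(1))


# int("16-true") raises the ValueError shown above; the helper accepts the string.
print(precision_bits("16-true"))
```

This only illustrates why the conversion fails. Whether the real fix belongs in the training config or in the installed Lightning version is not something the traceback settles.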

Does it work? Output images are all noise.

Training with the default settings on Colab, the output images are pure noise after 3 epochs (see attachment). An earlier test at 90 epochs gave the same result.

Some questions.

Looking at this project, a few questions come to mind that the README does not answer.

Does this allow more than 75 tokens of text input, or are the text/tags truncated after 75?
How does limit_train_batches work? Are the images chosen at random? Do I have to change the value if the dataset is >100 images?
Are input images resized automatically for the buckets, or is some pre-processing necessary? (In the test datasets, for example, all images appear to have a 1:1 aspect ratio. Is this required?)
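
On the `limit_train_batches` question: this is a PyTorch Lightning `Trainer` argument. In Lightning, an integer value caps the number of batches run per epoch, and a float is read as a fraction of the available batches. It only shortens the epoch and does not select samples itself, so which images are seen depends on how the dataloader orders them. Here is a rough sketch of that rule. `effective_batches` is a hypothetical helper written for illustration, not Lightning's own code, and the value 100 is assumed from the question above.

```python
def effective_batches(total_batches, limit):
    """Number of training batches run per epoch for a given limit_train_batches value.

    Hypothetical helper mirroring the documented rule:
    float -> fraction of the batches, int -> hard cap on the batch count.
    """
    if isinstance(limit, float):
        if not 0.0 <= limit <= 1.0:
            raise ValueError("a float limit must lie between 0.0 and 1.0")
        return int(total_batches * limit)
    return min(total_batches, limit)


# With an assumed limit of 100, a 34-batch epoch is unaffected,
# while a 200-batch epoch is cut to 100 batches.
print(effective_batches(34, 100))
print(effective_batches(200, 100))
```

Under this reading, an integer limit only matters once an epoch has more batches than the limit, and it counts batches, not images.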

KeyError: 'time_embedding.linear_1.weight'

I trained a model with naifu-diffusion and converted the result with convert_to_sd.py, which raised this error.
xformers==0.0.20, diffusers==0.17.0, torch==2.0.1
diffusers==0.10.2 cannot be used

Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Loading captions: 3it [00:00, 2874.78it/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Loading resolutions: 3it [00:00, 72.47it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using scaled LR: 3e-06

  | Name         | Type                 | Params
------------------------------------------------------
0 | unet         | UNet2DConditionModel | 859 M 
1 | vae          | AutoencoderKL        | 83.7 M
2 | text_encoder | CLIPTextModel        | 123 M 
------------------------------------------------------
859 M     Trainable params
206 M     Non-trainable params
1.1 B     Total params
2,132.471 Total estimated model params size (MB)
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/lightning_lite/utilities/data.py:63: UserWarning: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
  rank_zero_warn(
Epoch 0:   0%|                                                          | 0/3 [00:00<?, ?it/s]/home/ubuntu/miniconda3/envs/nd/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:339: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
Epoch 0: 100%|██████████████████████████████████████| 3/3 [00:04<00:00,  1.38s/it, loss=0.114]`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████████████████████████████████| 3/3 [00:27<00:00,  9.29s/it, loss=0.114]
Traceback (most recent call last):
  File "/home/ubuntu/naifu-diffusion/scripts/convert_to_sd.py", line 345, in <module>
    unet_state_dict = convert_unet_state_dict(unet_state_dict, is_v2)
  File "/home/ubuntu/naifu-diffusion/scripts/convert_to_sd.py", line 107, in convert_unet_state_dict
    new_state_dict = {v: unet_state_dict[k] for k, v in mapping.items()}
  File "/home/ubuntu/naifu-diffusion/scripts/convert_to_sd.py", line 107, in <dictcomp>
    new_state_dict = {v: unet_state_dict[k] for k, v in mapping.items()}
KeyError: 'time_embedding.linear_1.weight'
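
The traceback shows the dict comprehension looking up every key of `mapping` in `unet_state_dict` and failing on the first one that is absent. Below is a small sketch of a pre-check that lists the absent keys, together with any state-dict keys that end with the same name, before converting. `report_missing_keys` and the sample dictionaries are hypothetical and not part of `convert_to_sd.py`. The extra `unet.` prefix in the toy data is an assumed cause used only for illustration.

```python
def report_missing_keys(state_dict, mapping):
    """Return {missing_key: [state-dict keys that end with that name]}.

    Hypothetical diagnostic helper; an empty result means every lookup would succeed.
    """
    report = {}
    for key in mapping:
        if key not in state_dict:
            # Near matches hint at a naming mismatch such as an extra prefix.
            report[key] = [k for k in state_dict if k.endswith(key)]
    return report


# Toy data: the weight exists, but under a prefixed name (assumption for illustration).
state_dict = {"unet.time_embedding.linear_1.weight": 0}
mapping = {"time_embedding.linear_1.weight": "time_embed.0.weight"}
print(report_missing_keys(state_dict, mapping))
```

If the report shows near matches, the checkpoint keys and the names the converter expects differ only in naming. If it shows none, the weights are not in the file being converted at all.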

AttributeError: 'StableDiffusionModel' object has no attribute 'dataset'

I set up the conda environment, and with both the multi-gpu and test configs, using the supplied commands in a fresh environment, training always crashes with this error. There's also a fairly lengthy traceback I can send if you can't replicate the issue on your end.
