modelscope / facechain

FaceChain is a deep-learning toolchain for generating your Digital-Twin.

License: Apache License 2.0

Python 17.95% Shell 0.09% CSS 0.19% Jupyter Notebook 81.77%

facechain's Issues

mat1 and mat2 must have the same dtype

08/17/2023 14:39:07 - INFO - __main__ - ***** Running training *****
08/17/2023 14:39:07 - INFO - __main__ -   Num examples = 3
08/17/2023 14:39:07 - INFO - __main__ -   Num Epochs = 200
08/17/2023 14:39:07 - INFO - __main__ -   Instantaneous batch size per device = 1
08/17/2023 14:39:07 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
08/17/2023 14:39:07 - INFO - __main__ -   Gradient Accumulation steps = 1
08/17/2023 14:39:07 - INFO - __main__ -   Total optimization steps = 600
Steps:   0%|                                                                                    | 0/600 [00:00<?, ?it/s]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ facechain/facechain/train_text_to_image_lora.py:1103 in <module>           │
│                                                                                                  │
│   1100                                                                                           │
│   1101                                                                                           │
│   1102 if __name__ == "__main__":                                                                │
│ ❱ 1103 │   main()                                                                                │
│   1104                                                                                           │
│                                                                                                  │
│ facechain/facechain/train_text_to_image_lora.py:924 in main                │
│                                                                                                  │
│    921 │   │   │   │   │   raise ValueError(f"Unknown prediction type {noise_scheduler.config.p  │
│    922 │   │   │   │                                                                             │
│    923 │   │   │   │   # Predict the noise residual and compute loss                             │
│ ❱  924 │   │   │   │   model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sampl  │
│    925 │   │   │   │   loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")   │
│    926 │   │   │   │                                                                             │
│    927 │   │   │   │   # Gather the losses across all processes for logging (if we use distribu  │
│                                                                                                  │
│ /home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in           │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home//.local/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py:805 in │
│ forward                                                                                          │
│                                                                                                  │
│   802 │   │   # there might be better ways to encapsulate this.                                  │
│   803 │   │   t_emb = t_emb.to(dtype=sample.dtype)                                               │
│   804 │   │                                                                                      │
│ ❱ 805 │   │   emb = self.time_embedding(t_emb, timestep_cond)                                    │
│   806 │   │   aug_emb = None                                                                     │
│   807 │   │                                                                                      │
│   808 │   │   if self.class_embedding is not None:                                               │
│                                                                                                  │
│ /home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in           │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home//.local/lib/python3.10/site-packages/diffusers/models/embeddings.py:192 in        │
│ forward                                                                                          │
│                                                                                                  │
│   189 │   def forward(self, sample, condition=None):                                             │
│   190 │   │   if condition is not None:                                                          │
│   191 │   │   │   sample = sample + self.cond_proj(condition)                                    │
│ ❱ 192 │   │   sample = self.linear_1(sample)                                                     │
│   193 │   │                                                                                      │
│   194 │   │   if self.act is not None:                                                           │
│   195 │   │   │   sample = self.act(sample)                                                      │
│                                                                                                  │
│ /home//.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in           │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home//.local/lib/python3.10/site-packages/torch/nn/modules/linear.py:114 in forward    │
│                                                                                                  │
│   111 │   │   │   init.uniform_(self.bias, -bound, bound)                                        │
│   112 │                                                                                          │
│   113 │   def forward(self, input: Tensor) -> Tensor:                                            │
│ ❱ 114 │   │   return F.linear(input, self.weight, self.bias)                                     │
│   115 │                                                                                          │
│   116 │   def extra_repr(self) -> str:                                                           │
│   117 │   │   return 'in_features={}, out_features={}, bias={}'.format(                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: mat1 and mat2 must have the same dtype
Steps:   0%|                                                                                    | 0/600 [00:02<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home//.local/bin/accelerate:8 in <module>                                              │
│                                                                                                  │
│   5 from accelerate.commands.accelerate_cli import main                                          │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(main())                                                                         │
│   9                                                                                              │
│                                                                                                  │
│ /home//.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py:45 in  │
│ main                                                                                             │
│                                                                                                  │
│   42 │   │   exit(1)                                                                             │
│   43 │                                                                                           │
│   44 │   # Run                                                                                   │
│ ❱ 45 │   args.func(args)                                                                         │
│   46                                                                                             │
│   47                                                                                             │
│   48 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ /home//.local/lib/python3.10/site-packages/accelerate/commands/launch.py:941 in         │
│ launch_command                                                                                   │
│                                                                                                  │
│   938 │   elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA   │
│   939 │   │   sagemaker_launcher(defaults, args)                                                 │
│   940 │   else:                                                                                  │
│ ❱ 941 │   │   simple_launcher(args)                                                              │
│   942                                                                                            │
│   943                                                                                            │
│   944 def main():                                                                                │
│                                                                                                  │
│ /home//.local/lib/python3.10/site-packages/accelerate/commands/launch.py:603 in         │
│ simple_launcher                                                                                  │
│                                                                                                  │
│   600 │   process.wait()                                                                         │
│   601 │   if process.returncode != 0:                                                            │
│   602 │   │   if not args.quiet:                                                                 │
│ ❱ 603 │   │   │   raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)    │
│   604 │   │   else:                                                                              │
│   605 │   │   │   sys.exit(1)                                                                    │
│   606                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['miniconda3/bin/python', 'facechain/train_text_to_image_lora.py',
'--pretrained_model_name_or_path=ly261666/cv_portrait_model', '--revision=v2.0', '--sub_path=film/film',
'--dataset_name=./imgs', '--output_dataset_name=./processed', '--caption_column=text', '--resolution=512',
'--random_flip', '--train_batch_size=1', '--num_train_epochs=200', '--checkpointing_steps=5000',
'--learning_rate=1e-04', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--seed=42', '--output_dir=./output',
'--lora_r=32', '--lora_alpha=32', '--lora_text_encoder_r=32', '--lora_text_encoder_alpha=32']' returned non-zero exit
status 1.
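
This error usually comes from a precision mismatch: the UNet's time-embedding Linear weights are in one dtype (for example float16) while the latents and timestep embeddings arrive in another. Below is a minimal, self-contained sketch of the failure mode and the usual fix (casting inputs to the module's weight dtype); exactly where such a cast would go inside train_text_to_image_lora.py is an assumption, not verified against the script.

import torch

# Stand-in for a layer whose weights are in a different dtype than its input.
layer = torch.nn.Linear(320, 1280).to(torch.float64)
x = torch.randn(1, 320, dtype=torch.float32)

try:
    layer(x)                                   # raises a dtype-mismatch RuntimeError, as in the log above
except RuntimeError as err:
    print(err)

# Common fix in diffusers-style training loops: cast inputs to the model's weight dtype.
weight_dtype = next(layer.parameters()).dtype
print(layer(x.to(weight_dtype)).shape)         # torch.Size([1, 1280])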

Can't download 3.2 GB model

It's currently not possible to download the 3.20 GB model.
The download fails at ~95%. This is reproducible both on Colab and locally.

Downloading:  92% 2.95G/3.20G [01:54<00:05, 46.1MB/s]
Downloading:  93% 2.97G/3.20G [01:54<00:05, 48.3MB/s]
Downloading:  93% 2.98G/3.20G [01:55<00:06, 34.6MB/s]
Downloading:  94% 3.00G/3.20G [01:55<00:06, 32.4MB/s]
Downloading:  94% 3.01G/3.20G [01:57<00:10, 19.7MB/s]
Downloading:  95% 3.04G/3.20G [01:58<00:07, 21.9MB/s]
Downloading:  95% 3.05G/3.20G [01:59<00:08, 20.7MB/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 814, in _raw_read
    data = self._fp_read(amt) if not fp_closed else b""
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 799, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
  File "/usr/lib/python3.10/http/client.py", line 466, in read
    s = self.fp.read(amt)
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 940, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 879, in read
    data = self._raw_read(amt)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 813, in _raw_read
    with self._error_catcher():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/dist-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/routes.py", line 488, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1431, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1109, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 706, in wrapper
    response = f(*args, **kwargs)
  File "/content/facechain/app.py", line 184, in run
    data_process_fn(instance_data_dir, True)
  File "/content/facechain/facechain/inference.py", line 23, in data_process_fn
    data_process_fn = Blipv2()
  File "/content/facechain/facechain/data_process/preprocessing.py", line 202, in __init__
    self.model = DeepDanbooru()
  File "/content/facechain/facechain/data_process/deepbooru.py", line 721, in __init__
    snapshot_path = snapshot_download(foundation_model_id, revision='v4.0')
  File "/usr/local/lib/python3.10/dist-packages/modelscope/hub/snapshot_download.py", line 140, in snapshot_download
    parallel_download(
  File "/usr/local/lib/python3.10/dist-packages/modelscope/hub/file_download.py", line 243, in parallel_download
    list(executor.map(download_part, tasks))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modelscope/hub/file_download.py", line 203, in download_part
    for chunk in r.iter_content(chunk_size=API_FILE_DOWNLOAD_CHUNK_SIZE):
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 818, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
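
Connection resets on a multi-gigabyte file are often transient, and files that already finished downloading are normally reused from the local ModelScope cache, so retrying usually gets past the failure. A hedged retry wrapper around snapshot_download (the model id is a placeholder; use the one that failed for you):

import time
from modelscope import snapshot_download

def download_with_retries(model_id, revision=None, attempts=5, wait_seconds=10):
    # Retry the snapshot download a few times when the connection drops mid-transfer.
    for attempt in range(1, attempts + 1):
        try:
            return snapshot_download(model_id, revision=revision)
        except Exception as err:   # e.g. requests.exceptions.ChunkedEncodingError
            print(f'attempt {attempt} failed: {err!r}')
            time.sleep(wait_seconds)
    raise RuntimeError(f'could not download {model_id} after {attempts} attempts')

# model_dir = download_with_retries('<model id that failed>', revision='v4.0')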

Error: nms_impl: implementation for device cuda:0 not found.

When uploading pictures and starting training, an error occurs on the server side.
2023-08-19 16:09:33,371 - modelscope - INFO - load model done
cathed for image process of 000.jpg
Error: nms_impl: implementation for device cuda:0 not found.

[]
Error: result is empty.
Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\envs\fchain\lib\site-packages\gradio\routes.py", line 488, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\ProgramData\anaconda3\envs\fchain\lib\site-packages\gradio\blocks.py", line 1431, in process_api
    result = await self.call_function(
  File "C:\ProgramData\anaconda3\envs\fchain\lib\site-packages\gradio\blocks.py", line 1109, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\ProgramData\anaconda3\envs\fchain\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\ProgramData\anaconda3\envs\fchain\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "C:\ProgramData\anaconda3\envs\fchain\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "C:\ProgramData\anaconda3\envs\fchain\lib\site-packages\gradio\utils.py", line 706, in wrapper
    response = f(*args, **kwargs)
  File "D:\dev\facechain\app.py", line 174, in run
    data_process_fn(instance_data_dir, True)
  File "D:\dev\facechain\facechain\inference.py", line 24, in data_process_fn
    out_json_name = data_process_fn(input_img_dir)
  File "D:\dev\facechain\facechain\data_process\preprocessing.py", line 335, in __call__
    exit()
  File "C:\ProgramData\anaconda3\envs\fchain\lib\_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: None
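
"nms_impl: implementation for device cuda:0 not found" usually means the installed mmcv-full build has no compiled CUDA ops, or was compiled against a different torch/CUDA combination than the one in the environment. A hedged diagnostic, assuming mmcv-full 1.x is installed:

import torch
print('torch:', torch.__version__, '| torch built with CUDA:', torch.version.cuda)

# Importing mmcv.ops loads the compiled extension; if mmcv._ext is missing, this
# import itself fails, which already identifies the problem.
import mmcv
from mmcv.ops import get_compiling_cuda_version, get_compiler_version

print('mmcv:', mmcv.__version__)
print('mmcv compiled with CUDA:', get_compiling_cuda_version())
print('mmcv compiler:', get_compiler_version())
# If the reported CUDA version is missing or differs from torch.version.cuda, reinstall
# mmcv-full for the matching torch/CUDA build (see the Windows mmcv-full issue below,
# where building it from source via pip resolved the same error).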

OSError: [Errno 122] Disk quota exceeded when running "Start inference"

I got it running the way the ModelScope notebook describes; after the model reports that training succeeded, running "Start inference" fails with OSError: [Errno 122] Disk quota exceeded.

Runtime environment: ModelScope free-tier instance, PAI-DSW, GPU environment

8 cores, 32 GB RAM, 16 GB GPU memory
Preinstalled ModelScope Library
Preinstalled image: ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.1

The workspace disk usage is as follows:

root@dsw:/mnt/workspace# du -h -d 1
14G     ./.cache
73K     ./.ipynb_checkpoints
8.5K    ./.virtual_documents
574K    ./facechain
14G     .

Is the disk on the ModelScope free-tier instance simply too small? The official facechain README asks for "Disk: About 50GB".

Is there any way to complete the full workflow on a ModelScope free-tier instance?
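
The 14 GB under ./.cache is mostly downloaded model weights, and the free-tier disk is well below the roughly 50 GB the README asks for. One hedged workaround, if the instance exposes a larger mount, is to point the ModelScope cache there before anything imports modelscope; MODELSCOPE_CACHE is the environment variable the library reads for its cache root, and the target path below is only an example:

import os, shutil

# Assumption: some mount on the instance has more room than the current cache location.
os.environ['MODELSCOPE_CACHE'] = '/path/to/larger/volume/modelscope_cache'   # example path

total, used, free = shutil.disk_usage('/mnt/workspace')
print(f'workspace free space: {free / 2**30:.1f} GiB')   # the README recommends about 50 GB overall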

No such file or directory: '/tmp/qw/personalizaition_lora/pytorch_lora_weights.bin'

  File "/home/yyy/facechain/facechain/inference.py", line 47, in main_diffusion_inference
    pipe = merge_lora(pipe, lora_human_path, multiplier_human, from_safetensor=False)
  File "/home/yyy/facechain/facechain/merge_lora.py", line 15, in merge_lora
    checkpoint = torch.load(os.path.join(lora_path, 'pytorch_lora_weights.bin'),
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/qw/personalizaition_lora/pytorch_lora_weights.bin'
Training produces a safetensors file, so why does the code look for a .bin file, and why can't it be found? Any guidance is appreciated.
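
If training wrote pytorch_lora_weights.safetensors while merge_lora.py only looks for pytorch_lora_weights.bin, a fallback is to load whichever file actually exists. This is an illustrative sketch, not facechain's own code; safetensors.torch.load_file is the standard loader for .safetensors weights:

import os
import torch
from safetensors.torch import load_file

def load_lora_state_dict(lora_path):
    # Prefer the .bin name the current code expects, fall back to .safetensors.
    bin_path = os.path.join(lora_path, 'pytorch_lora_weights.bin')
    st_path = os.path.join(lora_path, 'pytorch_lora_weights.safetensors')
    if os.path.exists(bin_path):
        return torch.load(bin_path, map_location='cpu')
    if os.path.exists(st_path):
        return load_file(st_path, device='cpu')
    raise FileNotFoundError(f'no LoRA weights found under {lora_path}')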

Error when running on Colab

Running on Colab with an A100, everything works up to the last step: the web page opens, but after uploading photos and clicking "Start training" it reports "CUDA is not available".
The log is as follows:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/routes.py", line 488, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1431, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1109, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 706, in wrapper
    response = f(*args, **kwargs)
  File "/content/facechain/app.py", line 123, in run
    raise gr.Error('CUDA is not available.')
gradio.exceptions.Error: 'CUDA is not available.'
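
app.py raises this error when PyTorch cannot see a GPU, presumably after a torch.cuda.is_available() check; on Colab this usually means a CPU runtime was selected or the installed torch build is CPU-only. A quick check to run in a Colab cell before launching the app:

import torch

print('torch:', torch.__version__)
print('built with CUDA:', torch.version.cuda)        # None for CPU-only wheels
print('cuda available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('device:', torch.cuda.get_device_name(0))
# If this prints False, switch the Colab runtime to a GPU (Runtime -> Change runtime type)
# or reinstall a CUDA-enabled torch wheel, then restart the app.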

Could not find a version that satisfies the requirement tf-estimator-nightly==2.8.0.dev2021122109

TensorFlow 2.8.0 seems to be the problem; can this package even be installed with Python 3.8?
INFO: pip is looking at multiple versions of tensorflow to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 1.11.0 Requires-Python <3.13,>=3.9; 1.11.0rc1 Requires-Python <3.13,>=3.9; 1.11.0rc2 Requires-Python <3.13,>=3.9; 1.11.1 Requires-Python <3.13,>=3.9; 1.11.2 Requires-Python <3.13,>=3.9; 1.25.0 Requires-Python >=3.9; 1.25.0rc1 Requires-Python >=3.9; 1.25.1 Requires-Python >=3.9; 1.25.2 Requires-Python >=3.9; 1.26.0b1 Requires-Python <3.13,>=3.9; 3.8.0rc1 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement tf-estimator-nightly==2.8.0.dev2021122109 (from tensorflow) (from versions: none)
ERROR: No matching distribution found for tf-estimator-nightly==2.8.0.dev2021122109

Error when training data

On Windows, training fails with the error below, which then makes inference impossible.

 File "D:\ProgramData\anaconda3\envs\facechain\lib\site-packages\datasets\packaged_modules\folder_based_builder\folder_based_builder.py", line 311, in _generate_examples
    raise ValueError(
ValueError: image at tmp.png doesn't have metadata in D:\AI\qw\training_data\personalizaition_lora_labeled\metadata.jsonl.

Looking at the backend output, there is an error about an "rm" command: rm is a Linux command and does not exist on Windows.

2023-08-20 00:15:28.975118: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8700
000.jpg 0.9607361331582069
1girl, brown_eyes, brown_hair, earrings, jewelry, lips, long_hair, looking_at_viewer, open_mouth, simple_background, smile, solo, teeth, transparent_background
[['1girl', 'brown_eyes', 'brown_hair', 'earrings', 'jewelry', 'lips', 'long_hair', 'looking_at_viewer', 'open_mouth', 'simple_background', 'smile', 'solo', 'teeth', 'transparent_background']]
'rm' is not recognized as an internal or external command, operable program or batch file.
0.png a beautiful woman, brown_hair, earrings, jewelry, long_hair, looking_at_viewer, open_mouth, simple_background, smile, solo, transparent_background
08/20/2023 00:15:31 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
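
The 'rm' message indicates the preprocessing step shells out to the Linux rm command, which does not exist on Windows, so the intermediate files may never be cleaned up and metadata.jsonl can end up out of sync with the images. A hedged, portable sketch of the equivalent cleanup using the standard library instead of the shell (the path is a placeholder, not the exact one facechain uses):

import os
import shutil

def remove_path(path):
    # Cross-platform replacement for the shell commands rm -rf <dir> and rm <file>.
    if os.path.isdir(path):
        shutil.rmtree(path, ignore_errors=True)
    elif os.path.exists(path):
        os.remove(path)

remove_path(os.path.join('training_data', 'personalizaition_lora_labeled', 'tmp.png'))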

mmcv and modelscope version issue

Many tasks in the modelscope library still appear to be written against mmcv<2.0.0, which would require changes in a lot of places (for example from mmcv.parallel import MMDataParallel). Will this be updated later?

/opt/conda/bin/python: can't open file 'facechain/train_text_to_image_lora.py': [Errno 2] No such file or directory

Deployed in a container; GPU: A10, NVIDIA-SMI 525.105.17, Driver Version: 525.105.17, CUDA Version: 12.0.
After starting training from the web UI, the backend log prints the error below; the web UI shows training as completed, but the portrait experience step reports Error.

/opt/conda/bin/python: can't open file 'facechain/train_text_to_image_lora.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', 'facechain/train_text_to_image_lora.py', '--pretrained_model_name_or_path=ly261666/cv_portrait_model', '--revision=v2.0', '--sub_path=film/film', '--output_dataset_name=/tmp/qw/training_data/personalizaition_lora', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=1', '--num_train_epochs=200', '--checkpointing_steps=5000', '--learning_rate=1e-04', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--seed=42', '--output_dir=/tmp/qw/personalizaition_lora', '--lora_r=32', '--lora_alpha=32', '--lora_text_encoder_r=32', '--lora_text_encoder_alpha=32']' returned non-zero exit status 2.
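
The launcher runs the relative path facechain/train_text_to_image_lora.py, so the web UI has to be started from the repository root (the directory containing app.py and the facechain/ package). A small sanity check before launching, with the expected layout as an assumption:

import os

expected = os.path.join('facechain', 'train_text_to_image_lora.py')
print('current working directory:', os.getcwd())
print('training script found:', os.path.exists(expected))
# If this prints False, change into the cloned facechain checkout first and then run
# python3 app.py, so the relative path used by the training command resolves.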

torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Running the script: PYTHONPATH=. sh train_lora.sh "ly261666/cv_portrait_model" "v2.0" "film/film" "./imgs" "./processed" "./output"

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 399410) of binary: /home/disk01/wyw/.conda/envs/facechain/bin/python
Traceback (most recent call last):
  File "/home/disk01/wyw/.conda/envs/facechain/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/disk01/wyw/.conda/envs/facechain/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/disk01/wyw/.conda/envs/facechain/lib/python3.8/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/home/disk01/wyw/.conda/envs/facechain/lib/python3.8/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/disk01/wyw/.conda/envs/facechain/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/disk01/wyw/.conda/envs/facechain/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/disk01/wyw/.conda/envs/facechain/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

facechain/train_text_to_image_lora.py FAILED

Does it support running on a local machine?

I tried to train the LoRA on my local machine, but it raises an error:

In [1]: from modelscope import snapshot_download
^[[A2023-08-14 11:23:11,600 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2023-08-14 11:23:11,602 - modelscope - INFO - TensorFlow version 2.13.0 Found.
2023-08-14 11:23:11,602 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2023-08-14 11:23:11,631 - modelscope - INFO - Loading done! Current index file version is 1.8.1, with md5 bbb8dd73324c667bf9ab6594815ac903 and a total number of 893 components indexed

In [2]: model_dir = snapshot_download('Cherrytest/rot_bgr', revision='v1.0.0')
2023-08-14 11:23:13,696 - modelscope - ERROR - Authentication token does not exist, failed to access model Cherrytest/rot_bgr which may not exist or may be                 private. Please login first.
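
This message also appears when the model id or revision is wrong or the hub cannot be reached, so it is worth double-checking those first. If the model really is private or gated, modelscope supports logging in with an SDK access token before downloading; the token string below is a placeholder:

from modelscope.hub.api import HubApi
from modelscope import snapshot_download

api = HubApi()
api.login('<your ModelScope SDK access token>')   # placeholder; obtained from your ModelScope account page

model_dir = snapshot_download('Cherrytest/rot_bgr', revision='v1.0.0')
print(model_dir)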

About CUDA 11.7

When deploying facechain the system requires CUDA 11.7, but on Ubuntu the driver for a 4090 supports CUDA 12.2, and 11.7 cannot be installed. Is CUDA 12.2 also usable? Why do I keep getting errors during training on my side?

Two more questions:
1. With Python 3.10.6 installed via conda and CUDA 12.2, mim install mmcv-full==1.7.0 cannot be installed at all.
2. With Python 3.8, mim install mmcv-full==1.7.0 installs successfully and the program starts, but training then fails with an error.

Error: result is empty.

[]
Error: result is empty.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/routes.py", line 488, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1431, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1109, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 706, in wrapper
    response = f(*args, **kwargs)
  File "/content/facechain/app.py", line 149, in run
    data_process_fn(instance_data_dir, True)
  File "/content/facechain/facechain/inference.py", line 24, in data_process_fn
    out_json_name = data_process_fn(input_img_dir)
  File "/content/facechain/facechain/data_process/preprocessing.py", line 335, in __call__
    exit()
  File "/usr/lib/python3.10/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: None

When running on Colab (T4 runtime).

CUDA out of memory error on training

I run into the following error on Alibaba Cloud DSW with an NVIDIA V100 instance:

image: modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.1

DSW NVIDIA V100

08/18/2023 19:46:51 - INFO - __main__ - ***** Running training *****
08/18/2023 19:46:51 - INFO - __main__ -   Num examples = 9
08/18/2023 19:46:51 - INFO - __main__ -   Num Epochs = 200
08/18/2023 19:46:51 - INFO - __main__ -   Instantaneous batch size per device = 1
08/18/2023 19:46:51 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
08/18/2023 19:46:51 - INFO - __main__ -   Gradient Accumulation steps = 1
08/18/2023 19:46:51 - INFO - __main__ -   Total optimization steps = 1800
Steps:   0%|                                           | 0/1800 [00:00<?, ?it/s]Traceback (most recent call last):
  File "facechain/train_text_to_image_lora.py", line 1103, in <module>
    main()
  File "facechain/train_text_to_image_lora.py", line 924, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py", line 956, in forward
    sample = upsample_block(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/diffusers/models/unet_2d_blocks.py", line 2127, in forward
    hidden_states = attn(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/diffusers/models/transformer_2d.py", line 291, in forward
    hidden_states = block(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/diffusers/models/attention.py", line 154, in forward
    attn_output = self.attn1(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 321, in forward
    return self.processor(
  File "/opt/conda/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 601, in __call__
    attention_probs = attn.get_attention_scores(query, key, attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 362, in get_attention_scores
    attention_scores = torch.baddbmm(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 15.78 GiB total capacity; 8.13 GiB already allocated; 469.75 MiB free; 8.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
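
On a 16 GB card the attention blocks at 512x512 can still exceed memory in full precision. Common, hedged mitigations are the allocator hint the error message itself suggests, gradient checkpointing on the UNet, and memory-efficient attention when xformers is installed; whether train_text_to_image_lora.py already exposes flags for these is not verified here, so the sketch only shows the underlying calls:

import os

# 1) Reduce allocator fragmentation, as suggested in the OOM message itself.
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')

# 2) Inside the training script, on the UNet2DConditionModel instance (diffusers API):
def apply_memory_savers(unet):
    unet.enable_gradient_checkpointing()                   # trades compute for activation memory
    try:
        unet.enable_xformers_memory_efficient_attention()  # requires xformers to be installed
    except Exception as err:
        print('xformers not available:', err)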

On Windows you must use pip to install mmcv-full

When I use

mim install mmcv-full==1.7.0

I always get the following error:

RuntimeError: nms_impl: implementation for device cuda:0 not found.

I thought it was a CUDA version problem and tried downgrading CUDA from 12.2 to 11.8.

Finally, when I use

mim uninstall mmcv-full
pip install mmcv-full

the build takes about twenty minutes, but after restarting the app the error is gone.

Expected all tensors to be on the same device

Error: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)

With multiple GPUs installed, the specified GPU cannot be found. How can this be resolved?
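
A common workaround is to expose only one GPU to the process, so every model component is placed on the same device. CUDA_VISIBLE_DEVICES must be set before CUDA is initialized, for example at the very top of app.py or in the shell environment:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'    # pick the GPU index to use

import torch
print(torch.cuda.device_count())            # should report 1 after the restriction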

Question about image preprocessing during training

Hi, reading the training source code I see that images are only rotation-corrected; the other corrections mentioned in the README, such as face beautification, do not appear in the code. Also, are faces not labeled when training the face LoRA?

After repeated tweaks on Windows 11 I have it running, but at Portrait experience -> "Start generating" it fails and I really cannot find the cause. Please help confirm whether this error pinpoints the problem.

Windows 11
Python 3.8
CUDA 11.7
GPU: GeForce RTX 4060

Differences from the environment in the README:
1. mmcv-full==1.7.0 fails with "nms_impl: implementation for device cuda:0 not found."; repeated uninstall/reinstall did not help, so I switched to 1.7.1, which worked.
At Portrait experience -> "Start generating" the error is:
FileNotFoundError: [Errno 2] No such file or directory: 'D:\AI\facechain\tmp/qw/personalizaition_lora\pytorch_lora_weights.bin'

CUDA is not available issue with Colab

Everything works fine, but when I start training I get this error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/routes.py", line 488, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1431, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1109, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 706, in wrapper
    response = f(*args, **kwargs)
  File "/content/facechain/app.py", line 123, in run
    raise gr.Error('CUDA is not available.')
gradio.exceptions.Error: 'CUDA is not available.'

Problem running after installing mmcv==1.7.0

2023-08-21 16:42:31,201 - modelscope - INFO - Model revision not specified, use the latest revision: v1.1
2023-08-21 16:42:31,396 - modelscope - INFO - initiate model from /home/hx/.cache/modelscope/hub/damo/cv_ddsar_face-detection_iclr23-damofd
2023-08-21 16:42:31,396 - modelscope - INFO - initiate model from location /home/hx/.cache/modelscope/hub/damo/cv_ddsar_face-detection_iclr23-damofd.
2023-08-21 16:42:31,397 - modelscope - INFO - initialize model from /home/hx/.cache/modelscope/hub/damo/cv_ddsar_face-detection_iclr23-damofd
Traceback (most recent call last):
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/modelscope/utils/registry.py", line 210, in build_from_cfg
    return obj_cls._instantiate(**args)
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/modelscope/models/base/base_model.py", line 66, in _instantiate
    return cls(**kwargs)
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/modelscope/models/cv/face_detection/scrfd/damofd_detect.py", line 31, in __init__
    super().__init__(model_dir, **kwargs)
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/modelscope/models/cv/face_detection/scrfd/scrfd_detect.py", line 36, in __init__
    from mmdet.models import build_detector
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/mmdet/models/__init__.py", line 2, in <module>
    from .backbones import * # noqa: F401,F403
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/mmdet/models/backbones/__init__.py", line 2, in <module>
    from .csp_darknet import CSPDarknet
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/mmdet/models/backbones/csp_darknet.py", line 11, in <module>
    from ..utils import CSPLayer
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/mmdet/models/utils/__init__.py", line 13, in <module>
    from .point_sample import (get_uncertain_point_coords_with_randomness,
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/mmdet/models/utils/point_sample.py", line 3, in <module>
    from mmcv.ops import point_sample
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/mmcv/ops/__init__.py", line 2, in <module>
    from .active_rotated_filter import active_rotated_filter
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/mmcv/ops/active_rotated_filter.py", line 10, in <module>
    ext_module = ext_loader.load_ext(
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/site-packages/mmcv/utils/ext_loader.py", line 13, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/home/hx/anaconda3/envs/facechain/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named 'mmcv._ext'

Hi, I also tried installing mmcv 2.0.0, but training then fails with "cannot import name 'Config' from mmcv", so I switched back to 1.7.0 and hit the problem above. How should this be resolved?

Are there more outfits available?

At the moment there are only a few, like the "silver armor" entry:

examples = {
    'prompt_male': [
        ['silver armor'],
        ['T-shirt']
    ],
    'prompt_female': [
        ['beautiful traditional hanfu, upper_body'],
        ['an elegant evening gown']
    ],
}

example_styles = [
    {'name': '默认风格(default style)'},
    {'name': '凤冠霞帔(Chinese traditional gorgeous suit)',
     'model_id': 'ly261666/civitai_xiapei_lora',
     'revision': 'v1.0.0',
     'bin_file': 'xiapei.safetensors',
     'multiplier_style': 0.35,
     'add_prompt_style': 'red, hanfu, tiara, crown, '},
]
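
Extra outfits are just additional entries in these structures; a hedged example that mirrors the fields shown above (the values below are illustrative and not shipped with facechain):

# Append illustrative prompts; the keys and nesting follow the examples dict above.
examples['prompt_male'].append(['business suit, upper_body'])
examples['prompt_female'].append(['white summer dress, upper_body'])

# A new style entry can also reference an extra LoRA, like the xiapei entry above.
example_styles.append({
    'name': 'formal portrait (example)',
    'add_prompt_style': 'business attire, studio lighting, ',
})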

mmcv-full cannot be installed with torch 2; it hangs forever at "Building wheel for mmcv-full (setup.py) ... /"

mim install mmcv-full==1.7.0
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Looking in links: https://download.openmmlab.com/mmcv/dist/cu117/torch2.0.0/index.html
Collecting mmcv-full==1.7.0
Downloading http://mirrors.aliyun.com/pypi/packages/a1/81/89120850923f4c8b49efba81af30160e7b1b305fdfa9671a661705a8abbf/mmcv-full-1.7.0.tar.gz (593 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 593.6/593.6 kB 4.6 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: addict in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from mmcv-full==1.7.0) (2.4.0)
Requirement already satisfied: numpy in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from mmcv-full==1.7.0) (1.22.0)
Requirement already satisfied: packaging in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from mmcv-full==1.7.0) (23.1)
Requirement already satisfied: Pillow in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from mmcv-full==1.7.0) (10.0.0)
Requirement already satisfied: pyyaml in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from mmcv-full==1.7.0) (6.0.1)
Requirement already satisfied: yapf in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from mmcv-full==1.7.0) (0.40.1)
Requirement already satisfied: importlib-metadata>=6.6.0 in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from yapf->mmcv-full==1.7.0) (6.8.0)
Requirement already satisfied: platformdirs>=3.5.1 in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from yapf->mmcv-full==1.7.0) (3.10.0)
Requirement already satisfied: tomli>=2.0.1 in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from yapf->mmcv-full==1.7.0) (2.0.1)
Requirement already satisfied: zipp>=0.5 in /root/autodl-tmp/conda/envs/facechain/lib/python3.10/site-packages (from importlib-metadata>=6.6.0->yapf->mmcv-full==1.7.0) (3.16.2)
Building wheels for collected packages: mmcv-full
Building wheel for mmcv-full (setup.py) ... /

Cannot download the face fusion model cv_unet-image-face-fusion_damo

Hi, could you provide a download link?

The error is as follows:

Problems when running on k8s

Dockerfile

FROM registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.7.1-py38-torch2.0.1-tf1.15.5-1.8.0
RUN pip3 install gradio

SHELL ["/bin/bash", "--login", "-c"]
RUN GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/modelscope/facechain.git --depth 1
WORKDIR facechain
ENV NVIDIA_DISABLE_REQUIRE=true

ENTRYPOINT ["python3","app.py"]

ECS instances scheduled by Alibaba Cloud k8s

2023-08-20 09:08:15.817344: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
app.py:302: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
output_images = gr.Gallery(label='Output', show_label=False).style(columns=3, rows=2, height=600,

Error when training data

Running app.py on Windows 11 gives this error: 'PYTHONPATH' is not recognized as an internal or external command, operable program or batch file.
