
Comments (12)

1049451037 avatar 1049451037 commented on August 24, 2024

Could the model run inference normally if you input something?

from cogvlm.

HSPK avatar HSPK commented on August 24, 2024

Yes, it performed normally.

HSPK avatar HSPK commented on August 24, 2024

My Linux distribution is CentOS 7. I ran cli_demo.py in a cuda-12.1.0-devel-ubuntu-20.04 Docker container and encountered another problem.

(base) root@d6726d343e8b:/cogvlm# torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained pretrained/cogvlm-chat --version chat --english --bf16 --local_tokenizer pretrained/vicuna-7b-v1.5/
[2023-10-18 02:01:55,109] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2023-10-18 02:01:55,110] torch.distributed.run: [WARNING] 
[2023-10-18 02:01:55,110] torch.distributed.run: [WARNING] *****************************************
[2023-10-18 02:01:55,110] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2023-10-18 02:01:55,110] torch.distributed.run: [WARNING] *****************************************
[2023-10-18 02:01:58,470] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-18 02:01:58,568] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-18 02:02:00,801] [INFO] building CogVLMModel model ...
[2023-10-18 02:02:00,942] [INFO] building CogVLMModel model ...
[2023-10-18 02:02:02,159] [INFO] [RANK 0] > initializing model parallel with size 2
[2023-10-18 02:02:02,162] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-18 02:02:14,885] [INFO] [RANK 1]  > number of parameters on model parallel rank 1: 8893252992
[2023-10-18 02:02:16,267] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 8893252992
[2023-10-18 02:02:19,333] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-18 02:02:19,335] [INFO] [RANK 0] building CogVLMModel model ...
[2023-10-18 02:02:31,184] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 17639685376
[2023-10-18 02:02:37,736] [INFO] [RANK 0] global rank 0 is loading checkpoint pretrained/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-18 02:02:53,570] [INFO] [RANK 0] > successfully loaded pretrained/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-18 02:03:00,416] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4559 closing signal SIGTERM
[2023-10-18 02:03:20,651] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 1 (pid: 4560) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
====================================================
cli_demo.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-18_02:03:00
  host      : d6726d343e8b
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 4560)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 4560
====================================================

1049451037 avatar 1049451037 commented on August 24, 2024

If you run in Docker, you should ensure:

  1. the GPUs are visible inside Docker
  2. enough memory is allocated to Docker
  3. the CUDA version inside Docker is not older than your physical machine's
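A quick way to check all three from inside the container (a sketch; which commands are available depends on the image):

```shell
# Run inside the container; each command checks one item above.
nvidia-smi            # 1. GPUs visible to the container
free -h               # 2. memory available to the container
nvcc --version        # 3. CUDA toolkit version inside the container
```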

HSPK avatar HSPK commented on August 24, 2024

I used this command to start my Docker container.

docker run -itd --runtime=nvidia --gpus all -v .:/cogvlm 7509bd76c837

htop shows it has 112 CPUs and 500 GB of memory, nvidia-smi shows 8 × RTX 3090, and the CUDA version is the same as the host machine's.

1049451037 avatar 1049451037 commented on August 24, 2024

I'm not sure. Are your GPUs empty when you run the code? Or could your torch version be too old and have some potential bug?

HSPK avatar HSPK commented on August 24, 2024

Yes, they are empty when I run the code. My torch version is 2.1.0 built with CUDA 12.1.

1049451037 avatar 1049451037 commented on August 24, 2024

It seems to be a memory problem:

Signal 7 (SIGBUS) is a bus error, described [here](https://en.wikipedia.org/wiki/Bus_error), which usually indicates
"that a process is trying to access [memory](https://en.wikipedia.org/wiki/Computer_data_storage)
that the [CPU](https://en.wikipedia.org/wiki/Central_processing_unit) cannot physically address".

Please check your docker configuration.

refer to: https://discuss.pytorch.org/t/what-is-the-meaning-of-exitcode-in-torchrun/181775/2
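One Docker-specific cause of SIGBUS is the default 64 MB /dev/shm, which PyTorch's multiprocessing uses for inter-process tensors. A sketch of starting the container with a larger shared-memory segment (the 8g value is illustrative; size it to your workload):

```shell
# --shm-size enlarges /dev/shm inside the container.
docker run -itd --runtime=nvidia --gpus all --shm-size=8g \
    -v .:/cogvlm 7509bd76c837

# Alternatively, --ipc=host makes the container share the host's /dev/shm.
```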

HSPK avatar HSPK commented on August 24, 2024

Thanks! /dev/shm was only 64 MB, which was too small in my case. After I reconfigured it, cli_demo.py runs properly.
But the problem of GPU usage getting stuck at 100% still exists in the Docker environment. Sad...

ilovesouthpark avatar ilovesouthpark commented on August 24, 2024

Try the approach below. Please let us know if it solves your problem.

Due to a driver issue, 4090 cards can get stuck during multi-GPU parallel runs because of a communication problem: the program hangs and GPU utilization stays at 100%. This is mainly because the 4090 does not support P2P communication, so when this happens, P2P communication must be disabled manually.

Add the following export commands before the command that runs your code:

export NCCL_IB_DISABLE=1

export NCCL_P2P_DISABLE=1

Or prepend NCCL_P2P_DISABLE=1 directly to the run command, e.g.:

NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python train

Author: Dr罗勒酱
Link: https://www.jianshu.com/p/9f9c2ca98997
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reuse, please contact the author for authorization; for non-commercial reuse, please credit the source.
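If you launch from Python rather than a shell, the same NCCL variables can be set in the script instead. A minimal sketch: NCCL only reads these variables when the process group is initialized, so they must be set before torch.distributed is used.

```python
import os

# Disable P2P and InfiniBand transports. NCCL reads these environment
# variables at process-group initialization, so set them first.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

# Any later torch.distributed.init_process_group(...) call in this
# process (and in children it spawns) will now avoid P2P transfers.
```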

ilovesouthpark avatar ilovesouthpark commented on August 24, 2024

And also the post below: https://zhuanlan.zhihu.com/p/581988527
I am about to deploy the model with 2 × 4090, so I would be interested to know your results. Thanks.

1049451037 avatar 1049451037 commented on August 24, 2024

Good job. @ilovesouthpark
