
Comments (12)

1049451037 avatar 1049451037 commented on August 24, 2024

Could the model run inference normally if you input something?

from cogvlm.

HSPK avatar HSPK commented on August 24, 2024

Yes, it performed normally.

HSPK avatar HSPK commented on August 24, 2024

My Linux distribution is CentOS 7. I ran cli_demo.py in a cuda-12.1.0-devel-ubuntu-20.04 Docker container and encountered another problem.

(base) root@d6726d343e8b:/cogvlm# torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained pretrained/cogvlm-chat --version chat --english --bf16 --local_tokenizer pretrained/vicuna-7b-v1.5/
[2023-10-18 02:01:55,109] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2023-10-18 02:01:55,110] torch.distributed.run: [WARNING] 
[2023-10-18 02:01:55,110] torch.distributed.run: [WARNING] *****************************************
[2023-10-18 02:01:55,110] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2023-10-18 02:01:55,110] torch.distributed.run: [WARNING] *****************************************
[2023-10-18 02:01:58,470] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-18 02:01:58,568] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-18 02:02:00,801] [INFO] building CogVLMModel model ...
[2023-10-18 02:02:00,942] [INFO] building CogVLMModel model ...
[2023-10-18 02:02:02,159] [INFO] [RANK 0] > initializing model parallel with size 2
[2023-10-18 02:02:02,162] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-18 02:02:14,885] [INFO] [RANK 1]  > number of parameters on model parallel rank 1: 8893252992
[2023-10-18 02:02:16,267] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 8893252992
[2023-10-18 02:02:19,333] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-18 02:02:19,335] [INFO] [RANK 0] building CogVLMModel model ...
[2023-10-18 02:02:31,184] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 17639685376
[2023-10-18 02:02:37,736] [INFO] [RANK 0] global rank 0 is loading checkpoint pretrained/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-18 02:02:53,570] [INFO] [RANK 0] > successfully loaded pretrained/cogvlm-chat/1/mp_rank_00_model_states.pt
[2023-10-18 02:03:00,416] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4559 closing signal SIGTERM
[2023-10-18 02:03:20,651] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 1 (pid: 4560) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
====================================================
cli_demo.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-18_02:03:00
  host      : d6726d343e8b
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 4560)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 4560
====================================================

1049451037 avatar 1049451037 commented on August 24, 2024

If you run in Docker, you should ensure:

  1. the GPUs are visible inside Docker
  2. enough memory is allocated to Docker
  3. the CUDA version inside Docker is not older than your physical machine's
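A quick way to check all three from inside the container (a sketch; which commands are available depends on the image):

```shell
# Run inside the container; each command checks one item above.
nvidia-smi            # 1. GPUs visible to the container
free -h               # 2. memory available to the container
nvcc --version        # 3. CUDA toolkit version inside the container
```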

HSPK avatar HSPK commented on August 24, 2024

I used this command to start my Docker container.

docker run -itd --runtime=nvidia --gpus all -v .:/cogvlm 7509bd76c837

htop shows it has 112 CPUs and 500 GB of memory, nvidia-smi shows 8 × RTX 3090, and the CUDA version is the same as the host machine's.

1049451037 avatar 1049451037 commented on August 24, 2024

I'm not sure. Are your GPUs empty when you run the code? Or could your torch version be too old and have some potential bug?

HSPK avatar HSPK commented on August 24, 2024

Yes, they are empty when I run the code. My torch version is 2.1.0 built with CUDA 12.1.

1049451037 avatar 1049451037 commented on August 24, 2024

It seems to be a memory problem:

Signal 7 (SIGBUS) is a bus error, described [here](https://en.wikipedia.org/wiki/Bus_error), which usually indicates
"that a process is trying to access [memory](https://en.wikipedia.org/wiki/Computer_data_storage)
that the [CPU](https://en.wikipedia.org/wiki/Central_processing_unit) cannot physically address".

Please check your docker configuration.

refer to: https://discuss.pytorch.org/t/what-is-the-meaning-of-exitcode-in-torchrun/181775/2
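One Docker-specific cause of SIGBUS is the default 64 MB /dev/shm, which PyTorch's multiprocessing uses for inter-process tensors. A sketch of starting the container with a larger shared-memory segment (the 8g value is illustrative; size it to your workload):

```shell
# --shm-size enlarges /dev/shm inside the container.
docker run -itd --runtime=nvidia --gpus all --shm-size=8g \
    -v .:/cogvlm 7509bd76c837

# Alternatively, --ipc=host makes the container share the host's /dev/shm.
```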

HSPK avatar HSPK commented on August 24, 2024

Thanks! /dev/shm was only 64 MB, which was too small in my case. After I reconfigured it, cli_demo.py runs properly.
But the problem of GPU usage getting stuck at 100% still exists in the Docker environment. Sad...

ilovesouthpark avatar ilovesouthpark commented on August 24, 2024

Try the approach below. Please let us know if it solves your problem.

Due to a driver issue, 4090 cards can get stuck during multi-GPU parallel runs because of a communication problem: the program hangs and GPU utilization stays at 100%. This is mainly because the 4090 does not support P2P communication, so when this happens, P2P communication must be disabled manually.

Add the following export commands before the command that runs your code:

export NCCL_IB_DISABLE=1

export NCCL_P2P_DISABLE=1

Or prepend NCCL_P2P_DISABLE=1 directly to the run command, e.g.:

NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python train

Author: Dr罗勒酱
Link: https://www.jianshu.com/p/9f9c2ca98997
Source: Jianshu (简书)
Copyright belongs to the author. For commercial reuse, please contact the author for authorization; for non-commercial reuse, please credit the source.
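If you launch from Python rather than a shell, the same NCCL variables can be set in the script instead. A minimal sketch: NCCL only reads these variables when the process group is initialized, so they must be set before torch.distributed is used.

```python
import os

# Disable P2P and InfiniBand transports. NCCL reads these environment
# variables at process-group initialization, so set them first.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

# Any later torch.distributed.init_process_group(...) call in this
# process (and in children it spawns) will now avoid P2P transfers.
```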

ilovesouthpark avatar ilovesouthpark commented on August 24, 2024

And also the post below: https://zhuanlan.zhihu.com/p/581988527
I am about to deploy the model with 2 × 4090, so I would be interested to know your results. Thanks.

1049451037 avatar 1049451037 commented on August 24, 2024

Good job. @ilovesouthpark
