If a single GPU runs out of memory, how can two GPUs be used to load and run inference on the same model? about transformers HOT 3 OPEN

zhaoyuchen1128 commented on August 16, 2024

If a single GPU runs out of memory, how can two GPUs be used to load and run inference on the same model?

from transformers.

Comments (3)

RUFFY-369 commented on August 16, 2024

Hi @zhaoyuchen1128, for the out-of-memory issue on a single GPU you can load the model as follows:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Trelis/Llama-2-7b-chat-hf-function-calling-v2",
    low_cpu_mem_usage=True,
    offload_state_dict=True,
)

low_cpu_mem_usage: loads the model using roughly 1x the model size in CPU memory.
offload_state_dict: temporarily offloads the CPU state dict to the hard drive while loading, so that the process does not run out of CPU RAM.

To load the model across multiple GPUs, add device_map="auto" as follows:

model = AutoModelForCausalLM.from_pretrained(
    "Trelis/Llama-2-7b-chat-hf-function-calling-v2",
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    device_map="auto",
)

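device_map: with "auto", the model's layers are distributed across the available GPUs (this relies on the accelerate library being installed). Whatever does not fit on the GPUs is placed on the CPU, and then on disk.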
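To make the idea of splitting one model over several devices more concrete, here is a minimal pure-Python sketch of an in-order, greedy placement. It is only an illustration and not Accelerate's actual algorithm: the function name `assign_layers`, the layer names, the layer sizes, and the per-device budgets are all hypothetical.

```python
from typing import Dict, List, Tuple, Union

Device = Union[int, str]  # GPU index (0, 1, ...) or a name such as "cpu"


def assign_layers(
    layer_sizes_gb: List[Tuple[str, float]],
    budgets_gb: Dict[Device, float],
) -> Dict[str, Device]:
    """Place layers in order, filling each device before moving to the next.

    Simplified illustration only: real placement also has to account for
    tied weights, modules that must not be split, and activation memory.
    """
    devices = list(budgets_gb.items())  # dicts keep insertion order
    placement: Dict[str, Device] = {}
    idx = 0
    remaining = devices[0][1] if devices else 0.0

    for name, size in layer_sizes_gb:
        # Move on to the next device while the current one cannot hold this layer.
        while idx < len(devices) and size > remaining:
            idx += 1
            if idx < len(devices):
                remaining = devices[idx][1]
        if idx >= len(devices):
            raise MemoryError(f"no device has room left for {name}")
        placement[name] = devices[idx][0]
        remaining -= size

    return placement


# Hypothetical model: 8 layers of 10 GB each.
layers = [(f"layer.{i}", 10.0) for i in range(8)]
# Hypothetical budgets: two GPUs with 22 GB usable each, then CPU RAM.
budgets = {0: 22.0, 1: 22.0, "cpu": 64.0}

device_map = assign_layers(layers, budgets)
for layer_name, device in device_map.items():
    print(layer_name, "->", device)
```

In this toy setup two layers land on each GPU and the remaining four go to the CPU. In transformers itself, the placement chosen for a loaded model can be inspected through `model.hf_device_map`, and per-device limits can be passed with the `max_memory` argument of `from_pretrained`.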

Cheers!


zhaoyuchen1128 commented on August 16, 2024


Thank you. But loading the model seems very slow with two RTX 4090s. Is that normal? The model is llama3:70b, about 140 GB.


RUFFY-369 commented on August 16, 2024

@zhaoyuchen1128 if model loading feels slow, please go through this guide on improving speed with the accelerate library. It covers the details and should answer most questions; if any remain, the issues are always open.
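As a rough back-of-the-envelope check on why this setup may feel slow, the sketch below compares the size of the weights with the combined GPU memory. It assumes a round 70 billion parameters stored in 16-bit precision (which matches the 140 GB figure mentioned above) and 24 GB of VRAM per RTX 4090; it ignores activations, the KV cache, and framework overhead, so the real headroom is smaller.

```python
def weights_size_gb(num_params: int, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 70_000_000_000  # assumption: round 70B parameters
BYTES_FP16 = 2               # 16-bit weights take 2 bytes each
VRAM_PER_GPU_GB = 24         # RTX 4090
NUM_GPUS = 2

model_gb = weights_size_gb(NUM_PARAMS, BYTES_FP16)
total_vram_gb = VRAM_PER_GPU_GB * NUM_GPUS
offloaded_gb = max(0.0, model_gb - total_vram_gb)

print(f"weights:        {model_gb:.0f} GB")
print(f"total GPU VRAM: {total_vram_gb} GB")
print(f"must live on CPU RAM or disk: {offloaded_gb:.0f} GB")
```

Under these assumptions roughly 92 GB of the 140 GB cannot sit on the GPUs at all, so with device_map="auto" most of the model ends up on the CPU or on disk, and both loading and inference are bounded by that slower storage.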

Cheers!
