Is there a way to run inference on OPT models in tensor-parallel (TP) or pipeline-parallel (PP) mode?


OPT in TP or PP mode about deepspeed-mii HOT 7 CLOSED

microsoft commented on June 22, 2024

OPT in TP or PP mode

from deepspeed-mii.

Comments (7)

mrwyattii commented on June 22, 2024 1

@Tianwei-She responded to your other issue with a solution (#99).

@volkerha I've made some changes to how we load models in #105. This doesn't completely address the issue of needing to load multiple copies of a model when using tensor parallelism, but we do have plans to address this further. I'll leave this issue open for now and file it under "Enhancement".


mrwyattii commented on June 22, 2024 1

#199 adds support for loading models other than BLOOM (including GPT-NeoX, GPT-J, and OPT) using meta tensors. This resolves the problem of loading the model into memory multiple times.


mrwyattii commented on June 22, 2024

Hi @volkerha, thanks for using MII! If you take a look here, you'll see that regardless of the provider the models are processed by the DeepSpeed Inference Engine. This allows any of the models to be run on multi-GPU setups (using TP). To enable this, just add "tensor_parallel":2 to your mii_config dict passed to mii.deploy(). Some of our examples demonstrate this: https://github.com/microsoft/DeepSpeed-MII/blob/main/examples/local/text-generation-bloom-example.py


mrwyattii commented on June 22, 2024

Here's an example for an OPT model that I just tested on 2 GPUs:

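import mii

# Run in FP16 and split the model across 2 GPUs with tensor parallelism.
mii_config = {"dtype": "fp16", "tensor_parallel": 2}
name = "facebook/opt-1.3b"

# Deploy the model; it is served through the DeepSpeed Inference Engine.
mii.deploy(
    task="text-generation",
    model=name,
    deployment_name=name + "_deployment",
    mii_config=mii_config,
)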


volkerha commented on June 22, 2024

I tested facebook/opt-6.7b on 8 GPUs with TP=8, FP16. It takes around 28GB per GPU, which looks like it's loading the full model parameters in FP32 on every GPU (6.7B * 4 bytes ~= 27GB), maybe because fp16 is only applied after model loading?
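The arithmetic behind that estimate can be sketched as follows. This is only a rough, weights-only calculation: it ignores activations, CUDA context, and framework buffers, and uses 1 GB = 1e9 bytes. The helper name `weight_memory_gb` is made up for illustration and is not part of MII or DeepSpeed.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int, tp_degree: int = 1) -> float:
    """Weight-only memory per GPU in GB (1 GB = 1e9 bytes).

    tp_degree=1 models a full copy on each GPU; tp_degree=N models an
    even split of the weights across N GPUs.
    """
    return num_params * bytes_per_param / tp_degree / 1e9


OPT_6_7B_PARAMS = 6.7e9  # approximate parameter count of facebook/opt-6.7b

full_fp32 = weight_memory_gb(OPT_6_7B_PARAMS, bytes_per_param=4)
full_fp16 = weight_memory_gb(OPT_6_7B_PARAMS, bytes_per_param=2)
sharded_fp16 = weight_memory_gb(OPT_6_7B_PARAMS, bytes_per_param=2, tp_degree=8)

print(f"full FP32 copy per GPU:  {full_fp32:.1f} GB")
print(f"full FP16 copy per GPU:  {full_fp16:.1f} GB")
print(f"FP16 sharded with TP=8:  {sharded_fp16:.2f} GB")
```

The observed ~28GB per GPU is close to the full FP32 figure and far from the sharded FP16 one, which is consistent with a full, unsharded copy being loaded on every GPU.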


mrwyattii commented on June 22, 2024

@volkerha you are correct: currently, with the huggingface provider, we load the full model onto each GPU here. Once we call deepspeed.init_inference on this line, the model gets split across multiple GPUs.

I can see how this would be problematic if you don't have enough memory to load the full model on each GPU. We have a workaround that uses meta-tensors (like with the llm provider), but I don't think it's compatible with how we load other huggingface models. @jeffra thoughts on this?


Tianwei-She commented on June 22, 2024

@mrwyattii Hi, I'm getting CUDA OOM errors when loading an EleutherAI/gpt-neox-20b model onto 8 GPUs with TP=8, FP16. Each GPU has 23GB. Is this expected? And does this mean I should use the meta-tensors workaround you mentioned above to load this model? Thanks!
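A back-of-the-envelope check of whether this setup can fit, assuming the load path described earlier in the thread (a full copy of the model on each GPU before deepspeed.init_inference splits it) also applies to this model. It counts weights only, takes 20B as an approximate parameter count, and uses 1 GB = 1e9 bytes. The helper `fits_on_gpu` is hypothetical, not an MII or DeepSpeed function.

```python
def fits_on_gpu(num_params, bytes_per_param, gpu_capacity_gb, tp_degree=1):
    """Return (needed_gb, fits) for the weights placed on a single GPU."""
    needed_gb = num_params * bytes_per_param / tp_degree / 1e9
    return needed_gb, needed_gb <= gpu_capacity_gb


NEOX_PARAMS = 20e9  # approximate size of EleutherAI/gpt-neox-20b
GPU_GB = 23         # per-GPU memory reported in the comment above

scenarios = {
    "full FP32 copy per GPU": fits_on_gpu(NEOX_PARAMS, 4, GPU_GB),
    "full FP16 copy per GPU": fits_on_gpu(NEOX_PARAMS, 2, GPU_GB),
    "FP16 sharded, TP=8": fits_on_gpu(NEOX_PARAMS, 2, GPU_GB, tp_degree=8),
}

for label, (needed_gb, fits) in scenarios.items():
    verdict = "fits" if fits else "does not fit"
    print(f"{label}: {needed_gb:.0f} GB needed, {verdict} in {GPU_GB} GB")
```

Under these assumptions, even a full FP16 copy exceeds 23GB, so an OOM during loading would be expected unless the weights are sharded before they reach the GPU, which is what the meta-tensor loading path is meant to allow.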

