
Comments (3)

sayakpaul avatar sayakpaul commented on June 21, 2024

We cannot control the level of hierarchy project maintainers want to have in their projects.

For the things that are within our scope of control, we try to document them explicitly (such as enable_model_cpu_offload or enable_sequential_cpu_offload). When combined with other techniques such as prompt pre-computation and 8-bit inference of the text encoders, they reduce VRAM consumption considerably.

We cannot do everything on behalf of the users, as requirements vary from project to project, but we can provide simple and easy-to-use APIs that cover the most common use cases.

On a related note, @Wauplin has started to include a couple of modeling utilities in huggingface_hub (such as a model sharding utility), so I wonder whether this is something for him to consider. Or maybe it lies more within the scope of accelerate (perhaps something already exists that I cannot recall). So, ccing @muellerzr @SunMarc.

from diffusers.

doctorpangloss avatar doctorpangloss commented on June 21, 2024

I observe:

x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

assert memory_usage(x) + memory_usage(y) > gpu_memory_available()
load(x)
x()
unload(x)
load(y)
y()
unload(y)

This pattern gets reinvented over and over again by downstream users of diffusers and PyTorch.

load and unload are somewhat hacky ideas. They only make sense in a single-GPU, personal-computer context running one task at a time.
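As a minimal sketch of what load and unload typically amount to in practice (memory_usage, gpu_memory_available, and the AutoModel instances above are illustrative, so plain nn.Module stand-ins are used here):

```python
import torch

def load(model: torch.nn.Module, device: str = "cuda:0") -> None:
    """Move all parameters and buffers onto the GPU before the forward pass."""
    model.to(device)

def unload(model: torch.nn.Module) -> None:
    """Move the model back to host RAM and release cached GPU blocks."""
    model.to("cpu")
    torch.cuda.empty_cache()

# The round-trip that downstream projects keep rewriting:
x = torch.nn.Linear(8, 8)  # stand-ins for the AutoModel instances above
y = torch.nn.Linear(8, 8)
inp = torch.randn(2, 8)

for model in (x, y):
    if torch.cuda.is_available():
        load(model)
        model(inp.to("cuda:0"))
        unload(model)
    else:
        model(inp)  # CPU-only fallback; no load/unload needed
```

The pattern is just `.to(device)` round-trips plus cache management, which is exactly why it gets duplicated so easily.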


doctorpangloss avatar doctorpangloss commented on June 21, 2024

Example implementation:

x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0')):
  x(...)
  y(...)

Here, forward in the mixin would check a contextvar to determine which models are loaded and unloaded. Alternatively:

with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0'), models=[x, y]):
  x(...)
  y(...)

This variant could be implemented today with little issue.
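One way the explicit models=[x, y] variant could look, sketched with contextlib and PyTorch forward hooks (sequential_offload is not an existing diffusers API; this assumes the models are plain nn.Module instances):

```python
import contextlib
import torch

@contextlib.contextmanager
def sequential_offload(offload_device, load_device, models):
    """Park every model on offload_device; hooks move each one to
    load_device just before its forward pass and back right after."""
    handles = []

    def pre(module, args):
        module.to(load_device)

    def post(module, args, output):
        module.to(offload_device)
        return output

    try:
        for m in models:
            m.to(offload_device)
            handles.append(m.register_forward_pre_hook(pre))
            handles.append(m.register_forward_hook(post))
        yield
    finally:
        for h in handles:
            h.remove()
```

With this, calling x(...) inside the block briefly materializes x on load_device and returns it to offload_device afterwards, so only one model occupies GPU memory at a time.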

More broadly:

# for example, prefer weights and inferencing in bfloat16, then gptq 8 bit if supported, then bitsandbytes 8 bit, etc.
llm_strategy = BinPacking(
  devices=["cuda:0", "cuda:1", ...],
  levels=[
    BitsAndBytesConfig(...),
    GPTQConfig(...),
    torch.bfloat16
  ]
)

# some models do not perform well at 8 bit, for example, so 8 bit should not be used for them at all
unet_strategy = BinPacking(
  levels=[torch.float16, torch.bfloat16]
)

# maybe there is a separate inference and weights strategy
t5_strategy = BinPacking(load_in=[torch.float8_e4m3fn, torch.float16], compute_in=[torch.float16])

with model_management(strategy=llm_strategy):
  x(...)
  with model_management(strategy=unet_strategy):
    y(...)
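A toy reading of the BinPacking idea: walk the preference-ordered levels and pick the first one whose weight footprint fits the device's free memory. Everything here is hypothetical, including the bytes-per-parameter mapping and the string level names standing in for the dtype/quantization configs above:

```python
from dataclasses import dataclass

# Approximate weight storage cost per parameter, by level (illustrative).
BYTES_PER_PARAM = {"int8": 1, "bfloat16": 2, "float16": 2, "float32": 4}

@dataclass
class BinPacking:
    levels: list  # preference-ordered precision levels

    def choose(self, num_params: int, free_bytes: int):
        """Return the first level whose weights fit in free_bytes, else None."""
        for level in self.levels:
            if num_params * BYTES_PER_PARAM[level] <= free_bytes:
                return level
        return None

# 7B parameters with 16 GiB free: bfloat16 needs ~14 GB, so it is chosen
# ahead of the int8 fallback.
strategy = BinPacking(levels=["bfloat16", "int8"])
chosen = strategy.choose(7_000_000_000, 16 * 1024**3)
```

A real strategy would also account for activations, KV caches, and multi-device placement, but the core selection loop could look roughly like this.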

Ultimately downstream projects keep writing this, and it fuels a misconception that "Hugging Face" doesn't "run on my machine."

For example: "Automatically loading/unloading models from memory [is something that Ollama supports that Hugging Face does not]"

You guys are fighting pretty pervasive misconceptions too: "Huggingface isn't local"

Quoting the comment above: "We cannot control the level of hierarchy project maintainers want to have in their projects."

Perhaps the contextlib approach is best.

