
Comments (3)

sayakpaul avatar sayakpaul commented on June 21, 2024

We cannot control the level of hierarchy project maintainers want to have in their projects.

For the things that are within our scope of control, we try to document them explicitly (such as enable_model_cpu_offload or enable_sequential_cpu_offload). When combined with other techniques such as prompt pre-computation and 8-bit inference of the text encoders, they reduce VRAM consumption considerably.

We cannot do everything on behalf of the users, as requirements vary from project to project, but we can provide simple and easy-to-use APIs that cover the most common use cases.

On a related note, @Wauplin has started to include a couple of modeling utilities in huggingface_hub (such as a model sharding utility), so I wonder whether this is something for him to consider. Or maybe it lies more within the scope of accelerate (perhaps something already exists that I cannot recall). So, ccing @muellerzr @SunMarc.

from diffusers.

doctorpangloss avatar doctorpangloss commented on June 21, 2024

I observe:

x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

assert memory_usage(x) + memory_usage(y) > gpu_memory_available()
load(x)
x()
unload(x)
load(y)
y()
unload(y)

This pattern gets reinvented over and over again by downstream users of diffusers and PyTorch.

load and unload are somewhat hacky ideas. They only make sense in a single-GPU, personal-computer context running one task at a time.
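As a minimal sketch of what load and unload typically amount to in practice (memory_usage, gpu_memory_available, and the AutoModel instances above are illustrative, so plain nn.Module stand-ins are used here):

```python
import torch

def load(model: torch.nn.Module, device: str = "cuda:0") -> None:
    """Move all parameters and buffers onto the GPU before the forward pass."""
    model.to(device)

def unload(model: torch.nn.Module) -> None:
    """Move the model back to host RAM and release cached GPU blocks."""
    model.to("cpu")
    torch.cuda.empty_cache()

# The round-trip that downstream projects keep rewriting:
x = torch.nn.Linear(8, 8)  # stand-ins for the AutoModel instances above
y = torch.nn.Linear(8, 8)
inp = torch.randn(2, 8)

for model in (x, y):
    if torch.cuda.is_available():
        load(model)
        model(inp.to("cuda:0"))
        unload(model)
    else:
        model(inp)  # CPU-only fallback; no load/unload needed
```

The pattern is just `.to(device)` round-trips plus cache management, which is exactly why it gets duplicated so easily.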


doctorpangloss avatar doctorpangloss commented on June 21, 2024

Example implementation:

x = AutoModel.from_pretrained(...)
y = AutoModel.from_pretrained(...)

with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0')):
  x(...)
  y(...)

Here, forward in the mixin would check a contextvar to determine which models are loaded and unloaded. Alternatively:

with sequential_offload(offload_device=torch.device('cpu'), load_device=torch.device('cuda:0'), models=[x, y]):
  x(...)
  y(...)

This variant could be implemented today with little issue.
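One way the explicit models=[x, y] variant could look, sketched with contextlib and PyTorch forward hooks (sequential_offload is not an existing diffusers API; this assumes the models are plain nn.Module instances):

```python
import contextlib
import torch

@contextlib.contextmanager
def sequential_offload(offload_device, load_device, models):
    """Park every model on offload_device; hooks move each one to
    load_device just before its forward pass and back right after."""
    handles = []

    def pre(module, args):
        module.to(load_device)

    def post(module, args, output):
        module.to(offload_device)
        return output

    try:
        for m in models:
            m.to(offload_device)
            handles.append(m.register_forward_pre_hook(pre))
            handles.append(m.register_forward_hook(post))
        yield
    finally:
        for h in handles:
            h.remove()
```

With this, calling x(...) inside the block briefly materializes x on load_device and returns it to offload_device afterwards, so only one model occupies GPU memory at a time.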

More broadly:

# for example, prefer weights and inferencing in bfloat16, then gptq 8 bit if supported, then bitsandbytes 8 bit, etc.
llm_strategy = BinPacking(
  devices=["cuda:0", "cuda:1", ...],
  levels=[
    BitsAndBytesConfig(...),
    GPTQConfig(...),
    torch.bfloat16
  ]
)

# some models do not perform well at 8 bit, for example, so 8 bit should not be used for them at all
unet_strategy = BinPacking(
  levels=[torch.float16, torch.bfloat16]
)

# maybe there is a separate inference and weights strategy
t5_strategy = BinPacking(load_in=[torch.float8_e4m3fn, torch.float16], compute_in=[torch.float16])

with model_management(strategy=llm_strategy):
  x(...)
  with model_management(strategy=unet_strategy):
    y(...)
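A toy reading of the BinPacking idea: walk the preference-ordered levels and pick the first one whose weight footprint fits the device's free memory. Everything here is hypothetical, including the bytes-per-parameter mapping and the string level names standing in for the dtype/quantization configs above:

```python
from dataclasses import dataclass

# Approximate weight storage cost per parameter, by level (illustrative).
BYTES_PER_PARAM = {"int8": 1, "bfloat16": 2, "float16": 2, "float32": 4}

@dataclass
class BinPacking:
    levels: list  # preference-ordered precision levels

    def choose(self, num_params: int, free_bytes: int):
        """Return the first level whose weights fit in free_bytes, else None."""
        for level in self.levels:
            if num_params * BYTES_PER_PARAM[level] <= free_bytes:
                return level
        return None

# 7B parameters with 16 GiB free: bfloat16 needs ~14 GB, so it is chosen
# ahead of the int8 fallback.
strategy = BinPacking(levels=["bfloat16", "int8"])
chosen = strategy.choose(7_000_000_000, 16 * 1024**3)
```

A real strategy would also account for activations, KV caches, and multi-device placement, but the core selection loop could look roughly like this.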

Ultimately downstream projects keep writing this, and it fuels a misconception that "Hugging Face" doesn't "run on my machine."

For example: "Automatically loading/unloading models from memory [is something that Ollama supports that Hugging Face does not]"

You guys are fighting pretty pervasive misconceptions too: "Huggingface isn't local"

Quoting the comment above: "We cannot control the level of hierarchy project maintainers want to have in their projects."

Perhaps the contextlib approach is best.

