timdettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.

Home Page: https://huggingface.co/docs/bitsandbytes/main/en/index

License: MIT License

Python 61.47% C 0.07% Cuda 24.95% Shell 1.11% C++ 10.79% CMake 1.10% Metal 0.32% Objective-C++ 0.18%

bitsandbytes's Introduction

`bitsandbytes`

The bitsandbytes library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions.

The library includes quantization primitives for 8-bit and 4-bit operations through bitsandbytes.nn.Linear8bitLt and bitsandbytes.nn.Linear4bit, and 8-bit optimizers through the bitsandbytes.optim module.

There are ongoing efforts to support further hardware backends, i.e. Intel CPU + GPU, AMD GPU, Apple Silicon. Windows support is quite far along and is on its way as well.

Please head to the official documentation page:

https://huggingface.co/docs/bitsandbytes/main

License

The majority of bitsandbytes is licensed under MIT. However, small portions of the project are available under separate license terms: the parts adapted from PyTorch are licensed under the BSD license.

We thank Fabio Cannizzo for his work on FastBinarySearch which we use for CPU quantization.


bitsandbytes's Issues

UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...

I'm getting a weird error and can't figure out what is wrong. I installed it before without issues...

/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:106: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
  f'{candidate_env_vars["LD_LIBRARY_PATH"]} did not contain '
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:28: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('["--ip=172.28.0.2"],"debugAdapterMultiplexerPath"'), PosixPath('"172.28.0.3","jupyterArgs"'), PosixPath('6000,"kernelManagerProxyHost"'), PosixPath('{"kernelManagerProxyPort"'), PosixPath('"/usr/local/bin/dap_multiplexer","enableLsp"'), PosixPath('true}')}
  "WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:28: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
  "WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:28: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
  "WARNING: The following directories listed in your path were found to "
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 112
CUDA SETUP: Loading binary /usr/local/lib/python3.7/dist-packages/bitsandbytes/libbitsandbytes_cuda112.so...
Caching latents: 100% 50/50 [00:15<00:00, 3.28it/s]
Steps: 0% 0/1000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train_dreambooth.py", line 646, in <module>
    main()
  File "train_dreambooth.py", line 591, in main
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.7/dist-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_2d_condition.py", line 327, in forward
    upsample_size=upsample_size,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_blocks.py", line 1149, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 169, in forward
    hidden_states = block(hidden_states, context=context)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 218, in forward
    hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 293, in forward
    hidden_states = self._attention(query, key, value)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 302, in _attention
    attention_probs = attention_scores.softmax(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 14.76 GiB total capacity; 12.29 GiB already allocated; 945.75 MiB free; 12.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps: 0% 0/1000 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=/content/data/davidPozo', '--class_data_dir=/content/data/guy', '--output_dir=/content/drive/MyDrive/stable_diffusion_weights/davidPozo', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=davidPozo', '--class_prompt=guy', '--seed=1337', '--resolution=512', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=50', '--sample_batch_size=4', '--max_train_steps=1000']' returned non-zero exit status 1.

Cuda version 11.8 eta?

Hi, I just reinstalled CUDA and picked the latest version, 11.8. I don't see a file for libbitsandbytes_cuda118.so. I'll uninstall and downgrade.
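
The load messages quoted in other reports on this page ("Detected CUDA version 112" followed by "Loading binary ...libbitsandbytes_cuda112.so", and a "_nocublaslt" variant) suggest that the binary name is built from the CUDA version with the dot removed. A minimal sketch of that naming pattern, inferred from those logs rather than taken from the library source; `expected_binary_name` is a hypothetical helper, not a bitsandbytes function:

```python
def expected_binary_name(cuda_version: str, has_cublaslt: bool = True) -> str:
    """Guess the shared-library file name for a CUDA version such as "11.8".

    The pattern is an assumption read off the log lines, not the library's API.
    """
    major, minor = cuda_version.split(".")[:2]
    suffix = "" if has_cublaslt else "_nocublaslt"
    return f"libbitsandbytes_cuda{major}{minor}{suffix}.so"


print(expected_binary_name("11.8"))
```

Comparing the printed name against the files in the installed bitsandbytes package directory shows quickly whether a prebuilt binary for the local CUDA version is present.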

Is it possible to use int8 for other task?

From your paper and blog post, the quantization was tested using transformers. Is it possible to use the library for object detection models, and will it cause performance degradation?

raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

I was running `!accelerate launch train_dreambooth.py` and came up with this error:

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--num_cpu_threads_per_process` was set to `1` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:99: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
  f'{candidate_env_vars["LD_LIBRARY_PATH"]} did not contain '
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('{"kernelManagerProxyPort"'), PosixPath('"/usr/local/bin/dap_multiplexer","enableLsp"'), PosixPath('6000,"kernelManagerProxyHost"'), PosixPath('true}'), PosixPath('["--ip=172.28.0.2"],"debugAdapterMultiplexerPath"'), PosixPath('"172.28.0.3","jupyterArgs"')}
  "WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
  "WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
  "WARNING: The following directories listed in your path were found to "
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 111
CUDA SETUP: Loading binary /usr/local/lib/python3.7/dist-packages/bitsandbytes/libbitsandbytes_cuda111.so...
Caching latents: 100% 12/12 [00:11<00:00,  1.04it/s]
Steps:   0% 0/900 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train_dreambooth.py", line 657, in <module>
    main()
  File "train_dreambooth.py", line 600, in main
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.7/dist-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_2d_condition.py", line 309, in forward
    upsample_size=upsample_size,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_blocks.py", line 1151, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 154, in forward
    hidden_states = block(hidden_states, context=context)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 203, in forward
    hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 278, in forward
    hidden_states = self._attention(query, key, value)
  File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 287, in _attention
    attention_probs = attention_scores.softmax(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 14.76 GiB total capacity; 12.29 GiB already allocated; 977.75 MiB free; 12.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0% 0/900 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--use_auth_token', '--instance_data_dir=/content/data/yasirInput', '--class_data_dir=/content/data/person', '--output_dir=/content/drive/MyDrive/stable_diffusion_weights/yasirOutput', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=photosofyasir', '--class_prompt=person', '--seed=1337', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=12', '--sample_batch_size=4', '--max_train_steps=900']' returned non-zero exit status 1.
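
The out-of-memory message above carries enough numbers to check whether the fragmentation hint applies. A small sketch that pulls the sizes out of such a message and compares reserved with allocated memory; `parse_oom` is a hypothetical helper, and the regular expressions assume the message format shown in this log:

```python
import re


def parse_oom(message: str) -> dict:
    """Extract the sizes (converted to GiB) from a PyTorch CUDA OOM message."""
    to_gib = {"MiB": 1 / 1024, "GiB": 1.0}
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[key] = float(match.group(1)) * to_gib[match.group(2)]
    return sizes


msg = (
    "CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 14.76 GiB total capacity; "
    "12.29 GiB already allocated; 977.75 MiB free; 12.56 GiB reserved in total by PyTorch)"
)
sizes = parse_oom(msg)
# Memory that PyTorch holds in its cache but has not handed out to tensors.
cached_but_unused = sizes["reserved"] - sizes["allocated"]
print(
    f"requested {sizes['requested']:.2f} GiB, free {sizes['free']:.2f} GiB, "
    f"reserved-but-unallocated {cached_but_unused:.2f} GiB"
)
```

In this log the request is larger than the free memory, and reserved memory is only slightly above allocated memory, so the max_split_size_mb hint is probably not the main factor here; the run most likely just needs more memory than the GPU has.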

Training Gives an Error

Steps: 0% 2/2500 [00:09<3:18:12, 4.76s/it, loss=0.239, lr=5e-6]Traceback (most recent call last):
  File "train_dreambooth.py", line 658, in <module>
    main()
  File "train_dreambooth.py", line 618, in main
    accelerator.backward(loss)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 882, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 166, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 68, in _make_grads
    new_grads.append(torch.ones_like(out, memory_format=torch.preserve_format))
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Steps: 0% 2/2500 [00:10<3:35:23, 5.17s/it, loss=0.239, lr=5e-6]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--use_auth_token', '--instance_data_dir=/content/data/sks', '--class_data_dir=/content/data/guy', '--output_dir=/content/drive/MyDrive/stable_diffusion_weights/sks', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=sks', '--class_prompt=guy', '--seed=1337', '--resolution=896', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--cache_latents', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=12', '--sample_batch_size=4', '--max_train_steps=2500']' returned non-zero exit status 1.

Cannot load it with T5 - RTX 5000, Cuda 11.3

When I try:

from transformers import T5ForConditionalGeneration,T5Tokenizer,T5TokenizerFast
model2 = T5ForConditionalGeneration.from_pretrained("3b_m1", device_map='auto' , load_in_8bit=True)

I get:

TypeError: __init__() got an unexpected keyword argument 'load_in_8bit'

EDIT: this error stopped appearing after I restarted the kernel, but now I get this error:

#######################

  /opt/conda/lib/python3.7/site-packages/bitsandbytes/functional.py in get_colrow_absmax(A, row_stats, col_stats,    nnz_block_ptr, threshold)

1494 prev_device = pre_call(A.device)
1495 is_on_gpu([A, row_stats, col_stats, nnz_block_ptr])
-> 1496 lib.cget_col_row_stats(ptrA, ptrRowStats, ptrColStats, ptrNnzrows, ct.c_float(threshold), rows, cols)
1497 post_call(prev_device)
1498

/opt/conda/lib/python3.7/ctypes/__init__.py in __getattr__(self, name)
375 if name.startswith('__') and name.endswith('__'):
376 raise AttributeError(name)
--> 377 func = self.__getitem__(name)
378 setattr(self, name, func)
379 return func

/opt/conda/lib/python3.7/ctypes/__init__.py in __getitem__(self, name_or_ordinal)
380
381 def __getitem__(self, name_or_ordinal):
--> 382 func = self._FuncPtr((name_or_ordinal, self))
383 if not isinstance(name_or_ordinal, int):
384 func.__name__ = name_or_ordinal

AttributeError: /opt/conda/lib/python3.7/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol:     cget_col_row_stats

import transformers
print(transformers.__version__)
4.22.0.dev0

GPU: RTX 5000

!conda list | grep cudatoolkit
cudatoolkit 11.3.1

Ran into crashes when testing LLM.int8() from transformers

Hi, I was testing LLM.int8() on the LongT5 model, but I consistently ran into the following errors:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 110
CUDA SETUP: Loading binary /opt/conda/envs/python38/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so...

python3: /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu:375: int igemmlt(cublasLtHandle_t, int, int, int, const int8_t*, const int8_t*, void*, float*, int, int, int) [with int FORMATB = 3; int DTYPE_OUT = 32; int SCALE_ROWS = 0; cublasLtHandle_t = cublasLtContext*; int8_t = signed char]: Assertion `false' failed.
Aborted

Sample script to reproduce:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('google/t5-v1_1-large')
model_8bit = AutoModelForSeq2SeqLM.from_pretrained('google/t5-v1_1-large', device_map="auto", load_in_8bit=True)

sentences = ['hello world']

inputs = tokenizer(sentences, return_tensors="pt", padding=True)

output_sequences = model_8bit.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=256
)

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))

T5 conversion issue

Not sure where this issue belongs, but I figured I'd put it here in case anyone else runs into the same thing.

When running generate on a converted T5 model, I got the following error:
AttributeError: 'Parameter' object has no attribute 'CB'

It turned out that T5ForConditionalGeneration.named_parameters() didn't iterate over lm_head, so I was able to fix it by changing

https://github.com/huggingface/transformers/blob/c8b6ae858d61e5bc10e388d095aa74f7690d1021/src/transformers/utils/bitsandbytes.py#L139-L142

    # otherwise they have an attached head
    list_modules = list(model.named_parameters())
    last_name = list_modules[-1][0]
    return last_name.split(".")[0]

to simply

    return "lm_head"

Not sure if it's just my version of torch, or where this issue should go, but here is my environment:
torch: 1.13.0a0+340c412
cuda: 11.7
bnb: 0.31.8
transformers: 4.22.0.dev0

load_in_8bit is not working for some huggingface model

I have updated the transformers package and I am using ViLT model: https://huggingface.co/docs/transformers/model_doc/vilt#transformers.ViltForQuestionAnswering

I am getting an error. Is load_in_8bit not integrated with all Hugging Face models? Could you please let me know how to use load_in_8bit for any Hugging Face model, not just BLOOM and T5?

Bug: RuntimeError: "topk_cpu" not implemented for 'Half'

When using a bloom model with generate, I get

 RuntimeError: "topk_cpu" not implemented for 'Half'

when do_sample=True.

I.e.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_NEW_TOKENS = 128
model_name = 'bigscience/bloom-560m'

text = """
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes. 
How many punches did he throw?\n
A: Let’s think step by step.\n"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(text, return_tensors="pt").input_ids

free_in_GB = int(torch.cuda.mem_get_info()[0]/1024**3)
max_memory = f'{free_in_GB-6}GB'

n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}

model = AutoModelForCausalLM.from_pretrained(
  model_name,
  cache_dir="/home/code/fm_in_context_eval/transformers_cache",
  device_map='auto', 
  load_in_8bit=True, 
  max_memory=max_memory
)
generated_ids = model.generate(input_ids, max_length=len(input_ids[0])+1, do_sample=True)
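
A possible explanation, not confirmed in this report: `input_ids` stays on the CPU while the model runs on the GPU, so the sampling step may end up working on half-precision logits on the CPU, where `topk` has no Half kernel. Under that assumption, a minimal workaround sketch is to move every input to the model's device before calling `generate`. `move_to_device` is a hypothetical helper; it relies only on the `.to(device)` method that PyTorch tensors provide.

```python
from typing import Any, Mapping


def move_to_device(inputs: Mapping[str, Any], device: Any) -> dict:
    """Return a copy of `inputs` with every tensor-like value moved to `device`.

    Values without a `.to` method (for example plain ints) pass through unchanged.
    """
    return {
        name: value.to(device) if hasattr(value, "to") else value
        for name, value in inputs.items()
    }


# Intended use with the script above (assumes a CUDA device is available):
#   inputs = move_to_device(tokenizer(text, return_tensors="pt"), "cuda:0")
#   generated_ids = model.generate(**inputs, do_sample=True)
```

If the error persists with all inputs on the GPU, the cause lies elsewhere and this sketch does not apply.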

I wonder how does this compare and if you could leverage Nvidia transformer engine

In the 2023 architecture Hopper:

The Transformer Engine intelligently manages and dynamically chooses between FP8 and 16-bit calculations, automatically handling re-casting and scaling between FP8 and 16-bit in each layer to deliver up to 9x faster AI training and up to 30x faster AI inference speedups on large language models compared to the prior generation A100.

ValueError("8-bit operations on `bitsandbytes` are not supported under CPU!")

Hi Tim,

Thanks for your awesome work!

I'm using your method to load the largest BLOOM model (the BLOOM model with 176b parameters) onto 1 node with 8 GPUs.

model = AutoModelForCausalLM.from_pretrained(
                "bloom", 
                device_map="auto", 
                load_in_8bit=True,
            )

This line works for all the other, smaller BLOOM models, e.g. bloom-7b1. However, when loading bloom (176b) I got the error "8-bit operations on bitsandbytes are not supported under CPU!".

File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 463, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2182, in from_pretrained
    raise ValueError("8-bit operations on `bitsandbytes` are not supported under CPU!")
ValueError: 8-bit operations on `bitsandbytes` are not supported under CPU!

In my understanding, this is because some modules of the model are automatically loaded onto the CPU, which didn't happen with the smaller models. Is there a way to force the model to be loaded onto GPUs only? Or do you have any advice on how to bypass this error? Thanks!!
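
A rough back-of-the-envelope check can show whether the weights alone fit on the GPUs. The sketch below assumes 1 byte per parameter for int8 weights and ignores layers kept in higher precision, activations and framework overhead, so it is only a lower bound; `weight_gib_per_gpu` is a hypothetical helper.

```python
def weight_gib_per_gpu(n_params: float, n_gpus: int, bytes_per_param: float = 1.0) -> float:
    """Lower bound on weight memory per GPU (GiB), assuming an even split."""
    total_bytes = n_params * bytes_per_param
    return total_bytes / n_gpus / 2**30


per_gpu = weight_gib_per_gpu(176e9, 8)
print(f"~{per_gpu:.1f} GiB of int8 weights per GPU")
```

If this lower bound is close to or above the memory actually free on each GPU, automatic device placement has to put some modules elsewhere, which would be consistent with the error above.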

Tianwei

TypeError: argument of type 'WindowsPath' is not iterable

I seem to be having an issue loading the bitsandbytes library on Windows (it works on Colab without issue). EDIT: not sure whether this is bitsandbytes alone or something else; apologies if this is unrelated.

Error message:

Traceback (most recent call last):
  File "C:\Users\User\text-gen\alt2.py", line 4, in <module>
    import bitsandbytes
  File "C:\Users\User\text-gen\textenv\lib\site-packages\bitsandbytes\__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "C:\Users\User\text-gen\textenv\lib\site-packages\bitsandbytes\autograd\_functions.py", line 4, in <module>
    import bitsandbytes.functional as F
  File "C:\Users\User\text-gen\textenv\lib\site-packages\bitsandbytes\functional.py", line 14, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "C:\Users\User\text-gen\textenv\lib\site-packages\bitsandbytes\cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "C:\Users\User\text-gen\textenv\lib\site-packages\bitsandbytes\cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "C:\Users\User\text-gen\textenv\lib\site-packages\bitsandbytes\cextension.py", line 31, in initialize
    self.lib = ct.cdll.LoadLibrary(binary_path)
  File "E:\Anaconda\lib\ctypes\__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
  File "E:\Anaconda\lib\ctypes\__init__.py", line 364, in __init__
    if '/' in name or '\\' in name:
TypeError: argument of type 'WindowsPath' is not iterable

A search around the issue gave me this: https://bugs.python.org/issue39243
The code that I'm trying to run is essentially trying to get the hivemind gpt-j 8-bit model running on my 1080ti (again, it works in Colab).
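
The traceback shows ctypes evaluating `'/' in name` on the value passed to `LoadLibrary`, and a path object does not support the `in` operator. A minimal sketch of the likely workaround is to convert the path to a plain string before loading. This only illustrates the conversion and does not load any library; `as_library_name` is a hypothetical helper, and the file path is copied from the log in this report.

```python
import os
from pathlib import PureWindowsPath


def as_library_name(path) -> str:
    """Convert a path-like object to the plain string that ctypes expects."""
    return os.fspath(path)


binary_path = PureWindowsPath(
    r"E:\Anaconda\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so"
)

try:
    "\\" in binary_path  # the same kind of membership test ctypes performs
    membership_ok = True
except TypeError:
    membership_ok = False  # path objects are not iterable

name = as_library_name(binary_path)
print(membership_ok, "\\" in name)
```

With the string form, a call such as `ct.cdll.LoadLibrary(name)` no longer trips over the membership test, although on Windows the load can still fail for other reasons (for example, a `.so` file is not a Windows DLL).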


UPDATE: 
Same issue even with a fresh install and simply attempting to load the library

(base) C:\Users\User\text-gen>pip install git+https://github.com/TimDettmers/bitsandbytes
Collecting git+https://github.com/TimDettmers/bitsandbytes
Cloning https://github.com/TimDettmers/bitsandbytes to c:\users\user\appdata\local\temp\pip-req-build-vh0wtmvn
Running command git clone --filter=blob:none --quiet https://github.com/TimDettmers/bitsandbytes 'C:\Users\User\AppData\Local\Temp\pip-req-build-vh0wtmvn'
Resolved https://github.com/TimDettmers/bitsandbytes to commit f0ae860
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done

(base) C:\Users\User\text-gen>python
Python 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.

>>> import bitsandbytes

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('E')}
E:\Anaconda\lib\site-packages\bitsandbytes\cuda_setup\paths.py:98: UserWarning: E:\Anaconda did not contain libcudart.so as expected! Searching further paths...
warn(
WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/SteamLibrary/steamapps/common/Besiege/Besiege_Data/Managed'), WindowsPath('F')}
WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/SteamLibrary/steamapps/common/Besiege/Besiege_Data/Managed'), WindowsPath('F')}
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary E:\Anaconda\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "E:\Anaconda\lib\site-packages\bitsandbytes\__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "E:\Anaconda\lib\site-packages\bitsandbytes\autograd\_functions.py", line 4, in <module>
    import bitsandbytes.functional as F
  File "E:\Anaconda\lib\site-packages\bitsandbytes\functional.py", line 14, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "E:\Anaconda\lib\site-packages\bitsandbytes\cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "E:\Anaconda\lib\site-packages\bitsandbytes\cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "E:\Anaconda\lib\site-packages\bitsandbytes\cextension.py", line 31, in initialize
    self.lib = ct.cdll.LoadLibrary(binary_path)
  File "E:\Anaconda\lib\ctypes\__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
  File "E:\Anaconda\lib\ctypes\__init__.py", line 364, in __init__
    if '/' in name or '\\' in name:
TypeError: argument of type 'WindowsPath' is not iterable

Memory Decreases! But Latency Increases....

Things seem to be working as intended! I went from using GPT-J-6B with

model = AutoModelForCausalLM.from_pretrained("/mnt/models",torch_dtype=torch.float16,low_cpu_mem_usage=True).to(torch.device("cuda",0))

to

model = AutoModelForCausalLM.from_pretrained("/mnt/models",device_map="auto",load_in_8bit=True)

nvidia-smi reports a decrease in GPU memory consumption from ~15 GB to ~9 GB. Very nice!

However, I don't think we can use this in production, because the latency of text generation increases from ~3.5s to ~12s to generate 45 output tokens. I'm using something like:

output_ids = self.model.generate(
    input_ids.cuda(),
    max_length=45,
    do_sample=True,
    top_p=request.get("top_p", 1.0),
    top_k=request.get("top_k", 50),
   ...
)
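
To compare the fp16 and 8-bit configurations on equal footing, it helps to time the same call several times after a warm-up run. A minimal timing sketch using only the standard library; `median_latency` is a hypothetical helper, and for GPU work it assumes the timed function returns only after the results are ready (otherwise call `torch.cuda.synchronize()` inside it):

```python
import time
from statistics import median
from typing import Callable


def median_latency(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> float:
    """Run `fn` a few times and return the median wall-clock time in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed (lazy initialisation, caches)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return median(timings)


# Intended use with the call above (names taken from this report):
#   seconds = median_latency(lambda: self.model.generate(input_ids.cuda(), max_length=45))
#   print(f"{seconds:.2f} s per generate call")
```

Reporting the median over several runs makes it easier to tell a systematic slowdown from one-off start-up costs.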

Is this increase in latency known / expected? Or is it specific to my system? For reference, my reproducing Dockerfile is:

FROM nvidia/cuda:11.3.0-devel-ubuntu20.04

ARG DEBIAN_FRONTEND=noninteractive

ENV APP_HOME /app
WORKDIR $APP_HOME

# NVIDIA rotated their GPG keys, so we have to remove the old ones to do apt-get update
RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-get update && apt-get install -y build-essential wget vim git

RUN apt-get update
RUN apt-get install --yes git

# Note: we need curl for the liveness probe
RUN apt-get install --yes curl
RUN apt-get install --yes vim

# Install miniconda
ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
     /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# Install conda dependencies.
RUN conda install python=3.8
RUN conda install pytorch=1.12.1 cudatoolkit=11.3 -c pytorch

# Install pip deps
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt

# Copy local code to container image
COPY *.py ./

CMD ["python", "model.py"]

with requirements.txt being

kserve==0.9.0
git+https://github.com/huggingface/transformers.git@4a51075a96d2049f368b5f3dd6c0e9f08f599b62
accelerate==0.12.0
bitsandbytes==0.31.8

`OSError: File name too long` when environment variable is too long

Hello!

Summary

Long environment variable values containing / cause OSError: File name too long when bitsandbytes checks if the values are existing directories. This is the case e.g. if users have an environment variable containing a long bash script.

Background

As you know, step 3 of the approach of finding the CUDA path involves iterating over environment variables:

bitsandbytes/bitsandbytes/cuda_setup/env_vars.py

Lines 46 to 51 in 9b5f2ed

def get_potentially_lib_path_containing_env_vars() -> Dict[str, str]:
    return {
        env_var: value
        for env_var, value in os.environ.items()
        if is_relevant_candidate_env_var(env_var, value)
    }

For all environment variables that may contain a path, bitsandbytes checks if perhaps that variable points to the CUDA runtime library. One of the steps of this process is verifying whether the paths in the environment variables actually exist:

bitsandbytes/bitsandbytes/cuda_setup/paths.py

Lines 14 to 25 in 9b5f2ed

def remove_non_existent_dirs(candidate_paths: Set[Path]) -> Set[Path]:
    non_existent_directories: Set[Path] = {
        path for path in candidate_paths if not path.exists()
    }
    if non_existent_directories:
        warn(
            "WARNING: The following directories listed in your path were found to "
            f"be non-existent: {non_existent_directories}"
        )

    return candidate_paths - non_existent_directories

Bug details

In the previous snippet, the set comprehension at the top calls path.exists(), which can raise an exception if the environment value is too long and contains the path separator /. See the traceback below for details. Note that I've manually removed the latter half of the environment value that shows up in the traceback. It's just a simple script, but I'd rather not leak it.

Traceback

Traceback (most recent call last):
  File "./train.py", line 11, in <module>
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)
  File "[[removed]]/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 463, in from_pretrained
    return model_class.from_pretrained(
  File "[[removed]]/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2159, in from_pretrained
    from .utils.bitsandbytes import get_key_to_not_convert, replace_8bit_linear
  File "[[removed]]/lib/python3.8/site-packages/transformers/utils/bitsandbytes.py", line 10, in <module>
    import bitsandbytes as bnb
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 3, in <module>
    import bitsandbytes.functional as F
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/cextension.py", line 15, in initialize
    binary_name = evaluate_cuda_setup()
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py", line 123, in evaluate_cuda_setup
    cudart_path = determine_cuda_runtime_lib_path()
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/cuda_setup/paths.py", line 110, in determine_cuda_runtime_lib_path
    cuda_runtime_libs.update(find_cuda_lib_in(value))
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/cuda_setup/paths.py", line 46, in find_cuda_lib_in
    resolve_paths_list(paths_list_candidate)
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/cuda_setup/paths.py", line 41, in resolve_paths_list
    return remove_non_existent_dirs(extract_candidate_paths(paths_list_candidate))
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/cuda_setup/paths.py", line 15, in remove_non_existent_dirs
    non_existent_directories: Set[Path] = {
  File "[[removed]]/lib/python3.8/site-packages/bitsandbytes/cuda_setup/paths.py", line 16, in <setcomp>
    path for path in candidate_paths if not path.exists()
  File "/usr/lib/python3.8/pathlib.py", line 1407, in exists
    self.stat()
  File "/usr/lib/python3.8/pathlib.py", line 1198, in stat
    return self._accessor.stat(self)
OSError: [Errno 36] File name too long: '-0}" = \'1\' ]; then\n case [[removed]]'
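
One possible way to guard against this error is to treat any OSError raised by the existence check as "path does not exist". The sketch below is only an illustration under that assumption, not necessarily the fix the maintainers would adopt. `path_exists_safely` is a hypothetical helper, and `remove_non_existent_dirs` mirrors the function quoted above.

```python
from pathlib import Path
from typing import Set
from warnings import warn


def path_exists_safely(path: Path) -> bool:
    """Return True only if the path exists; OS-level errors count as 'missing'."""
    try:
        return path.exists()
    except OSError:
        # For example: [Errno 36] File name too long
        return False


def remove_non_existent_dirs(candidate_paths: Set[Path]) -> Set[Path]:
    non_existent_directories: Set[Path] = {
        path for path in candidate_paths if not path_exists_safely(path)
    }
    if non_existent_directories:
        warn(
            "WARNING: The following directories listed in your path were found to "
            f"be non-existent: {non_existent_directories}"
        )
    return candidate_paths - non_existent_directories
```

With this guard, an overly long environment value is reported in the warning like any other non-existent path instead of aborting the import.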

Tom Aarsen

RuntimeError: CUDA error: no kernel image is available for execution on the device

Bloom generation generated repeated characters

Firstly: Fantastic work! This is the way!

I followed the instructions in your doc file where instead of opt66b I used bloom and bloom-3b.

The models load properly on my 8 V100 32GB gpus (3b needs 1 gpu obviously).

Decoding also finishes but the output is problematic:

My input: text = """The translation of 'I am a boy' in French is"""
My output: The translation of 'I am a boy' in French is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is

This happens for both models.

Some details about my settings:

V100 gpus
transformers-4.22.0.dev0
CUDA 11.1
CUDNN 8.x
bitsandbytes (I am assuming it's the latest version compatible with CUDA 11.x)

Kindly let me know how this can be fixed.

Thanks and regards.

Anything but plain "greedy" search "not implemented for 'Half'"

I posted this issue to GH/HF/transformers, but it probably belongs here.
huggingface/transformers#19445 (same title)

Also, this may help. This Eleuther model is able to use sampling with int8 optimization. They actually refer to load_in_8bit, saying it has superseded their solution, but it doesn't do sampling.
"superceded bloom int8 optimized model doesnt do sampling"
https://huggingface.co/hivemind/gpt-j-6B-8bit/discussions/8
https://huggingface.co/hivemind/gpt-j-6B-8bit

Thanks Tim!

WARNING! libcudart.so not found in any environmental path.

/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:106: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
f'{candidate_env_vars["LD_LIBRARY_PATH"]} did not contain '
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:28: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('["--ip=172.28.0.2"],"debugAdapterMultiplexerPath"'), PosixPath('"172.28.0.3","jupyterArgs"'), PosixPath('{"kernelManagerProxyPort"'), PosixPath('6000,"kernelManagerProxyHost"'), PosixPath('"/usr/local/bin/dap_multiplexer","enableLsp"'), PosixPath('true}')}
"WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:28: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
"WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:28: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
"WARNING: The following directories listed in your path were found to "
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 112
CUDA SETUP: Loading binary /usr/local/lib/python3.7/dist-packages/bitsandbytes/libbitsandbytes_cuda112.so...
Caching latents: 0% 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_dreambooth.py", line 638, in <module>
    main()
  File "train_dreambooth.py", line 493, in main
    for batch in tqdm(train_dataloader, desc="Caching latents"):
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 721, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "train_dreambooth.py", line 262, in __getitem__
    instance_image = Image.open(self.instance_images_path[index % self.num_instance_images])
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2896, in open
    "cannot identify image file %r" % (filename if filename else fp)
PIL.UnidentifiedImageError: cannot identify image file '/content/data/sks/IMG_3133.HEIC'
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=/content/data/sks', '--class_data_dir=/content/data/guy', '--output_dir=/content/drive/MyDrive/stable_diffusion_weights/sks', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=noahnerd', '--class_prompt=guy', '--seed=1337', '--resolution=512', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=50', '--sample_batch_size=4', '--max_train_steps=1000']' returned non-zero exit status 1.

Unable to use load_in_8bit when the model is shared between GPU and CPU

It seems that bitsandbytes can't be used if the model is split between GPU and CPU.
I could not find any documentation saying that the entire model must be loaded on the GPU in order to use bitsandbytes,
so I'm not sure whether this is a bug or the expected behavior.

The environment setup:

pip install --extra-index-url https://download.pytorch.org/whl/cu116 torch==1.12.1+cu116
pip install transformers==4.22.1
pip install accelerate==0.12.0
pip install bitsandbytes==0.33.1

The main.py script:

from transformers import pipeline

auto_map = False
load_in_8bit = True

if auto_map:
    device_map = "auto"
else:
    device_map = {
        "transformer.wte": 0,
        "transformer.wpe": 0,
        "transformer.ln_f": "cpu",
        "lm_head": 0,
        "transformer.h.0": 0,
        "transformer.h.1": "cpu",
        "transformer.h.2": "cpu",
        "transformer.h.3": "cpu",
        "transformer.h.4": "cpu",
        "transformer.h.5": "cpu",
        "transformer.h.6": "cpu",
        "transformer.h.7": "cpu",
        "transformer.h.8": "cpu",
        "transformer.h.9": "cpu",
        "transformer.h.10": "cpu",
        "transformer.h.11": "cpu"
    }

pipe = pipeline(
    model="EleutherAI/gpt-neo-125M",
    max_length=32,
    model_kwargs={
        "device_map": device_map,
        "load_in_8bit": load_in_8bit
    }
)

print("\n", pipe("It was")[0]["generated_text"])

The auto_map and load_in_8bit flags at the top of the script control its settings.

When you run the script with auto_map = False and load_in_8bit = True then it crashes with this error:

❯ python main.py
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1.44k/1.44k [00:00<00:00, 634kB/s]

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/home/user/.gtkrc'), PosixPath('/etc/gtk/gtkrc')}

[... lots of similar warnings about non-existent paths ...]

CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Traceback (most recent call last):
  File "/home/user/test/bnb-test/main.py", line 37, in <module>
    print("\n", pipe("It was")[0]["generated_text"])
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 176, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1074, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1081, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 990, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 218, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/generation_utils.py", line 1319, in generate
    return self.greedy_search(
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/generation_utils.py", line 1713, in greedy_search
    outputs = self(
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 744, in forward
    transformer_outputs = self.transformer(
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 623, in forward
    outputs = block(
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 328, in forward
    attn_outputs = self.attn(
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 280, in forward
    return self.attention(
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 224, in forward
    query = self.q_proj(hidden_states)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 148, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 256, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 391, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 254, in forward
    state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
  File "/home/user/test/bnb-test/.venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1604, in transform
    prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

All other combinations of auto_map and load_in_8bit produce no error and give the generated_text.
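
Until the behavior is clarified, a script could check its own device map before enabling load_in_8bit. The following is a minimal sketch under the assumption, suggested only by the crash above, that modules placed on "cpu" (or "disk") are the problematic ones. `offloaded_modules` is a hypothetical helper and not part of transformers, accelerate, or bitsandbytes.

```python
from typing import Dict, List, Union

# Device-map targets that keep a module off the GPU.
OFFLOAD_TARGETS = {"cpu", "disk"}


def offloaded_modules(device_map: Union[str, Dict[str, Union[int, str]]]) -> List[str]:
    """List the module names that an explicit device map places off the GPU."""
    if isinstance(device_map, str):
        # Strategies such as "auto" are resolved at load time, so nothing can be listed here.
        return []
    return [name for name, device in device_map.items() if device in OFFLOAD_TARGETS]


if __name__ == "__main__":
    device_map = {"transformer.wte": 0, "transformer.h.0": 0, "transformer.h.1": "cpu"}
    off_gpu = offloaded_modules(device_map)
    if off_gpu:
        print(f"Modules not on a GPU: {off_gpu}")
```

A script like main.py could use the result to fall back to load_in_8bit = False when the list is non-empty.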

ROCM Support

bitsandbytes seems to be hardcoded to search for specific CUDA libraries, which ROCm does not seem to provide in the same way.

/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:86: UserWarning: /root/anaconda3 did not contain libcudart.so as expected! Searching further paths...
warn(
/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:98: UserWarning: /opt/ompi/lib:/opt/rocm/lib:/usr/local/lib: did not contain libcudart.so as expected! Searching further paths...
warn(
/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('CompVis/stable-diffusion-v1-4')}
warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
warn(
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary /root/anaconda3/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/cextension.py:48: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn(

WSL: libcuda.so: cannot open shared object file

Hello, I'm on a completely fresh installation of Ubuntu on Windows Subsystem for Linux (WSL) with CUDA support. What I've done so far:

Installed MiniForge/Conda
Made a new env in a folder
Ran conda install cudatoolkit
Pip installed pytorch, transformers, accelerate, and bitsandbytes
Attempted to run the HF Pipeline demo with 8-bit quantization enabled.

When running a model, I get the following error:

CUDA SETUP: CUDA runtime path found: /home/zaptrem/bigmodels/env/lib/libcudart.so
Traceback (most recent call last):
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py", line 57, in get_cuda_lib_handle
    cuda = ctypes.CDLL("libcuda.so")
  File "/home/zaptrem/bigmodels/env/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcuda.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1030, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/home/zaptrem/bigmodels/env/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 34, in <module>
    from ...modeling_utils import PreTrainedModel
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/transformers/modeling_utils.py", line 88, in <module>
    from .utils.bitsandbytes import get_key_to_not_convert, replace_8bit_linear, set_module_8bit_tensor_to_device
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/transformers/utils/bitsandbytes.py", line 10, in <module>
    import bitsandbytes as bnb
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 4, in <module>
    import bitsandbytes.functional as F
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/bitsandbytes/functional.py", line 14, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/bitsandbytes/cextension.py", line 15, in initialize
    binary_name = evaluate_cuda_setup()
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py", line 130, in evaluate_cuda_setup
    cuda = get_cuda_lib_handle()
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py", line 60, in get_cuda_lib_handle
    raise Exception('CUDA SETUP: ERROR! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!')
Exception: CUDA SETUP: ERROR! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "largemodels.py", line 25, in <module>
    pipe = pipeline(model=name, model_kwargs= {"device_map": "auto", "load_in_8bit": True}, max_new_tokens=max_new_tokens)
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/transformers/pipelines/__init__.py", line 676, in pipeline
    framework, model = infer_framework_load_model(
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/transformers/pipelines/base.py", line 229, in infer_framework_load_model
    _class = getattr(transformers_module, architecture, None)
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1021, in __getattr__
    value = getattr(module, name)
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1020, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/zaptrem/bigmodels/env/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1032, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.bloom.modeling_bloom because of the following error (look up to see its traceback):
CUDA SETUP: ERROR! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!

Like I said before, I ensured this wasn't an environment issue by completely resetting the Linux environment. CUDA works fine otherwise.
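
The traceback shows the failure coming from ctypes.CDLL("libcuda.so"), so the same check can be run in isolation to see which driver library names the dynamic loader can resolve. This is a minimal diagnostic sketch. `find_loadable_library` is a hypothetical helper, and the candidate under /usr/lib/wsl/lib is an assumption about where WSL usually exposes the driver library, so it may not apply to every setup.

```python
import ctypes
from typing import Iterable, Optional


def find_loadable_library(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate name that ctypes can load, or None."""
    for name in candidates:
        try:
            ctypes.CDLL(name)
        except OSError:
            continue  # not resolvable by the dynamic loader, try the next one
        return name
    return None


if __name__ == "__main__":
    candidates = [
        "libcuda.so",                   # the name bitsandbytes tries in the traceback
        "libcuda.so.1",                 # versioned name often present without the plain symlink
        "/usr/lib/wsl/lib/libcuda.so",  # assumed WSL location, verify on your system
    ]
    print("Loadable CUDA driver library:", find_loadable_library(candidates))
```

If only a path outside the loader's search path works, adding its directory to LD_LIBRARY_PATH may be worth trying.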

[v0.33.0] Colab CPU installation gives list index out of range

In a blank Colab CPU notebook, you can install it like this,

!pip install bitsandbytes==0.33.0
import bitsandbytes

but the import fails

[/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/main.py](https://localhost:8080/#) in get_compute_capability(cuda)
    106     if ccs is not None:
    107         # TODO: handle different compute capabilities; for now, take the max
--> 108         return ccs[-1]
    109     return None
    110 

IndexError: list index out of range
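
The failing line indexes ccs[-1] after checking only that ccs is not None, so an empty list (plausibly what a machine without a GPU produces) raises IndexError. A minimal sketch of a guard follows. `pick_compute_capability` is a hypothetical name, and the sketch keeps the original assumption that the list is sorted so that the last entry is the highest.

```python
from typing import List, Optional


def pick_compute_capability(ccs: Optional[List[str]]) -> Optional[str]:
    """Return the last (assumed highest) compute capability, or None if there is none."""
    if not ccs:
        # Covers both None and an empty list, e.g. when no CUDA device is detected.
        return None
    return ccs[-1]


if __name__ == "__main__":
    print(pick_compute_capability([]))
    print(pick_compute_capability(["6.1", "7.5"]))
```

Callers would still need to handle the None result, for example by falling back to the CPU-only binary.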

Here's the example colab notebook

Full traceback

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bitsandbytes==0.33.0
  Downloading bitsandbytes-0.33.0-py3-none-any.whl (55.9 MB)
     |████████████████████████████████| 55.9 MB 214 kB/s 
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.33.0

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA exception! Error code: no CUDA-capable device is detected
CUDA exception! Error code: initialization error

/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
  "WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:99: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain libcudart.so as expected! Searching further paths...
  f'{candidate_env_vars["LD_LIBRARY_PATH"]} did not contain '
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('"172.28.0.3","jupyterArgs"'), PosixPath('true}'), PosixPath('6000,"kernelManagerProxyHost"'), PosixPath('["--ip=172.28.0.2"],"debugAdapterMultiplexerPath"'), PosixPath('{"kernelManagerProxyPort"'), PosixPath('"/usr/local/bin/dap_multiplexer","enableLsp"')}
  "WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
  "WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
  "WARNING: The following directories listed in your path were found to "

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

[<ipython-input-1-e9a23814d8e9>](https://localhost:8080/#) in <module>
      1 get_ipython().system('pip install bitsandbytes==0.33.0')
----> 2 import bitsandbytes

7 frames

[/usr/local/lib/python3.7/dist-packages/bitsandbytes/__init__.py](https://localhost:8080/#) in <module>
      4 # LICENSE file in the root directory of this source tree.
      5 
----> 6 from .autograd._functions import (
      7     MatmulLtState,
      8     bmm_cublas,

[/usr/local/lib/python3.7/dist-packages/bitsandbytes/autograd/_functions.py](https://localhost:8080/#) in <module>
      1 import operator
      2 import torch
----> 3 import bitsandbytes.functional as F
      4 
      5 from dataclasses import dataclass

[/usr/local/lib/python3.7/dist-packages/bitsandbytes/functional.py](https://localhost:8080/#) in <module>
     11 from torch import Tensor
     12 
---> 13 from .cextension import COMPILED_WITH_CUDA, lib
     14 from functools import reduce  # Required in Python 3
     15 

[/usr/local/lib/python3.7/dist-packages/bitsandbytes/cextension.py](https://localhost:8080/#) in <module>
     39 
     40 
---> 41 lib = CUDALibrary_Singleton.get_instance().lib
     42 try:
     43     lib.cadam32bit_g32

[/usr/local/lib/python3.7/dist-packages/bitsandbytes/cextension.py](https://localhost:8080/#) in get_instance(cls)
     35         if cls._instance is None:
     36             cls._instance = cls.__new__(cls)
---> 37             cls._instance.initialize()
     38         return cls._instance
     39 

[/usr/local/lib/python3.7/dist-packages/bitsandbytes/cextension.py](https://localhost:8080/#) in initialize(self)
     13 
     14     def initialize(self):
---> 15         binary_name = evaluate_cuda_setup()
     16         package_dir = Path(__file__).parent
     17         binary_path = package_dir / binary_name

[/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/main.py](https://localhost:8080/#) in evaluate_cuda_setup()
    130     print(f"CUDA SETUP: CUDA runtime path found: {cudart_path}")
    131     cuda = get_cuda_lib_handle()
--> 132     cc = get_compute_capability(cuda)
    133     print(f"CUDA SETUP: Highest compute capability among GPUs detected: {cc}")
    134     cuda_version_string = get_cuda_version(cuda, cudart_path)

[/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/main.py](https://localhost:8080/#) in get_compute_capability(cuda)
    106     if ccs is not None:
    107         # TODO: handle different compute capabilities; for now, take the max
--> 108         return ccs[-1]
    109     return None
    110 

IndexError: list index out of range
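The trace above shows `ccs[-1]` raising even though the `ccs is not None` check passed, which is what happens when `ccs` is an empty list (for example, when no GPU is visible to the process). A minimal sketch of a guard follows; `highest_compute_capability` is a hypothetical helper for illustration, not the library's actual code.

```python
from typing import List, Optional


def highest_compute_capability(ccs: Optional[List[str]]) -> Optional[str]:
    """Return the last entry of ccs, or None when nothing was detected.

    Hypothetical helper; assumes ccs is sorted in ascending order so that
    the last entry is the highest capability.
    """
    # `if ccs` is False for both None and an empty list, so ccs[-1] is
    # only evaluated when at least one capability was found.
    if ccs:
        return ccs[-1]
    return None


no_gpu = highest_compute_capability([])
one_gpu = highest_compute_capability(["6.1", "7.5"])
```

With this guard, an empty detection result yields `None` instead of an `IndexError`, and the caller can decide how to report the missing GPU.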

UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...

The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 1
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--num_cpu_threads_per_process was set to 1 to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:99: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
f'{candidate_env_vars["LD_LIBRARY_PATH"]} did not contain '
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('"172.28.0.3","jupyterArgs"'), PosixPath('{"kernelManagerProxyPort"'), PosixPath('["--ip=172.28.0.2"],"debugAdapterMultiplexerPath"'), PosixPath('true}'), PosixPath('"/usr/local/bin/dap_multiplexer","enableLsp"'), PosixPath('6000,"kernelManagerProxyHost"')}
"WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
"WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
"WARNING: The following directories listed in your path were found to "
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 111
CUDA SETUP: Loading binary /usr/local/lib/python3.7/dist-packages/bitsandbytes/libbitsandbytes_cuda111.so...
Caching latents: 75% 9/12 [00:09<00:03, 1.04s/it]
Traceback (most recent call last):
File "train_dreambooth.py", line 637, in <module>
main()
File "train_dreambooth.py", line 492, in main
for batch in tqdm(train_dataloader, desc="Caching latents"):
File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 721, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "train_dreambooth.py", line 261, in __getitem__
instance_image = Image.open(self.instance_images_path[index % self.num_instance_images])
File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2896, in open
"cannot identify image file %r" % (filename if filename else fp)
PIL.UnidentifiedImageError: cannot identify image file '/content/data/sks/IMG_3819.HEIC'
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=/content/data/sks', '--class_data_dir=/content/data/person', '--output_dir=/content/drive/MyDrive/stable_diffusion_weights/sks', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=photo of sks person', '--class_prompt=photo of a person', '--seed=1337', '--resolution=512', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=12', '--sample_batch_size=4', '--max_train_steps=800']' returned non-zero exit status 1.

fast-DreamBooth error "usr/lib64-nvidia did not contain libcudart.so"

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:99: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
f'{candidate_env_vars["LD_LIBRARY_PATH"]} did not contain '
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('{"kernelManagerProxyPort"'), PosixPath('["--ip=172.28.0.2"],"debugAdapterMultiplexerPath"'), PosixPath('"172.28.0.3","jupyterArgs"'), PosixPath('6000,"kernelManagerProxyHost"'), PosixPath('true}'), PosixPath('"/usr/local/bin/dap_multiplexer","enableLsp"')}
"WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//ipykernel.pylab.backend_inline'), PosixPath('module')}
"WARNING: The following directories listed in your path were found to "
/usr/local/lib/python3.7/dist-packages/bitsandbytes/cuda_setup/paths.py:21: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
"WARNING: The following directories listed in your path were found to "
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 111
CUDA SETUP: Loading binary /usr/local/lib/python3.7/dist-packages/bitsandbytes/libbitsandbytes_cuda111.so...
Steps: 0% 0/800 [00:00<?, ?it/s]Traceback (most recent call last):
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 606, in <module>
main()
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 550, in main
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/accelerate/utils/operations.py", line 507, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.7/dist-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_2d_condition.py", line 254, in forward
encoder_hidden_states=encoder_hidden_states,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/diffusers/models/unet_blocks.py", line 565, in forward
hidden_states = attn(hidden_states, context=encoder_hidden_states)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 167, in forward
hidden_states = block(hidden_states, context=context)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 217, in forward
hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/diffusers/models/attention.py", line 287, in forward
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op)
File "/usr/local/lib/python3.7/dist-packages/xformers/ops.py", line 626, in memory_efficient_attention
return op.apply(query, key, value, attn_bias, p)
File "/usr/local/lib/python3.7/dist-packages/xformers/ops.py", line 264, in forward
causal=causal,
File "/usr/local/lib/python3.7/dist-packages/torch/_ops.py", line 143, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Steps: 0% 0/800 [00:09<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/diffusers/examples/dreambooth/train_dreambooth.py', '--pretrained_model_name_or_path=/content/gdrive/MyDrive/stable-diffusion-v1-4', '--instance_data_dir=/content/data/prak1', '--output_dir=/content/models/prak1', '--instance_prompt=photo of prak1 person', '--seed=4545', '--resolution=512', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--use_8bit_adam', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=800']' returned non-zero exit status 1.

Missing Windows support

Currently, the library uses precompiled Linux binaries. I am unsure how compatible these are with standard PyTorch installs on Windows. The binaries may need to be compiled against mingw32/64 to work on Windows.

The most helpful contribution would be a report from someone who manages to compile from source and use the library on Windows. This will require altering the Makefile. If this works, we can add instructions for compiling on Windows as a first step before a full-scale deployment of Windows binaries on pip.

Since I do not have a Windows machine, any help on this is welcome!

Crash on training

CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.0
CUDA SETUP: Detected CUDA version 111
CUDA SETUP: Loading binary /usr/local/lib/python3.7/dist-packages/bitsandbytes/libbitsandbytes_cuda111_nocublaslt.so...
Steps: 2% 21/1400 [00:48<45:43, 1.99s/it, loss=0.0183, lr=5e-6]Traceback (most recent call last):
File "train_dreambooth.py", line 606, in <module>
main()
File "train_dreambooth.py", line 527, in main
for step, batch in enumerate(train_dataloader):
File "/usr/local/lib/python3.7/dist-packages/accelerate/data_loader.py", line 357, in __iter__
next_batch = next(dataloader_iter)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 721, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "train_dreambooth.py", line 268, in __getitem__
instance_image = Image.open(self.instance_images_path[index % self.num_instance_images])
File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2843, in open
fp = builtins.open(filename, "rb")
IsADirectoryError: [Errno 21] Is a directory: '/content/data/Paula/.ipynb_checkpoints'
Steps: 2% 21/1400 [00:48<53:19, 2.32s/it, loss=0.0183, lr=5e-6]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--use_auth_token', '--instance_data_dir=/content/data/Paula', '--class_data_dir=/content/data/woman', '--output_dir=/content/models/PaulaOut', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=Paulina Gajewska', '--class_prompt=woman', '--seed=1393795', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=no', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=12', '--sample_batch_size=4', '--max_train_steps=1400']' returned non-zero exit status 1.
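The `IsADirectoryError` above is raised when the dataset tries to open `.ipynb_checkpoints`, a directory that Jupyter creates inside the instance data folder. One way to avoid it is to collect only regular files with an image extension before building the dataset. The sketch below is an assumption about how the folder could be filtered; `list_image_files` and the extension set are hypothetical and not part of `train_dreambooth.py`.

```python
import tempfile
from pathlib import Path

# Assumed set of extensions; adjust to whatever your PIL build can open.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def list_image_files(root):
    """Return sorted regular files under root whose suffix looks like an image."""
    return sorted(
        p for p in Path(root).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


# Small demonstration on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ".ipynb_checkpoints").mkdir()
    (Path(tmp) / "photo_1.jpg").write_bytes(b"")
    (Path(tmp) / "IMG_3819.HEIC").write_bytes(b"")
    kept = [p.name for p in list_image_files(tmp)]
```

The same filter would also skip files such as the `.HEIC` image that PIL could not identify in an earlier trace; those would need converting to a supported format first.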

Returned non-zero exit status 1. (Already using WSL 2) can someone help please?

(diffusers) zerocool@DESKTOP-IFR124:~/github/diffusers/examples/dreambooth$ ./my_training.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `12` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
WARNING:root:Blocksparse is not available: the current GPU does not expose Tensor cores

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:86: UserWarning: /home/zerocool/anaconda3/envs/diffusers did not contain libcudart.so as expected! Searching further paths...
  warn(
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:98: UserWarning: /usr/lib/wsl/lib: did not contain libcudart.so as expected! Searching further paths...
  warn(
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('CompVis/stable-diffusion-v1-4')}
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
Caching latents: 100%|██████████████████████████████████████████████████████████████████| 27/27 [00:03<00:00,  7.42it/s]
Traceback (most recent call last):
  File "/home/zerocool/github/diffusers/examples/dreambooth/train_dreambooth.py", line 637, in <module>
    main()
  File "/home/zerocool/github/diffusers/examples/dreambooth/train_dreambooth.py", line 533, in main
    accelerator.init_trackers("dreambooth", config=vars(args))
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py", line 1061, in init_trackers
    tracker_init(project_name, self.logging_dir, **init_kwargs.get(str(tracker), {}))
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/tracking.py", line 133, in __init__
    self.writer = tensorboard.SummaryWriter(self.logging_dir, **kwargs)
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/utils/tensorboard/writer.py", line 246, in __init__
    self._get_file_writer()
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/utils/tensorboard/writer.py", line 276, in _get_file_writer
    self.file_writer = FileWriter(
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/utils/tensorboard/writer.py", line 75, in __init__
    self.event_writer = EventFileWriter(
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/tensorboard/summary/writer/event_file_writer.py", line 72, in __init__
    tf.io.gfile.makedirs(logdir)
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 900, in makedirs
    return get_filesystem(path).makedirs(path)
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 201, in makedirs
    os.makedirs(path, exist_ok=True)
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: 'my_model/logs'
Traceback (most recent call last):
  File "/home/zerocool/anaconda3/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zerocool/anaconda3/envs/diffusers/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=training', '--output_dir=my_model', '--instance_prompt=beaninstance', '--class_prompt=guy', '--resolution=512', '--train_batch_size=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--use_8bit_adam', '--max_train_steps=1000']' returned non-zero exit status 1.

WARNING: The following directories listed in your path were found to be non-existent

(diffusers) zerocool@DESKTOP-IFR8E96:~/github/diffusers/examples/dreambooth$ ./my_training_press.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `12` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
WARNING:root:Blocksparse is not available: the current GPU does not expose Tensor cores

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:86: UserWarning: /home/zerocool/anaconda3/envs/diffusers did not contain libcudart.so as expected! Searching further paths...
  warn(
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:98: UserWarning: /usr/lib/wsl/lib: did not contain libcudart.so as expected! Searching further paths...
  warn(
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('CompVis/stable-diffusion-v1-4')}
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
Caching latents:  18%|██████████▋                                                  | 1771/10110 [06:46<31:51,  4.36it/s]

Can I ignore these warnings?

Tesla A100 40gb on Colab training ERROR

I finally got an A100 40GB on Colab, but this error appeared during training :(

The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 1
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--num_cpu_threads_per_process was set to 6 to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

about the requirements of GPU

Currently I only have two types of GPUs:
GP102 [GeForce GTX 1080 Ti]
and
GM200GL [Tesla M40]

In this case, I cannot use this library, right?
Or could anyone tell me whether these GPUs are supported?
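The logs quoted in these reports give a hint: GPUs reporting compute capability 6.0, 6.1, and 7.0 load a `_nocublaslt` binary, while a 7.5 GPU loads the regular one. The sketch below reproduces that naming pattern. The 7.5 threshold is inferred from those logs, `binary_name` is a hypothetical helper, and the sketch says nothing about which features actually work on a given card.

```python
def binary_name(cuda_version: str, compute_capability: str) -> str:
    """Build a library file name following the pattern seen in the logs."""
    major, minor = (int(part) for part in compute_capability.split("."))
    # Assumption inferred from the logs: 7.5 and newer get the cuBLASLt build.
    has_cublaslt = (major, minor) >= (7, 5)
    suffix = "" if has_cublaslt else "_nocublaslt"
    return f"libbitsandbytes_cuda{cuda_version}{suffix}.so"


names = [
    binary_name("111", "7.5"),
    binary_name("117", "6.1"),
    binary_name("113", "7.0"),
]
```

The GTX 1080 Ti has compute capability 6.1 and the Tesla M40 has 5.2, so under this assumption both would map to a `_nocublaslt` binary.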

Error when importing bnb.nn.Linear8bitLt

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.3/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: TODO: compile library for specific version: libbitsandbytes_cuda113_nocublaslt.so
CUDA SETUP: Defaulting to libbitsandbytes.so...
CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
Traceback (most recent call last):
  File "1.py", line 1, in <module>
    import bitsandbytes as bnb
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/lisong39/python_packs/lib/python3.8/site-packages/bitsandbytes-0.32.1-py3.8.egg/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/lisong39/python_packs/lib/python3.8/site-packages/bitsandbytes-0.32.1-py3.8.egg/bitsandbytes/autograd/_functions.py", line 4, in <module>
    import bitsandbytes.functional as F
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/lisong39/python_packs/lib/python3.8/site-packages/bitsandbytes-0.32.1-py3.8.egg/bitsandbytes/functional.py", line 14, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/lisong39/python_packs/lib/python3.8/site-packages/bitsandbytes-0.32.1-py3.8.egg/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/lisong39/python_packs/lib/python3.8/site-packages/bitsandbytes-0.32.1-py3.8.egg/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-speech/users/lisong39/python_packs/lib/python3.8/site-packages/bitsandbytes-0.32.1-py3.8.egg/bitsandbytes/cextension.py", line 27, in initialize
    raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!

Can't find libcudart.so

I am running on Windows, using Miniconda3 and Python 3.9.

I have cudatoolkit, cudnn, pytorch, transformers, accelerate, bitsandbytes, and their dependencies installed via conda.

When attempting to run a simple test script:

# Repro script: load BLOOM in 8-bit and generate a short continuation.
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BloomModel, BloomConfig
from transformers import BloomTokenizerFast, BloomForCausalLM
from transformers import BloomForTokenClassification

# Load the tokenizer and the 8-bit quantized model from a local checkpoint.
tokenizer = AutoTokenizer.from_pretrained("E:/MLModels/bloom")
model = BloomForCausalLM.from_pretrained("E:/MLModels/bloom", device_map="auto", load_in_8bit=True)

prompt = "It was a dark and stormy night"
result_length = 50
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to result_length tokens from the prompt.
tokenout = model.generate(inputs["input_ids"], max_length=result_length)

I see this error when running from a vscode session:

Exception has occurred: RuntimeError
Failed to import transformers.models.bloom.modeling_bloom because of the following error (look up to see its traceback):
argument of type 'WindowsPath' is not iterable

and this output:

WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('C'), WindowsPath('/ProgramData/Miniconda3/envs/llm/lib')}
C:\ProgramData\Miniconda3\envs\llm\lib\site-packages\bitsandbytes\cuda_setup\paths.py:97: UserWarning: C:\ProgramData\Miniconda3\envs\llm did not contain libcudart.so as expected! Searching further paths...
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!   
CUDA SETUP: Loading binary C:\ProgramData\Miniconda3\envs\llm\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...

I see that it's searching for libcudart.so, which does not exist on my machine.

Is this file supposed to exist on Windows? Or do I need to do some trickery to get this working on Windows?
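The `WindowsPath('C')` and `WindowsPath('/ProgramData/Miniconda3/envs/llm/lib')` entries in the warning are consistent with a search-path string being split on `:`, the POSIX separator, which cuts a Windows path at the drive letter. This is an inference from the output, not a confirmed reading of the library code. A minimal sketch of the effect, with the example path modeled on the one in the warning:

```python
from pathlib import PureWindowsPath

# Example value modeled on the path in the warning above.
search_path = r"C:\ProgramData\Miniconda3\envs\llm\lib"

# Splitting on ":" (the POSIX separator) cuts the path at the drive letter.
posix_style = [PureWindowsPath(part) for part in search_path.split(":")]

# Splitting on ";" (the Windows separator) keeps the path whole.
windows_style = [PureWindowsPath(part) for part in search_path.split(";")]
```

In portable code, `os.pathsep` gives the correct separator for the current platform.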

bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.

Hi, I'm trying to use the 8-bit optimizer with an A100 in an OpenPBS environment. The machine where I install the Python virtual environment with the bitsandbytes library doesn't have a GPU, but the machine that runs the submitted job does.
Hence, I'm getting this error:
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.

I suspect this is because I installed the library on the machine without a GPU. How can I install it with GPU support on a machine without a GPU? I'm using cu116, by the way.

PS: When I submit the job, the target machine doesn't have internet access, so it cannot fetch the library or anything else online.

Thanks in advance

Fine tuning with int8 and NLP models...is stable embedding needed?

Thanks for the great work on the optimizer quantization!
I'm trying to fine-tune a T5 model using 8-bit Adam, but I'm finding that the validation loss is significantly worse (roughly 10x) than with BFloat16 or FP32 optimizer states.
It does train stably, in that the loss improves steadily, but the starting loss is so far behind that it's not practical.

I'm wondering whether we need to use the stable embeddings for T5 fine-tuning. If so, how do we do that without negatively affecting the already trained embeddings?

Or are the stable embeddings designed solely for the 'train from scratch' scenario, and this high loss is due to other factors (e.g. T5 was trained in BFloat16 instead of FP32)?
Thanks for any insights!

How to recover the float16 weight?

Say I have a Linear8bitLt module with an int8 weight on GPU, converted from an nn.Linear module with a float16 weight. How can I restore the float16 weight so that I can do some customized computation that int8 does not support?
The blog post A Gentle Introduction to 8-bit Matrix Multiplication mentions:
"You might also wonder how to retrieve the FP16 weights in order to perform the outlier MatMul in fp16? You can simply do:
(int8_model[0].weight.CB * int8_model[0].weight.SCB) / 127"
This method does not work for me because both weight.CB and weight.SCB are None. I also tried:
(int8_model[0].state.CxB * int8_model[0].state.SCB) / 127
but the result does not match the original float16 weight.

FYI, the Linear8bitLt module here is from EleutherAI/gpt-neox-20b - Hugging Face: GPTNeoXForCausalLM.gpt_neox.layers[0].attention.dense

Da Xiao
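For reference, the arithmetic behind the quoted formula `(CB * SCB) / 127` can be sketched in plain Python. This assumes row-wise absmax quantization, where `SCB` holds the absolute maximum of each row and `CB` holds the int8 codes; `quantize_rowwise` and `dequantize_rowwise` are hypothetical helpers, not bitsandbytes functions, and the sketch does not explain why `state.CxB` gives a mismatched result.

```python
def quantize_rowwise(weight):
    """Row-wise absmax quantization into the int8 range [-127, 127]."""
    cb, scb = [], []
    for row in weight:
        absmax = max(abs(x) for x in row)
        scale = absmax if absmax > 0 else 1.0  # avoid division by zero
        cb.append([round(x / scale * 127) for x in row])
        scb.append(scale)
    return cb, scb


def dequantize_rowwise(cb, scb):
    """Apply (CB * SCB) / 127 row by row."""
    return [[q * s / 127 for q in row] for row, s in zip(cb, scb)]


weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]
cb, scb = quantize_rowwise(weight)
restored = dequantize_rowwise(cb, scb)

# Largest element-wise difference between original and restored weights.
max_error = max(
    abs(a - b)
    for row_a, row_b in zip(weight, restored)
    for a, b in zip(row_a, row_b)
)
```

The restored values are only an approximation: each element can be off by up to half a quantization step, that is `SCB / 254` for its row.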

Has anyone tried this library with stablediffusion?

I'm just curious if anyone's tried this library for quantization with stable diffusion models?

NameError: name 'str2optimizer8bit_blockwise' is not defined

I got this error when running DreamBooth training from here

chiron    | /usr/local/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//github.com/pypa/get-pip/raw/5eaac1050023df1f5c98b173b248c260023f2278/public/get-pip.py')}
chiron    |   warn(
chiron    | CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
chiron    | /usr/local/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
chiron    |   warn(
chiron    | WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
chiron    | CUDA SETUP: Loading binary /usr/local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
chiron    | /usr/local/lib/python3.9/site-packages/bitsandbytes/cextension.py:48: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
chiron    |   warn(
Steps:   0%|          | 0/800 [00:01<?, ?it/s, loss=0.183, lr=5e-6]Traceback (most recent call last):
chiron    |   File "/app/train_dreambooth.py", line 592, in <module>
chiron    |     main()
chiron    |   File "/app/train_dreambooth.py", line 560, in main
chiron    |     optimizer.step()
chiron    |   File "/usr/local/lib/python3.9/site-packages/accelerate/optimizer.py", line 140, in step
chiron    |     self.optimizer.step(closure)
chiron    |   File "/usr/local/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
chiron    |     return wrapped(*args, **kwargs)
chiron    |   File "/usr/local/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
chiron    |     return func(*args, **kwargs)
chiron    |   File "/usr/local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
chiron    |     return func(*args, **kwargs)
chiron    |   File "/usr/local/lib/python3.9/site-packages/bitsandbytes/optim/optimizer.py", line 265, in step
chiron    |     self.update_step(group, p, gindex, pindex)
chiron    |   File "/usr/local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
chiron    |     return func(*args, **kwargs)
chiron    |   File "/usr/local/lib/python3.9/site-packages/bitsandbytes/optim/optimizer.py", line 506, in update_step
chiron    |     F.optimizer_update_8bit_blockwise(
chiron    |   File "/usr/local/lib/python3.9/site-packages/bitsandbytes/functional.py", line 858, in optimizer_update_8bit_blockwise
chiron    |     str2optimizer8bit_blockwise[optimizer_name][0](
chiron    | NameError: name 'str2optimizer8bit_blockwise' is not defined
Steps:   0%|          | 0/800 [00:01<?, ?it/s, loss=0.183, lr=5e-6]
chiron    | Traceback (most recent call last):
chiron    |   File "/usr/local/bin/accelerate", line 8, in <module>
chiron    |     sys.exit(main())
chiron    |   File "/usr/local/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
chiron    |     args.func(args)
chiron    |   File "/usr/local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
chiron    |     simple_launcher(args)
chiron    |   File "/usr/local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
chiron    |     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

Result from python -m bitsandbytes:

++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = False
COMPUTE_CAPABILITIES_PER_GPU = ['7.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Running a quick check that:
    + library is importable
    + CUDA function is callable

name 'str2optimizer32bit' is not defined

run inside docker on VM linux ubuntu 20, 64GB, Intel(R) Xeon(R) CPU @ 2.20GHz, Nvidia GV100GL [Tesla V100 SXM2 16GB]
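
The log above shows that no libcudart.so was found and that the CPU-only binary was loaded, which matches COMPILED_WITH_CUDA = False. As a first diagnostic, a small script can list which directories actually contain the CUDA runtime inside the container. This is a minimal sketch, not the library's own search logic; `find_cudart` is a hypothetical helper, and the default directory is simply the one named in the log.

```python
import os
from pathlib import Path
from typing import Iterable, List, Mapping, Optional


def find_cudart(
    extra_dirs: Iterable[str] = ("/usr/local/cuda/lib64",),
    env: Optional[Mapping[str, str]] = None,
) -> List[Path]:
    """Return every libcudart.so* found in LD_LIBRARY_PATH and extra_dirs."""
    env = os.environ if env is None else env
    dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    dirs.extend(extra_dirs)
    hits: List[Path] = []
    for d in dirs:
        p = Path(d)
        if p.is_dir():  # skip entries that do not exist
            hits.extend(sorted(p.glob("libcudart.so*")))
    return hits


found = find_cudart()
print(found if found else "no libcudart.so found in the searched directories")
```

If the list is empty inside the Docker image, the CUDA runtime libraries are probably not installed or not on the library path there, regardless of what the host provides.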

❌ Inference Error: invalid configuration argument in ops.cu

I am able to load an 8-bit optimized BLOOM model into memory, provided by https://huggingface.co/joaoalvarenga/bloom-8bit. Unfortunately, I'm receiving the error message:

Error invalid configuration argument at line 69 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu

Could it be that my hardware does not meet the bitsandbytes CUDA prerequisites?
My system:

8x NVIDIA RTX A6000
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 113

Any insights would be greatly appreciated.

More information on training from scratch and finetuning

Thanks for the great work!

I am looking for additional information on using the library to train a model from scratch or to fine-tune one. The only information I could find was in the appendices of the corresponding paper. Specifically:

How many GPUs (and which) were used to train the RoBERTa model from scratch in Appendix D?
How many GPUs (and which) were used to fine-tune the RoBERTa-large model in the fine-tuning section in Appendix E?
Has any calculation been done on how many GPUs are needed to train models with different numbers of parameters with the library?
If you have any additional or updated insights on training from scratch or fine-tuning, that would be wonderful!

Thanks for your help!

Could you give a code example on how to reproduce the results of c4 perplexity as in the LLM.int8 paper?

I'm new to Hugging Face, and I would be grateful if you could give me some hints on evaluating on the C4 dataset or the other datasets in your paper.

Meaning of the "Int8 absmax row-wise + decomposition" combination in the paper?

Hi, I have a problem understanding the "Int8 absmax row-wise + decomposition" entry in Table 1. Does it mean "Absmax LLM.int8() (row-wise + decomp)"? Because it does not contain the "LLM.int8()" keyword, I'm wondering whether it refers to some other combination.
Thanks!
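
To make the two ingredients in that entry concrete, here is a minimal sketch of a matrix product that combines row-wise absmax quantization with outlier decomposition. It does not settle which method the table entry refers to, and it is not the paper's or the library's code. The function names are hypothetical, the threshold of 6.0 is only a default chosen for the example, and using a single scale for the whole weight matrix is an assumption made to keep the sketch short.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain floating-point matrix product a @ b, used as the reference."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def int8_matmul_with_decomposition(x: Matrix, w: Matrix, threshold: float = 6.0) -> Matrix:
    """Compute x @ w with int8 math, keeping outlier feature dimensions in float."""
    n, k, m = len(x), len(w), len(w[0])
    # A feature dimension is an outlier if any input row exceeds the threshold there.
    outliers = [j for j in range(k) if any(abs(row[j]) >= threshold for row in x)]
    regular = [j for j in range(k) if j not in outliers]

    # Row-wise absmax quantization of the regular part of x (one scale per row).
    x_reg = [[row[j] for j in regular] for row in x]
    x_scales = [max((abs(v) for v in row), default=0.0) or 1.0 for row in x_reg]
    x_q = [[round(v * 127 / s) for v in row] for row, s in zip(x_reg, x_scales)]

    # Assumption: one absmax scale for the whole regular part of w.
    w_reg = [w[j] for j in regular]
    w_scale = max((abs(v) for r in w_reg for v in r), default=0.0) or 1.0
    w_q = [[round(v * 127 / w_scale) for v in r] for r in w_reg]

    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for c in range(m):
            # Integer accumulation, then a single rescale per output element.
            acc = sum(x_q[i][t] * w_q[t][c] for t in range(len(regular)))
            int8_part = acc * x_scales[i] * w_scale / (127 * 127)
            # Outlier dimensions bypass quantization entirely.
            float_part = sum(x[i][j] * w[j][c] for j in outliers)
            out[i][c] = int8_part + float_part
    return out


x = [[0.5, -0.2, 8.0], [0.1, 0.3, -0.4]]  # column 2 holds an outlier (8.0)
w = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]
print(matmul(x, w))
print(int8_matmul_with_decomposition(x, w))
```

Without the decomposition, the 8.0 in the first row would set that row's scale and squeeze the small values into very few int8 levels, which is the effect the float path avoids.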

Cannot use int8

I tried to use 8x A100 to run BLOOM, but I cannot use load_in_8bit. I followed the instructions here and loaded the model with model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', load_in_8bit=True, max_memory=max_memory). If I omit max_memory=max_memory, most of the memory goes to gpu:0 and I get a CUDA out of memory error. If I pass max_memory=max_memory, it throws "8-bit operation are not supported under CPU".
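
One thing worth checking in this situation is whether the memory budget is so tight that part of the model is placed on the CPU, since the error message points at CPU execution. The sketch below builds a per-GPU `max_memory` mapping and lists modules that a device map assigns to `cpu` or `disk`. Both helper names are hypothetical. In a real run, the free-byte values could come from `torch.cuda.mem_get_info(i)[0]` and the device map from `model.hf_device_map`; the 2 GB headroom is an arbitrary choice for the example.

```python
from typing import Dict, List, Union

DeviceMap = Dict[str, Union[int, str]]


def build_max_memory(free_bytes_per_gpu: List[int], headroom_gb: int = 2) -> Dict[int, str]:
    """Map each GPU index to a budget string such as '38GB', leaving some headroom."""
    budget: Dict[int, str] = {}
    for index, free_bytes in enumerate(free_bytes_per_gpu):
        free_gb = free_bytes // 1024**3
        budget[index] = f"{max(free_gb - headroom_gb, 0)}GB"
    return budget


def offloaded_modules(device_map: DeviceMap) -> List[str]:
    """Names of modules placed on 'cpu' or 'disk' instead of a GPU index."""
    return [name for name, device in device_map.items() if device in ("cpu", "disk")]


# Toy numbers: eight GPUs with 40 GiB free each.
print(build_max_memory([40 * 1024**3] * 8))
# Toy device map: one block ended up on the CPU.
print(offloaded_modules({"transformer.h.0": 0, "transformer.h.1": "cpu"}))
```

If `offloaded_modules` returns anything for the real device map, the budget leaves too little room for the whole model on the GPUs, and raising the per-GPU limit is one thing to try.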

OSError: libcublas.so.11: cannot open shared object file: No such file or directory

My environment is:
Linux system: CentOS 7
CUDA: 10.2
torch: 1.11.0
bitsandbytes: 0.34.0

Then the error is:

import bitsandbytes as bnb

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/xxx/anaconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/paths.py:86: UserWarning: /xxx/anaconda3 did not contain libcudart.so as expected! Searching further paths...
  warn(
/xxx/anaconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('printf "\\033]0;%s@%s'), PosixPath('%s\\007" "${USER}" "${HOSTNAME%%.*}" "${PWD/#$HOME/~}";/etc/sysconfig/bash-prompt-history')}
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: CUDA version lower than 11 are currenlty not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: Detected CUDA version 102
CUDA SETUP: Loading binary /xxx/anaconda3/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda102_nocublaslt.so...
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-1-b6686ac3a796> in <module>
----> 1 import bitsandbytes as bnb

~/xxx/anaconda3/lib/python3.8/site-packages/bitsandbytes/__init__.py in <module>
      4 # LICENSE file in the root directory of this source tree.
      5
----> 6 from .autograd._functions import (
      7     MatmulLtState,
      8     bmm_cublas,

~/xxx/anaconda3/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py in <module>
      3
      4 import torch
----> 5 import bitsandbytes.functional as F
      6
      7 from dataclasses import dataclass

~/xxx/anaconda3/lib/python3.8/site-packages/bitsandbytes/functional.py in <module>
     11 from torch import Tensor
     12
---> 13 from .cextension import COMPILED_WITH_CUDA, lib
     14 from functools import reduce  # Required in Python 3
     15

~/xxx/anaconda3/lib/python3.8/site-packages/bitsandbytes/cextension.py in <module>
     39
     40
---> 41 lib = CUDALibrary_Singleton.get_instance().lib
     42 try:
     43     lib.cadam32bit_g32

~/xxx/anaconda3/lib/python3.8/site-packages/bitsandbytes/cextension.py in get_instance(cls)
     35         if cls._instance is None:
     36             cls._instance = cls.__new__(cls)
---> 37             cls._instance.initialize()
     38         return cls._instance
     39

~/xxx/anaconda3/lib/python3.8/site-packages/bitsandbytes/cextension.py in initialize(self)
     29         else:
     30             print(f"CUDA SETUP: Loading binary {binary_path}...")
---> 31             self.lib = ct.cdll.LoadLibrary(binary_path)
     32
     33     @classmethod

~/xxx/anaconda3/lib/python3.8/ctypes/__init__.py in LoadLibrary(self, name)
    457
    458     def LoadLibrary(self, name):
--> 459         return self._dlltype(name)
    460
    461 cdll = LibraryLoader(CDLL)

~/xxx/anaconda3/lib/python3.8/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    379
    380         if handle is None:
--> 381             self._handle = _dlopen(self._name, mode)
    382         else:
    383             self._handle = handle

OSError: libcublas.so.11: cannot open shared object file: No such file or directory

I cannot upgrade CUDA to 11.x. How should I solve this issue?
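
The traceback ends where the dynamic loader fails to resolve libcublas.so.11 while opening the bitsandbytes binary. A quick way to see which CUDA libraries the loader can resolve in a given environment, before importing bitsandbytes at all, is to try opening them with ctypes. This is only a diagnostic sketch, not a fix; `can_load` is a hypothetical helper, and the library names in the loop are examples taken from or modelled on the log.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


for name in ("libcudart.so", "libcublas.so.11", "libcublasLt.so.11"):
    print(f"{name}: {'found' if can_load(name) else 'missing'}")
```

A "missing" result for libcublas.so.11 would be consistent with a CUDA 10.2 installation, which ships an older cuBLAS version.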

quantizer documentation

Thanks for the amazing work.
I am reading the documentation, and it says: 8-bit quantization: Quantile, Linear, and Dynamic quantization.
In the code I also see block-wise quantization.
Could you please share some documentation or pointers for these quantizers? I am also curious whether these quantizers preserve not just matrix multiplication results but also distance metrics such as Euclidean, inner product (IP), and cosine. This would be helpful for ANN (approximate nearest neighbour) techniques as well.
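
As a rough illustration of the block-wise idea and of how one might measure distance preservation, the sketch below quantizes two vectors in blocks with a linear absmax map and compares Euclidean distance and cosine similarity before and after. It is an assumption-laden toy: the linear map stands in for the library's dynamic and quantile code books, the block size of 4 is far smaller than anything practical, and the function names are hypothetical. It measures the effect on one example and proves nothing in general.

```python
import math
from typing import List, Tuple


def quantize_blockwise(values: List[float], block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Linear absmax int8 quantization with one scale per block of values."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # guard against an all-zero block
        scales.append(absmax)
        codes.extend(round(v * 127 / absmax) for v in block)
    return codes, scales


def dequantize_blockwise(codes: List[int], scales: List[float], block_size: int = 4) -> List[float]:
    """Map each code back to a float using the scale of its block."""
    return [c * scales[i // block_size] / 127 for i, c in enumerate(codes)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


a = [0.1, -0.4, 0.35, 0.8, 12.0, -3.0, 0.5, 1.5]
b = [0.2, 0.1, -0.3, 0.7, 10.0, -2.5, 0.4, 1.0]
a_hat = dequantize_blockwise(*quantize_blockwise(a))
b_hat = dequantize_blockwise(*quantize_blockwise(b))

print("euclidean:", math.dist(a, b), "->", math.dist(a_hat, b_hat))
print("cosine:   ", cosine(a, b), "->", cosine(a_hat, b_hat))
```

Because each block has its own scale, the large values in the second block do not degrade the resolution of the small values in the first block, which is the motivation for quantizing in blocks.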

pip install bitsandbytes-cuda111 is not working

Confusing project homepage

Hello!

The bitsandbytes Project Homepage as displayed on https://pypi.org/project/bitsandbytes/ does not make much sense: it points to http://packages.python.org/bitsandbytes, which is not accessible.

I would expect this URL to point to either https://github.com/TimDettmers/bitsandbytes or https://bitsandbytes.readthedocs.io/en/latest/ (although the latter feels like it's left over from the outdated facebookresearch repo).

It would involve changing this line:

bitsandbytes/setup.py

Line 27 in 9b5f2ed

url="http://packages.python.org/bitsandbytes",

Tom Aarsen

lib64-nvidia did not contain libcudart.so

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

Different batch sizes lead to different inference results

Hi,

I found that when setting load_in_8bit=True, different batch sizes will lead to very different results, even if I'm doing inference-only. I found this phenomenon for several HF pretrained language models with int8.
A simple example is as follows; I get very different results when comparing out1 and out2.

Thank you!

GPU: 1 RTX3090, Driver version: 470.103.01
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 114

from transformers import GPT2Tokenizer, AutoModelForCausalLM
import torch

tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b",
	device_map='auto', load_in_8bit=True)

#model.cuda()
model.eval()


@torch.no_grad()
def do_inference(model, input_ids, attention_mask):
  outputs = model(input_ids=input_ids.cuda(), attention_mask=attention_mask.cuda())
  return outputs.logits.cpu()


batch_sents = [
 'Review: luminous interviews and amazingly evocative film from three decades ago \nSentiment:',
 'Review: with fewer gags to break the tedium \nSentiment:',
 'Review: aims for poetry and ends up sounding like satire \nSentiment:',
 'Review: no way original \nSentiment:'
 ]
enc_inputs = tokenizer(batch_sents, return_tensors='pt', padding=True)


# run inference with batch_size = 2
out1 = []
for i in range(0, len(batch_sents), 2):
  out = do_inference(model, enc_inputs['input_ids'][i:i+2], enc_inputs['attention_mask'][i:i+2])
  out1.append(out)
out1 = torch.cat(out1)

# run inference with batch_size = 4
out2 = do_inference(model, enc_inputs['input_ids'], enc_inputs['attention_mask'])

print(torch.abs(out1-out2).max()) #got tensor(2.0664, dtype=torch.float16) on my machine

Using Nerdy Rodent's dreamlab training, I get a CUDA error during training

I am following Nerdy Rodent's dreamlab local install video step by step, and at the end bitsandbytes gives an error. I tried reloading all the CUDA stuff and tried the newer CUDA 11.8, which differs from the version in the video, and I still get the same error:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:86: UserWarning: /home/user/anaconda3/envs/diffusers did not contain libcudart.so as expected! Searching further paths...
warn(
/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('CompVis/stable-diffusion-v1-4')}
warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
Traceback (most recent call last):
  File "/home/user/github/diffusers/examples/dreambooth/train_dreambooth.py", line 657, in <module>
    main()
  File "/home/user/github/diffusers/examples/dreambooth/train_dreambooth.py", line 446, in main
    import bitsandbytes as bnb
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py", line 15, in initialize
    binary_name = evaluate_cuda_setup()
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 132, in evaluate_cuda_setup
    cc = get_compute_capability(cuda)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 105, in get_compute_capability
    ccs = get_compute_capabilities(cuda)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 83, in get_compute_capabilities
    check_cuda_result(cuda, cuda.cuDeviceGetCount(ctypes.byref(nGpus)))
AttributeError: 'NoneType' object has no attribute 'cuDeviceGetCount'
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/anaconda3/envs/diffusers/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--use_auth_token', '--instance_data_dir=training', '--output_dir=classes', '--instance_prompt=A sks dog', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=no', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--sample_batch_size=4', '--max_train_steps=800']' returned non-zero exit status 1.

def get_potentially_lib_path_containing_env_vars() -> Dict[str, str]:
    return {
        env_var: value
        for env_var, value in os.environ.items()
        if is_relevant_candidate_env_var(env_var, value)
    }


def remove_non_existent_dirs(candidate_paths: Set[Path]) -> Set[Path]:
    non_existent_directories: Set[Path] = {
        path for path in candidate_paths if not path.exists()
    }

    if non_existent_directories:
        warn(
            "WARNING: The following directories listed in your path were found to "
            f"be non-existent: {non_existent_directories}"
        )

    return candidate_paths - non_existent_directories
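
The odd entries in the warnings quoted in these reports, such as PosixPath('https') and PosixPath('//github.com/...'), look like what happens when the value of an unrelated environment variable is split on ":" as if it were a list of directories. The sketch below reproduces that effect under this assumption; `split_candidate_paths` is a hypothetical stand-in, not the library's function.

```python
from pathlib import PurePosixPath
from typing import Set


def split_candidate_paths(value: str, sep: str = ":") -> Set[PurePosixPath]:
    """Split an environment-variable value as if it were a PATH-like list."""
    return {PurePosixPath(part) for part in value.split(sep) if part}


# A URL stored in some environment variable, as seen in the first warning above.
url = "https://github.com/pypa/get-pip/raw/5eaac1050023df1f5c98b173b248c260023f2278/public/get-pip.py"
parts = split_candidate_paths(url)
print(parts)
```

Seen this way, the "non-existent directories" warning is noise from environment variables that merely contain a colon or a slash, and it is separate from the actual failure to find libcudart.so or libcuda.so.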
