deep-diver / llm-as-chatbot Goto Github PK

View Code? Open in Web Editor NEW

3.2K 3.2K 381.0 19.98 MB

LLM as a Chatbot Service

License: Apache License 2.0

Python 97.79% Jupyter Notebook 2.21%

llm-as-chatbot's Introduction

llm-as-chatbot's People

Stargazers

Watchers

Forkers

simboyz denisefan28 gururise maralski aliasfoxkde gogit2194 techthiyanes stevendbennett matbee-eth zurichrain bizism s1530129650 jakoblorz yankuo111 lapnd vinhjaxt tianqingseforu yunho0130 uakbr pinyoothotaboot hertera1 cartertsai effort99 shendsaliaga goooice buluzhai ailabteam eltociear cedrickchee nitcanken codeaudit nickborisenko wilfoderek rakawae aliskf skevin-dev vitalhakiz cyd3nt hsyngmtrk itechnology-rs dardanisufi95 aymenfurter grationmi alesbetik jaedukseo pseudobit ippatev captain320 krafime n1ckfg lapsule artzip tchigher ypandagit hhy5277 hoooon89 wizrds pami0000 colinane kiminh valentindahlin goswamig sydulamin10 4thgen plaban1981 ilokalin qqq-tech jianantian abhimore19 idoganzer summer-joker857 jackusa kaojinz jqiang125 caleb-chaineffect kephir4eg godmapper b1sounours dbonattoj slidersun masoudhashemi ai-jie01 gitsrc hamductai1031 web3mirror c00renut vladsokolovv9 ryanwangzf alexlavoie42 ssemiya edgartapara yibit hyojunguy rexiome mragungsetiaji k0ntra203 rooben-me rioncarter mxcoldhit xargs007

llm-as-chatbot's Issues

Is there something I am doing wrong?

here is my command
python app.py --base_url decapoda-research/llama-13b-hf --ft_ckpt_url chansung/alpaca-lora-13b --port 6006

I have no idea why chatbot give me wrong response.
there is some mistake?

future usage error

Running on local URL: http://0.0.0.0:6006

To create a public link, set share=True in launch().
너는 행복하니?
RetryError[<Future at 0x7f3544954520 state=finished raised TypeError>]
Traceback (most recent call last):
File "/home1/irname/Alpaca-LoRA-Serve/koalpaca/lib/python3.9/site-packages/gradio/routes.py", line 384, in run_predict
output = await app.get_blocks().process_api(
File "/home1/irname/Alpaca-LoRA-Serve/koalpaca/lib/python3.9/site-packages/gradio/blocks.py", line 1032, in process_api
result = await self.call_function(
File "/home1/irname/Alpaca-LoRA-Serve/koalpaca/lib/python3.9/site-packages/gradio/blocks.py", line 858, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home1/irname/Alpaca-LoRA-Serve/koalpaca/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home1/irname/Alpaca-LoRA-Serve/koalpaca/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home1/irname/Alpaca-LoRA-Serve/koalpaca/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home1/irname/Alpaca-LoRA-Serve/koalpaca/lib/python3.9/site-packages/gradio/utils.py", line 448, in async_iteration
return next(iterator)
File "/home1/irname/Alpaca-LoRA-Serve/app.py", line 33, in chat_stream
for tokens in bot_response:
File "/home1/irname/Alpaca-LoRA-Serve/gen.py", line 105, in call
del final_tokens, input_ids
UnboundLocalError: local variable 'input_ids' referenced before assignment

I have a problem of excuting this app.
The gradio is correctly up but when I submit sentence, then give me above error.

I made up venv enviroment and "pip install -r requirements.txt" and gcc 7.1 compiled for CXXABI_1.3.9 problem.
My machine enviroment is Intel Xeon and nvidia P40 and software driver version of cuda is 11.7

Any plan to support gptq?

Hi there,
Do you have any plan to support 4bit quant like gptq? https://github.com/qwopqwop200/GPTQ-for-LLaMa

Something went wrong Expecting value: line 1 column 1 (char 0)

I had installed all of this requirement, but when I chat with it on website ,ti toast error.

Something went wrong
Expecting value: line 1 column 1 (char 0)

Is this error casued by my server? or how do I resolve this error, I have no idea. help me pls, thanks

Can't run starchat: fails with `AttributeError: module 'global_vars' has no attribute 'gen_config'`

Trying to run starchat and getting an error. The model downloaded but when I click "Confirm", I just get an error.

Update: Also getting this error when trying to download models. Might be an env issue, going to nuke my conda env and try again.

top_k and repetition_penalty

Can top_k and repetition_penalty be added? Thanks.

Enable MPS inference for Apple silicon

Could MPS support be added to enable faster inference using Apple silicon? See tloen/alpaca-lora#48 for an example implementation using the original 7B Alpaca-LoRA checkpoint.

add or not history_response

add or not history_response in "sub_convs = sub_convs + f"""### Instruction:{history_prompt}"，for enhance the expression of context?

Inference Multiple GPU

what does it take to adapt the code to do Inference on multiple GPU for the 30B model.

i have 4 X 3090 GPU and want to try it out.

i know it's possible for training but didn't see any adaptation for inference

I am running the 30b on colab premium plus. Inference via the gradio app is just going on for minutes without any response.Is that expected?

Can this project run on apple silicon M1pro chip with 16G ram?

why do not support load from local dir but always download

why do not support loading from local dir but always download? May I pull request for this?

Dark Mode

When OS is in dark mode, Gradio seems to be trying to adjust but failing in inverting a few colors. Here is an example

Workaround: You can pass the theme in the url like this, but wanted to see if could be fixed in code

http://localhost:6006/?__theme=light

prompt format (input vs instruction).

In the readme you say input->instruction->response, but during training it's instruction->input->response. Now the latter seems like a mistake, but it is what it is src.

So perhaps you need to change your prompting to match the training?

Error on w10 sentencepiece_wrap.cxx(2822): fatal error C1083: sentencepiece_processor.h: No such file or directory,

`Building wheels for collected packages: transformers, peft
Building wheel for transformers (pyproject.toml) ... done
Created wheel for transformers: filename=transformers-4.28.0.dev0-py3-none-any.whl size=6758050 sha256=87734f128d74dd329fd4b6084e27d77c656ac41b82cf15445a9b4c7b0f7e9970
Stored in directory: C:\temp\pip-ephem-wheel-cache-k0nzphiz\wheels\32\4b\78\f195c684dd3a9ed21f3b39fe8f85b48df7918581b6437be143
Building wheel for peft (pyproject.toml) ... done
Created wheel for peft: filename=peft-0.3.0.dev0-py3-none-any.whl size=40919 sha256=adaf0efec8276af4c54e2dd1c41f0266409814c65b4564c14cdd1b3ebeb8a156
Stored in directory: C:\temp\pip-ephem-wheel-cache-k0nzphiz\wheels\42\ec\c4\eb24dac74be83ba2ed4817037a784d1c775e317cb8de69963f
Successfully built transformers peft
Installing collected packages: tokenizers, sentencepiece, rfc3986, pytz, pydub, mpmath, ffmpy, bitsandbytes, xxhash, websockets, urllib3, uc-micro-py, typing-extensions, toolz, sympy, sniffio, six, regex, pyyaml, python-multipart, pyrsistent, pyparsing, pycryptodome, psutil, pillow, packaging, orjson, numpy, networkx, multidict, mdurl, markupsafe, loralib, kiwisolver, idna, h11, fsspec, frozenlist, fonttools, filelock, entrypoints, dill, cycler, colorama, charset-normalizer, certifi, attrs, async-timeout, aiofiles, yarl, tqdm, requests, python-dateutil, pydantic, pyarrow, multiprocess, markdown-it-py, linkify-it-py, jsonschema, jinja2, contourpy, click, anyio, aiosignal, uvicorn, torch, starlette, responses, pandas, mdit-py-plugins, matplotlib, huggingface-hub, httpcore, aiohttp, transformers, httpx, fastapi, altair, accelerate, peft, gradio, datasets
DEPRECATION: sentencepiece is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at pypa/pip#8559
Running setup.py install for sentencepiece ... error
error: subprocess-exited-with-error

× Running setup.py install for sentencepiece did not run successfully.
│ exit code: 1
╰─> [23 lines of output]
running install
D:\alpaca\alpaca-lora-serve\venv\Lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-311
creating build\lib.win-amd64-cpython-311\sentencepiece
copying src\sentencepiece/init.py -> build\lib.win-amd64-cpython-311\sentencepiece
copying src\sentencepiece/_version.py -> build\lib.win-amd64-cpython-311\sentencepiece
copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-cpython-311\sentencepiece
copying src\sentencepiece/sentencepiece_pb2.py -> build\lib.win-amd64-cpython-311\sentencepiece
running build_ext
building 'sentencepiece._sentencepiece' extension
creating build\temp.win-amd64-cpython-311
creating build\temp.win-amd64-cpython-311\Release
creating build\temp.win-amd64-cpython-311\Release\src
creating build\temp.win-amd64-cpython-311\Release\src\sentencepiece
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -ID:\alpaca\alpaca-lora-serve\venv\include -IC:\Python311\include -IC:\Python311\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpsrc/sentencepiece/sentencepiece_wrap.cxx /Fobuild\temp.win-amd64-cpython-311\Release\src/sentencepiece/sentencepiece_wrap.obj /std:c++17 /MT /I..\build\root\include
cl: Є®¬ ¤ п бва®Є warning D9025: ЇҐаҐ®ЇаҐ¤Ґ«ҐЁҐ "/MD" "/MT"
sentencepiece_wrap.cxx
src/sentencepiece/sentencepiece_wrap.cxx(2822): fatal error C1083: ЌҐ г¤ Ґвбп ®вЄалвм д ©« ўЄ«озҐЁҐ: sentencepiece_processor.h: No such file or directory,
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> sentencepiece

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

[notice] A new release of pip available: 22.3 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip`

How to run this serve?

Colab Notebook not working (and fix)

Notebook shows that model cannot be found.

I fixed it by changing the command to:

!python app.py --base-url $base_model --ft-ckpt-url $finetuned_model --share

instead of

!python app.py --base_url $base_model --ft_ckpt_url $finetuned_model --share

(that is, replace underscore with hyphen)

gradio running local: instant error, public: neverending wait time

issue:

if accessed locally, immediately after clicking "Send Prompt" these error toasts show:

if accessed publicly, no error shows, but neverending wait animation:

browser:

when investigating the browser console, no error shows for public, but 404 on queue/join for local.

shell:

on the shell no errors are visible in both cases:

python app.py --base_url "decapoda-research/llama-13b-hf" --ft_ckpt_url "chansung/alpaca-lora-13b" --port 6006 --share

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /.../envs/alpaca-serve/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /.../envs/alpaca-serve/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [00:26<00:00,  1.52it/s]
Running on local URL:  http://0.0.0.0:6006
Running on public URL: https://9816048a290abdc59f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces

sharing any good conversation sequences

I hope you guys found this application useful in some way. I want to include more examples, so please share some interesting conversation history if possible :) I really appreciate your participants in advance!

Exception: cublasLt ran into an error!

I am running on Ubuntu 22.04 with a 8GB GPU

The error as follows when giving the prompt:

cuBLAS API failed with status 15
A: torch.Size([24, 4096]), B: torch.Size([4096, 4096]), C: (24, 4096); (lda, ldb, ldc): (c_int(768), c_int(131072), c_int(768)); (m, n, k): (c_int(24), c_int(4096), c_int(4096))
Traceback (most recent call last):
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/gradio/routes.py", line 384, in run_predict
output = await app.get_blocks().process_api(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/gradio/blocks.py", line 1020, in process_api
result = await self.call_function(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/gradio/blocks.py", line 844, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/touhi/Desktop/llm/Alpaca-LoRA-Serve/app.py", line 58, in chat_batch
bot_responses = get_output_batch(
File "/home/touhi/Desktop/llm/Alpaca-LoRA-Serve/gen.py", line 22, in get_output_batch
generated_id = model.generate(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/peft/peft_model.py", line 581, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/transformers/generation/utils.py", line 1405, in generate
return self.greedy_search(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/transformers/generation/utils.py", line 2200, in greedy_search
outputs = self(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/peft/tuners/lora.py", line 522, in forward
result = super().forward(x)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/touhi/anaconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
error detected

cpu or amd gpu support

hello

there is flash-attn in requirements.txt

but flash-attn can't be installed on system without cuda:

`pip install flash-attn` output

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-ti1s7v3f/flash-attn_bcf49b8c56c940b99a4dea552b99f570/setup.py", line 106, in <module>
          raise_if_cuda_home_none("flash_attn")
        File "/tmp/pip-install-ti1s7v3f/flash-attn_bcf49b8c56c940b99a4dea552b99f570/setup.py", line 53, in raise_if_cuda_h
ome_none
          raise RuntimeError(
      RuntimeError: flash_attn was requested, but nvcc was not found.  Are you sure your environment has nvcc available?  
If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'de
vel' will provide nvcc.
      
      Warning: Torch did not find available GPUs on this system.
       If your intention is to cross-compile, this is not an error.
      By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
      Volta (compute capability 7.0), Turing (compute capability 7.5),
      and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
      If you wish to cross-compile for a single specific architecture,
      export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
      
      torch.__version__  = 2.0.1+cu117

please add ability to use LLM-As-Chatbot without flash-attn and without cuda

bug in chatbot UI

When i say "hello" , after telling the first sentence by text input box gets blocked with the loading animation.
i'm not able to enter anything!
i attached a screenshot below, any idea why this issue is there in chatbot UI?

also can you add "reset" button to reset chat?

match model_type: error on latest commit

ON April 3, with the latest commit it seemed to be working fine. After this commit the following error shows up:

946dbf7

Any suggestion?

CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary ***/miniconda3/envs/alpaca-serve/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Traceback (most recent call last):
  File "***/Alpaca-LoRA-Serve/app.py", line 12, in <module>
    from utils import get_chat_interface
  File "***/Alpaca-LoRA-Serve/utils.py", line 5
    match model_type:
          ^

How can I test the API with Postman? as Rest API

ggml-alpaca-3b-q4 on CPU?

Is it possible to run ggml-alpaca-3b-4q.bin model on cpu ram? And specify filepath instead of url to hf?

Is the "chansung/alpaca-lora-7b/" model private?

Trying to access the model but it is asking for user and password. Getting this type of error.

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100% 33/33 [01:13<00:00,  2.24s/it]
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_errors.py", line 259, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.9/dist-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/chansung/alpaca-lora-7b/resolve/main/adapter_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/peft/utils/config.py", line 99, in from_pretrained
    config_file = hf_hub_download(pretrained_model_name_or_path, CONFIG_NAME)
  File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/file_download.py", line 1160, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/file_download.py", line 1501, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_errors.py", line 291, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-641d8b09-25d2805039de008223098b08)

Repository Not Found for url: https://huggingface.co/chansung/alpaca-lora-7b/resolve/main/adapter_config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

https://huggingface.co/chansung/alpaca-lora-7b/resolve/main/adapter_config.json
this link is not accessible.

Distillation

These models don't scale very well to consumer hardware. I think you should put some time into trying to compress the model through distillation. You could take the original model, drop some of the layers and call this the student model. Then use the teacher (original) model to train the student using random inputs to the teacher/student models. The student model's objective is to try and match the teacher models outputs.

Issue with code - TensorRT not found

2023-04-07 12:21:36.077895: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

BUG REPORT
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//172.28.0.1'), PosixPath('8013'), PosixPath('http')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('--listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https'), PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-2kg58rxtccclt --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
usage: app.py
[-h]
[--base_url BASE_URL]
[--ft_ckpt_url FT_CKPT_URL]
[--port PORT]
[--batch_size BATCH_SIZE]
[--api_open]
[--share]
[--gen_config_path GEN_CONFIG_PATH]
[--gen_config_summarization_path GEN_CONFIG_SUMMARIZATION_PATH]
[--get_constraints_config_path GET_CONSTRAINTS_CONFIG_PATH]
[--multi_gpu]
[--force_download_ckpt]
app.py: error: unrecognized arguments: --sharefinetuned_model

Streaming response

Hi,

thank you very much for this work. Do you plan to support streaming response any time soon like text-generation-webui does ?

Best
Alexander

Other LORA models

Would it be possible to support other LORA adapters?

For example, I've finetuned llama on alpaca + dolly (https://huggingface.co/couchpotato888/dolpaca_gpt4_13b_1e_adapter/tree/main) but I can't seem to use it on your Colab (it tells me it's unsupported) - it would be really nice if I could use your interface with my finetune.

Thanks for the great work on it btw, the interface looks really nice!

Getting AttributeError: 'NoneType' object has no attribute 'device'

`This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
-------state_chatbots------
([],)
----inside
Below is a history of instructions that describe tasks, paired with an input that provides further context. Write a response that appropriately completes the request by remembering the conversation history.

Instruction:What is AI

Response:

there is only a prompt
Traceback (most recent call last):
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/gradio/routes.py", line 384, in run_predict
output = await app.get_blocks().process_api(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/gradio/blocks.py", line 1020, in process_api
result = await self.call_function(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/gradio/blocks.py", line 844, in call_function
prediction = await anyio.to_thread.run_sync(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/jupyter/Llama/Alpaca-LoRA-Serve/app.py", line 88, in chat
bot_responses = get_output(
File "/home/jupyter/Llama/Alpaca-LoRA-Serve/gen.py", line 27, in get_output
generated_id = model.generate(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/peft/peft_model.py", line 581, in generate
outputs = self.base_model.generate(**kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/transformers/generation/utils.py", line 1490, in generate
return self.beam_search(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/transformers/generation/utils.py", line 2749, in beam_search
outputs = self(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 770, in forward
outputs = self.model(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 619, in forward
layer_outputs = decoder_layer(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/peft/tuners/lora.py", line 522, in forward
result = super().forward(x)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 317, in forward
state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
File "/opt/conda/envs/alpaca-serve/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1698, in transform
prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'`

Is there any way to test the API with Postman and the Rest API?

For example, as in its API documentation that's how we can use the API. but still am unable to use the API correctly.

Traceback (most recent call last): File "F:\gpt\LLM-As-Chatbot-main\menu_app.py", line 7, in <module> import global_vars File "F:\gpt\LLM-As-Chatbot-main\global_vars.py", line 2, in <module> from transformers import GenerationConfig ImportError: cannot import name 'GenerationConfig' from 'transformers' (D:\Users\Administrator\anaconda3\lib\site-packages\transformers\init.py)

Traceback (most recent call last):
File "F:\gpt\LLM-As-Chatbot-main\menu_app.py", line 7, in
import global_vars
File "F:\gpt\LLM-As-Chatbot-main\global_vars.py", line 2, in
from transformers import GenerationConfig
ImportError: cannot import name 'GenerationConfig' from 'transformers' (D:\Users\Administrator\anaconda3\lib\site-packages\transformers_init_.py)
这是什么问题

Repetitive responses from 7B model.

Hi again!

I've been playing with the 7B model as detailed in the README.md and I noticed that, for some reason, the output is not as good as expected.

> List all Canadian provinces in alphabetical order.
Alberta, British Columbia, Manitoba, New Brunswick, Newfoundland, Nova Scotia, Ontario, Prince Edward Island, Quebec
> Which ones are on the east side?
British Columbia, Ontario, Quebec, Prince Edward Island, Nova Scotia, New Brunswick
> What foods are famous in each province?
British Columbia: Salmon, Fish & Chips, Poutine, Maple Syrup, Nanaimoimoimoimoimoimo

Alberta: Poutine, Tacoacoaco

Manitoba: Tacoaco

New Brunswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswswsw

Could there be a problem with the downloaded models? I did fix the tokenizer_config.json to get it to read "tokenizer_class": "LlamaTokenizer".

Thanks!

401 error

Thanks for this repos. When I run this line of code below, it throws 401 error.. It seems some model is private.

!python3.9 app.py --base_url decapoda-research/llama-7b-hf --ft_ckpt_url=chansung/alpaca-lora-7b --share yes

Error :
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/chansung/alpaca-lora-7b/resolve/main/adapter_config.json

Can the code use beam search in Streaming Mode?

Use my model on local server

I want to use my finetuned model on my local server. Is it possible?

How to run it with llama-7b-hf-int4?

I don't have enough GPU memory. Could you give the guide to run with INT4?

Default models not working

Did HuggingFace make more private?

How to support multiple users

Thanks a lot, but how to support multiple users.

AttributeError: module 'numpy' has no attribute 'object'

Current version cant start.
AttributeError: module 'numpy' has no attribute 'object'.
np.object was a deprecated alias for the builtin object. To avoid this error in existing code, use object by itself. Doing this will not modify any behavior and is safe.

Reproduce:
Just clone this git and follow intructions

AttributeError: 'NoneType' object has no attribute 'device'

Traceback (most recent call last):
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/tenacity/init.py", line 382, in call
result = fn(*args, **kwargs)
File "/home/tbe/repos/Alpaca-LoRA-Serve/gen.py", line 119, in _infer
return model_fn(**kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/peft/peft_model.py", line 529, in forward
return self.base_model(
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/peft/tuners/lora.py", line 522, in forward
result = super().forward(x)
File "/home/tbe/repos/bitsandbytes/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/tbe/repos/bitsandbytes/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/tbe/repos/stanford_alpaca/env/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/tbe/repos/bitsandbytes/bitsandbytes/autograd/_functions.py", line 317, in forward
state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
File "/home/tbe/repos/bitsandbytes/bitsandbytes/functional.py", line 1700, in transform
prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

I am getting this by following the instruction in the readme

query regarding MPT, RedPajama and Falcon models

Hey @deep-diver ,

is it possible to load
mpt-7b-chat
redpajama-7b-chat
falcon-7b-instruct

in 8 Bit ?

Have you tried loading these models in 8 Bit.
If so , how did you do it?

Are they supported for 8 bit inference using bitsandbytes?
if so , could you share an example implementation/configuration of loading these models in 8 bit.

Run offline

Can we run this offline with downloades models?

Error with bitsandbytes and running at all.

I've followed the instructions for installation on windows using Miniconda3.

Everything installs correctly but when I try to run it the following error occurs:
(alpaca-serve) C:\Alpaca-LoRA-Serve>python app.py --base_url C:\text-generation-webui-new\text-generation-webui\models\llama-7b-hf --ft_ckpt_url chainyo/alpaca-lora-7b

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
argument of type 'WindowsPath' is not iterable
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
argument of type 'WindowsPath' is not iterable
C:\Users\jeff_\miniconda3\envs\alpaca-serve\lib\site-packages\bitsandbytes\cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Traceback (most recent call last):
File "C:\Alpaca-LoRA-Serve\app.py", line 234, in
run(args)
File "C:\Alpaca-LoRA-Serve\app.py", line 112, in run
model, tokenizer = load_model(
File "C:\Alpaca-LoRA-Serve\model.py", line 11, in load_model
model = LlamaForCausalLM.from_pretrained(
File "C:\Users\jeff_\miniconda3\envs\alpaca-serve\lib\site-packages\transformers\modeling_utils.py", line 2619, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.

I'm unsure how to proceed. Any advice is appreciated.

Default launch gets stuck at "Loading.."

When launched with python app.py, all the api requests seem to be getting an extra / in the beginning.

The code offending seems to be line 709 in app.py

root_path=f"/{root_path}" should be replaced with root_path=f"{root_path}"

[Backlog] Add sparse models to options

I don't know of any right now, this is just a placeholder for people to fill in if they are aware of such options.

Here is an example of a performance increase from this pruning process: https://github.com/mlcommons/inference_results_v3.0/tree/main/open/NeuralMagic

Problem trying to run the 13B model

Hi!
I managed to run this fine with the 7B model and loRA as stated in the README. However, attempts at running the 13B model and corresponding loRA finetuning is not successful so far for me.

What I did:

$ echo $BASE_URL
decapoda-research/llama-13b-hf
$ echo $FINETUNED_CKPT_URL
chansung/alpaca-lora-13b
$ $ python app.py --base_url $BASE_URL --ft_ckpt_url $FINETUNED_CKPT_URL --port 6006

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/tong/yes/envs/alpaca-serve/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
/home/tong/yes/envs/alpaca-serve/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /home/tong/yes/envs/alpaca-serve/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Traceback (most recent call last):
  File "/home/tong/Alpaca-LoRA-Serve/app.py", line 175, in <module>
    run(args)
  File "/home/tong/Alpaca-LoRA-Serve/app.py", line 78, in run
    model, tokenizer = load_model(
  File "/home/tong/Alpaca-LoRA-Serve/model.py", line 12, in load_model
    model = LlamaForCausalLM.from_pretrained(
  File "/home/tong/yes/envs/alpaca-serve/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2587, in from_pretrained
    raise ValueError(
ValueError: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
                        these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
                        `device_map` to `from_pretrained`. Check
                        https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                        for more details.

Any help here would be greatly appreciated! Thank you for the amazing work here so far!

Manual fixes I have done.

In case it is salient, I did have to do the manual fixing ( cp libbitsandbytes_cuda112.so libbitsandbytes_cpu.so) in my Conda environment as reported by many people in the past to get even the 7B example to work.

Getting torch.cuda.OutOfMemoryError: CUDA out of memory

When running 30B version, getting error when executing line 18 of model.py.

model = PeftModel.from_pretrained(model, finetuned, device_map={'': 0})

If device_map={'': 0} is removed, then no error on loading the model.

I am using a server that has 7 GPUs and each of them 32GB.

Is the code supports multiple GPUs? If not, is it possible to make it multiple GPU supported?

i have some question about Input, Instruction, Response

I gave name and age for AI characterization. But when I asked for the name, It generated a different answer. �i think it should the input and instruction be reversed. what should you think about it?