1b5d / llm-api
Run any Large Language Model behind a unified API
License: MIT License
This repo really taught me why a running example is more important than the project itself.
I tried everything in the README but couldn't get this to work. Here is my config.yml:
# models_dir: /models
# model_family: gptq_llama
# setup_params:
#   repo_id: repo_id
#   filename: model.safetensors
# model_params:
#   group_size: 128
#   wbits: 4
#   cuda_visible_devices: "0"
#   device: "cuda:0"
#   st_device: 0
# file: config.yaml
# models_dir: /models
# model_family: vicuna
# setup_params:
#   repo_id: TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g
#   filename: vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt
#   convert: false
#   migrate: false
# model_params:
#   group_size: 128
#   wbits: 4
#   cuda_visible_devices: "0"
#   device: "cuda:0"
#   st_device: 0
#   ctx_size: 2000
#----------------------- alpaca
models_dir: /models
model_family: alpaca
model_name: 7b
setup_params:
  repo_id: Sosaka/Alpaca-native-4bit-ggml
  filename: ggml-alpaca-7b-q4.bin
  convert: false
  migrate: false
model_params:
  ctx_size: 2000
  seed: -1
  n_threads: 8
  n_batch: 2048
  n_parts: -1
  last_n_tokens_size: 16
#-----------------------
# models_dir: /models # dir inside the container
# model_family: alpaca
# model_name: 7b
# setup_params:
#   key: value
# model_params:
#   key: value
# models_dir: /models # dir inside the container
# model_family: alpaca
# model_name: 7b
# setup_params:
#   repo_id: user/repo_id
#   filename: ggml-model-q4_0.bin
#   convert: false
#   migrate: false
# model_params:
#   ctx_size: 2000
#   seed: -1
#   n_threads: 8
#   n_batch: 2048
#   n_parts: -1
#   last_n_tokens_size: 16
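Once the container is up with a config like the one above, a quick way to sanity-check the model is to call it through the langchain_llm_api client that appears later in this thread. This is a minimal sketch, assuming the server is reachable on its defaults and that the "temp" parameter name matches the example further down this page:

# Minimal smoke test against a running llm-api container, using the
# langchain_llm_api client shown later in this thread.
# Assumptions: the server is on its default host/port, and generation
# params such as "temp" are passed through as in the later example.
from langchain_llm_api import LLMAPI

llm = LLMAPI(
    params={"temp": 0.2},
    verbose=True,
)

print(llm("Tell me a one-sentence fact about alpacas."))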
Models directory:
First, thanks for sharing your repo. I don't understand why I can't use this model directly; I may be missing something since I'm just learning. I was hoping to avoid having to download and mess around with llama-cpp to get this working. My goal is to spin up a web server so I can generate and use embeddings with these models, and to use LangChain at some point too.
My config.yaml
models_dir: /models
model_family: alpaca
model_name: 7b
setup_params:
  repo_id: Sosaka/Alpaca-native-4bit-ggml
  filename: ggml-alpaca-7b-q4.bin
  convert: true
  migrate: false
model_params:
  ctx_size: 2000
  seed: -1
  n_threads: 8
  n_batch: 2048
  n_parts: -1
  last_n_tokens_size: 16
What am I doing wrong? I tried setting convert: true, thinking it would convert the old model; I noticed convert.py in the repo, but it doesn't seem to be used. I'm also confused about which to use, convert.py or llama-cpp's convert-unversioned-ggml-to-ggml.py, since the log says to use ggerganov's convert script.
Guidance?
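For reference, a minimal sketch of the embeddings part of that goal. It assumes APIEmbeddings from langchain_llm_api (imported elsewhere in this thread) follows LangChain's standard Embeddings interface with embed_query / embed_documents; those method names come from the LangChain interface, not from this repo's docs, so double-check them:

# Sketch of the "embeddings + LangChain" goal described above.
# Assumption: APIEmbeddings implements LangChain's Embeddings interface
# (embed_query / embed_documents) against the running llm-api server,
# and can be constructed with defaults (check the package for host/port options).
from langchain_llm_api import APIEmbeddings

embeddings = APIEmbeddings()

query_vector = embeddings.embed_query("What is an alpaca?")
doc_vectors = embeddings.embed_documents(["Alpacas are camelids.", "Paris is in France."])

print(len(query_vector), len(doc_vectors))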
I am getting this issue running it in Docker with the default setup; torch isn't in the requirements or the default Dockerfile?
Traceback (most recent call last):
  File "/llm-api/./app/main.py", line 14, in <module>
    from app.llms import get_model_class
  File "/llm-api/app/llms/__init__.py", line 7, in <module>
    from .gptq_llama.gptq_llama import GPTQLlamaLLM
  File "/llm-api/app/llms/gptq_llama/__init__.py", line 4, in <module>
    from .gptq_llama import GPTQLlamaLLM
  File "/llm-api/app/llms/gptq_llama/gptq_llama.py", line 10, in <module>
    import torch  # pylint: disable=import-error
ModuleNotFoundError: No module named 'torch'
I have messed around with injecting all the requirements, but I am wondering if I am missing something; I'd have thought the packaged docker-compose file should just work out of the box for CPU, right?
I can bypass all of that by adding:
RUN pip install torch safetensors transformers
to the Dockerfile, but then I get this issue:
Traceback (most recent call last):
  File "/llm-api/app/llms/gptq_llama/gptq_llama.py", line 21, in <module>
    from .GPTQforLLaMa import quant
ImportError: cannot import name 'quant' from 'app.llms.gptq_llama.GPTQforLLaMa' (unknown location)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/llm-api/./app/main.py", line 14, in <module>
    from app.llms import get_model_class
  File "/llm-api/app/llms/__init__.py", line 7, in <module>
    from .gptq_llama.gptq_llama import GPTQLlamaLLM
  File "/llm-api/app/llms/gptq_llama/__init__.py", line 4, in <module>
    from .gptq_llama import GPTQLlamaLLM
  File "/llm-api/app/llms/gptq_llama/gptq_llama.py", line 24, in <module>
    raise ImportError(
ImportError: the GPTQ-for-LLaMa lib is missing, please install it first
I was hoping that with Docker it would have all the dependencies installed automatically; am I doing it wrong?
First of all thanks for the repo, looks ideal.
I'm using gpt-x-alpaca-13b-native-4bit-128g-cuda.pt
which can be found at repo anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
on HF.
The error I'm receiving is
invalid model file (bad magic [got 0x4034b50 want 0x67676a74])
Is this something which should be compatible?
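If I read the magic numbers right, 0x4034b50 is "PK\x03\x04" read as a little-endian uint32, i.e. a zip container, which is what a PyTorch .pt checkpoint is, while 0x67676a74 is the "ggjt" magic the ggml/llama.cpp loader expects. That would suggest the GPTQ .pt file is not directly loadable by a ggml-based model family and would need either the gptq_llama family or a converted GGML file. A small hedged check script (the file path argument is a placeholder):

# Hypothetical helper: peek at the first four bytes of a downloaded model file
# to tell a PyTorch/GPTQ checkpoint (a zip archive, "PK\x03\x04") from a
# GGML-family file ("ggml"/"ggmf"/"ggjt" magics used by llama.cpp).
import struct
import sys

GGML_MAGICS = {0x67676D6C: "ggml", 0x67676D66: "ggmf", 0x67676A74: "ggjt"}

def file_magic(path: str) -> int:
    with open(path, "rb") as f:
        return struct.unpack("<I", f.read(4))[0]

magic = file_magic(sys.argv[1])
if magic == 0x04034B50:
    print(f"{hex(magic)}: zip container -> PyTorch .pt / GPTQ checkpoint, not GGML")
elif magic in GGML_MAGICS:
    print(f"{hex(magic)}: {GGML_MAGICS[magic]} -> GGML-family file for llama.cpp loaders")
else:
    print(f"{hex(magic)}: unknown format")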
Expected response:
The capital of France is Paris.
Actual response:
</s> What is the capital of France?\n The capital of France is Paris.</s>
Code:
from langchain_llm_api import LLMAPI, APIEmbeddings

llm = LLMAPI(
    params={"temp": 0.2},
    verbose=True,
)

print(llm("What is the capital of France?"))
Config:
models_dir: /models
model_family: gptq_llama
setup_params:
  repo_id: TheBloke/WizardLM-7B-uncensored-GPTQ
  filename: WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
model_params:
  group_size: 128
  wbits: 4
  cuda_visible_devices: "0"
  device: "cuda:0"
  st_device: 0
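Not a server-side fix, but here is a client-side workaround sketch for the echoed prompt and the </s> markers. The exact token strings are assumptions taken from the actual response pasted above:

# Client-side cleanup sketch for the output shown above: drop special tokens
# and the echoed prompt. The "</s>" / "<s>" strings are taken from the
# response pasted in this issue, not from the server's documented behaviour.
def clean_completion(prompt: str, completion: str) -> str:
    for token in ("</s>", "<s>"):
        completion = completion.replace(token, "")
    completion = completion.strip()
    prompt = prompt.strip()
    # If the model echoes the prompt back, strip it from the front.
    if completion.startswith(prompt):
        completion = completion[len(prompt):].strip()
    return completion

print(clean_completion(
    "What is the capital of France?",
    "</s> What is the capital of France?\n The capital of France is Paris.</s>",
))
# -> "The capital of France is Paris."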
Firstly, thank you so much for building this. I am really looking forward to using it with LangChain to get chat functions into Slack. Hopefully they integrate it soon! When my Python is a bit more up to scratch, I'll hopefully be able to get involved!
Secondly, I'm experiencing an issue when using the GPU containers for my models using safetensors. The output is long but here is a snippet:
llm-api-app | File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
llm-api-app | raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
llm-api-app | RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
llm-api-app | Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.g_idx", "model.layers.0.self_attn.o_proj.g_idx", "model.layers.0.self_attn.q_proj.g_idx", "model.layers.0.self_attn.v_proj.g_idx", "model.layers.0.mlp.down_proj.g_idx", "model.layers.0.mlp.gate_proj.g_idx", "model.layers.0.mlp.up_proj.g_idx", "model.layers.1.self_attn.k_proj.g_idx", "model.layers.1.self_attn.o_proj.g_idx", "model.layers.1.self_attn.q_proj.g_idx", "model.layers.1.self_attn.v_proj.g_idx", "model.layers.1.mlp.down_proj.g_idx", "model.layers.1.mlp.gate_proj.g_idx", "model.layers.1.mlp.up_proj.g_idx", "model.layers.2.self_attn.k_proj.g_idx", "model.layers.2.self_attn.o_proj.g_idx", "model.layers.2.self_attn.q_proj.g_idx", "model.layers.2.self_attn.v_proj.g_i ... snip ...
I have tried a few models, which all exhibit this issue. If you want to test with one, I reliably get the error with the following model:
TheBloke/wizard-mega-13B-GPTQ
After running a safetensors model I also then cannot run other models, e.g. anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g.
I am using your upstream image for this (not building locally) via the provided compose file.
Is there anything I should be doing differently when using GPTQ models? Please let me know if you need more information.
Maybe it is a silly question, but as a developer experimenting with tools I would like the option to upload my own data (ebooks etc.) as plain text plus metadata and train the model with it, to build a private knowledge base with an AI agent. Is that something I can count on?
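As far as I know there is no fine-tuning/training support in this kind of setup; the usual pattern for private knowledge is retrieval rather than training: embed your documents and feed the relevant chunks to the model at query time. A rough sketch with LangChain, assuming LLMAPI / APIEmbeddings behave as in the earlier examples on this page and that langchain and faiss-cpu are installed; class and method names beyond those two are standard LangChain from this era and should be checked against your installed version:

# Retrieval sketch for "private knowledge" over your own ebooks/notes.
# Assumptions: llm-api is running; langchain_llm_api's LLMAPI / APIEmbeddings
# work as in the earlier examples; langchain and faiss-cpu are installed.
from langchain_llm_api import LLMAPI, APIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

texts = [
    "Chapter 1: our internal deployment process ...",
    "Chapter 2: troubleshooting the build pipeline ...",
]  # replace with chunks extracted from your ebooks / documents

# Index the chunks using embeddings served by llm-api.
db = FAISS.from_texts(texts, APIEmbeddings())

# Answer questions using the retrieved chunks as context.
qa = RetrievalQA.from_chain_type(
    llm=LLMAPI(params={"temp": 0.2}),
    retriever=db.as_retriever(),
)
print(qa.run("How do we troubleshoot the build pipeline?"))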
The code seems to only support CPU at the moment.
Would Docker be able to access GPUs?
I presume there is a minimum CPU requirement, like needing AVX2, AVX-512, F16C or something?
Could you document the minimum instruction set and extensions required?
root@1d1c4289f303:/llm-api# python app/main.py
2023-10-26 23:31:19,237 - INFO - llama - found an existing model /models/llama_601507219781/ggml-model-q4_0.bin
2023-10-26 23:31:19,237 - INFO - llama - setup done successfully for /models/llama_601507219781/ggml-model-q4_0.bin
Illegal instruction (core dumped)
root@1d1c4289f303:/llm-api#
--- modulename: llama, funcname: __init__
llama.py(289): self.verbose = verbose
llama.py(291): self.numa = numa
llama.py(292): if not Llama.__backend_initialized:
llama.py(293): if self.verbose:
llama.py(294): llama_cpp.llama_backend_init(self.numa)
--- modulename: llama_cpp, funcname: llama_backend_init
llama_cpp.py(475): return _lib.llama_backend_init(numa)
Illegal instruction (core dumped)
I assume this has CPU requirements.
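A quick way to see which SIMD extensions the host CPU actually reports (Linux only; it just parses /proc/cpuinfo), to compare against what the llama.cpp build in the container was compiled for:

# Check the SIMD extensions llama.cpp builds commonly rely on.
# Linux only: parses /proc/cpuinfo. Run inside the container or on the host.
FLAGS_OF_INTEREST = ("sse3", "ssse3", "avx", "avx2", "avx512f", "f16c", "fma")

cpu_flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            cpu_flags.update(line.split(":", 1)[1].split())
            break

for flag in FLAGS_OF_INTEREST:
    print(f"{flag}: {'yes' if flag in cpu_flags else 'no'}")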
ENV CMAKE_ARGS "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
OpenBLAS can be built for multiple targets with runtime detection of the target CPU by specifying DYNAMIC_ARCH=1 in Makefile.rule, on the gmake command line, or as -DDYNAMIC_ARCH=TRUE in cmake.
https://github.com/OpenMathLib/OpenBLAS/blob/develop/README.md
At the moment the LLMs and their associated inference / embeddings live in class-specific implementations. Not sure this is necessary or DRY once you start catering for different architectures (e.g. Dolly v2).
Consider refactoring with interfaces using a config: AutoConfig = AutoConfig.from_pretrained(path_or_repo)
type approach, as this might allow scaling to different model types without heavy configuration on the user's side or massive amounts of boilerplate rewriting of specific implementations.
E.g. stub:

import logging

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)

# path_or_repo and model_kwargs are provided by the caller.
config: AutoConfig = AutoConfig.from_pretrained(path_or_repo)

if config.model_type == "llama":
    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer: LlamaTokenizer = LlamaTokenizer.from_pretrained(path_or_repo)
    model: LlamaForCausalLM = LlamaForCausalLM.from_pretrained(
        path_or_repo, **model_kwargs
    )  # , load_in_8bit=True, device_map="auto")
elif config.model_type == "gpt_neox":
    from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

    tokenizer: GPTNeoXTokenizerFast = GPTNeoXTokenizerFast.from_pretrained(path_or_repo)
    model: GPTNeoXForCausalLM = GPTNeoXForCausalLM.from_pretrained(
        path_or_repo, **model_kwargs
    )  # , load_in_8bit=True, device_map="auto")
else:
    logger.error(f"Unable to determine model type {config.model_type}. Attempting AutoModel")
    try:
        tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(path_or_repo)
        model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
            path_or_repo, **model_kwargs
        )  # , load_in_8bit=True, device_map="auto")
    except Exception as e:
        logger.exception(e)
        raise
Love this! Being able to spin up a local llm easily and interact with it over tried-and-true HTTP is a dream.
I frequently use vanilla llama with custom LoRAs, and need the ability to load a model with a LoRA and, ideally, switch the LoRA and even load multiple LoRAs in a given order.
My Python is not the strongest - any chance of getting a feature like this added? I'm fairly certain we can take inspiration from text-generation-webui, which allows loading a LoRA and changing the LoRA while the model is loaded. (AFAIK, TGWUI does not support loading more than one LoRA at a time yet.)
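For reference, outside of llm-api the common way to apply a LoRA to a vanilla transformers LLaMA model is the peft library. A minimal sketch with placeholder paths (this is not an existing llm-api feature, and multi-LoRA stacking support varies by peft version):

# Sketch of loading a base LLaMA model plus a LoRA adapter with the peft
# library. Paths are placeholders; this is not an llm-api feature today.
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base = LlamaForCausalLM.from_pretrained("/path/to/base-llama-7b")
tokenizer = LlamaTokenizer.from_pretrained("/path/to/base-llama-7b")

# Apply a LoRA adapter on top of the base weights; additional adapters can
# be added with load_adapter() if the installed peft version supports it.
model = PeftModel.from_pretrained(base, "/path/to/my-lora")

inputs = tokenizer("Hello, alpaca!", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))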