Comments (12)
I've just published a llamafile 0.2 release: https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.2. The downloads on Hugging Face will be updated in a couple of hours.
It would be great if we didn't need to re-download the whole 4 GB file.
You don't have to redownload. Here's what you can try:
1. Download llamafile-server-0.2 (https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/llamafile-server-0.2) and chmod +x it.
2. Download zipalign-0.2 (https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/zipalign-0.2) and chmod +x it.
3. Run
   unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile mistral-7b-instruct-v0.1.Q4_K_M.gguf .args
   on the old 0.1 llamafile you downloaded earlier, to extract the GGUF weights and the arguments file.
4. Run
   ./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args
   to put the weights and arguments file inside your latest and greatest llamafile executable.
5. Run
   ./llamafile-server-0.2
   and enjoy! You've just recreated on your own what should be a bit-identical copy of the latest mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile that I'm uploading to Hugging Face presently.
So it takes a bit more effort than redownloading. But it's a great option if you don't have gigabit Internet.
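If you'd rather not type the steps by hand, here's the same procedure as a small shell script. This is a minimal sketch, assuming the old 0.1 llamafile, the new llamafile-server-0.2, and zipalign-0.2 all sit in the current directory; the filenames are the ones from the steps above, so adjust them if yours differ.

#!/bin/sh
set -e

# Filenames from the steps above.
old=mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
new=llamafile-server-0.2
gguf=mistral-7b-instruct-v0.1.Q4_K_M.gguf

chmod +x "$new" zipalign-0.2

# Extract the GGUF weights and the .args file from the old llamafile.
unzip "$old" "$gguf" .args

# Repack them into the new engine, using the same -0j flags as above.
./zipalign-0.2 -0j "$new" "$gguf" .args

echo "Done: ./$new now contains the weights"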
I cherry-picked OpenAI compatibility yesterday in 401dd08. It hasn't been incorporated into a release yet. I'll update this issue when the next release goes out. The llamafiles on Hugging Face will be updated too.
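Once that release is out, a quick sanity check that your binary has the OpenAI-compatible endpoints is to list the models (the same endpoint exercised later in this thread); a 200 response with a model list means the new code is in:

curl -i http://localhost:8080/v1/models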
For completeness, in case it helps, this curl command from llama.cpp/server/README.md works fine for me too:
(base) ➜ ~ curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
"role": "user",
"content": "Write a limerick about python exceptions"
}
]
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702042664,"id":"chatcmpl-LBodkSXWGkmxLu7pH39Lv2zF8jE6cxny","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%
The server logs:
slot 0 released (155 tokens in cache)
slot 0 is processing [task id: 11]
slot 0 : kv cache rm - [0, end)
print_timings: prompt eval time = 5827.25 ms / 77 tokens ( 75.68 ms per token, 13.21 tokens per second)
print_timings: eval time = 1809.26 ms / 43 runs ( 42.08 ms per token, 23.77 tokens per second)
print_timings: total time = 7636.51 ms
slot 0 released (121 tokens in cache)
{"timestamp":1702042664,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":57680,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
Some debugging info in case it's helpful:
(base) ➜ ~ system_profiler SPHardwareDataType|grep -v UUID
Hardware:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro18,3
Model Number: Z15J000PGB/A
Chip: Apple M1 Pro
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 16 GB
System Firmware Version: 10151.1.1
OS Loader Version: 10151.1.1
Serial Number (system): PL2C3FY765
Provisioning UDID: 00006000-000861892206801E
Activation Lock Status: Enabled
OK I've uploaded all the new .llamafiles to Hugging Face, for anyone who'd rather just re-download.
- https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main
- https://huggingface.co/jartine/mistral-7b.llamafile/tree/main
- https://huggingface.co/jartine/wizardcoder-13b-python/tree/main
Enjoy!
This one is working for me: https://huggingface.co/jartine/mistral-7b.llamafile/blob/main/mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
I'm using https://github.com/simonw/llm to connect to it, so not sure of the exact requests it's making.
(base) ➜ ~ llm --version
llm, version 0.1
(base) ➜ ~ cat '/Users/dave/Library/Application Support/io.datasette.llm/extra-openai-models.yaml'
- model_id: llamafile
model_name: llamafile
api_base: "http://localhost:8080/v1"
(base) ➜ ~ llm -m llamafile "what llm are you"
I am Mistral, a large language model trained by Mistral AI. How can I assist you today?
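For reference, llm is presumably just issuing a standard OpenAI-style chat completions request under the hood; something like the following curl should be roughly equivalent (a sketch, not the exact payload llm constructs):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llamafile",
    "messages": [{"role": "user", "content": "what llm are you"}]
  }'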
#412 is now merged, which gives you the option of using llamafile-upgrade-engine
to upgrade the engine more conveniently once llamafile is installed on your system.
Simply call llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
in the folder containing the llamafile.
Usage Example / Expected Console Output
$ llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
== Engine Version Check ==
Engine version from mistral-7b-instruct-v0.1-Q4_K_M-server: llamafile v0.4.1
Engine version from /usr/local/bin/llamafile: llamafile v0.8.4
== Repackaging / Upgrading ==
extracting...
Archive: mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
inflating: /tmp/tmp.FtvmAfSWty/.symtab.amd64
inflating: /tmp/tmp.FtvmAfSWty/.symtab.arm64
inflating: /tmp/tmp.FtvmAfSWty/llamafile/compcap.cu
inflating: /tmp/tmp.FtvmAfSWty/llamafile/llamafile.h
inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.cu
inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-alloc.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend-impl.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.cu
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-impl.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.m
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.metal
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-quants.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/completion.js
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.html
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.js
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/json-schema-to-grammar.mjs
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Anchorage
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Beijing
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Berlin
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Boulder
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Chicago
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GMT
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GST
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Honolulu
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Israel
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Japan
extracting: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/London
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Melbourne
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/New_York
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/UTC
extracting: /tmp/tmp.FtvmAfSWty/.cosmo
extracting: /tmp/tmp.FtvmAfSWty/.args
extracting: /tmp/tmp.FtvmAfSWty/mistral-7b-instruct-v0.1.Q4_K_M.gguf
extracting: /tmp/tmp.FtvmAfSWty/ggml-cuda.dll
repackaging...
== Completed ==
Original File: mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
Upgraded File: mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile
Will there be new server binaries, or can we use the already-downloaded ones like mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile? It would be great if we didn't need to re-download the whole 4 GB file.
@jart thanks, I followed the instructions you provided and got a v0.2 llamafile server binary. Now when I start the server (on a Mac M1) and try the curl command from llama.cpp/server/README.md, the server crashes consistently with this error:
llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed (cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0)
[1] 34103 abort ./llamafile-server-0.2
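If it helps with triage: the parenthesized part of that assert message looks like a ready-to-run command for symbolizing the backtrace. Assuming you have the cosmoaddr2line tool from Cosmopolitan available (an assumption on my part), running it should map the hex addresses to source locations:

cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0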
First of all, @jart, thank you!!! We are getting close:
curl -i http://localhost:8080/v1/models
HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 132
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp
{"data":[{"created":1701489258,"id":"mistral-7b-instruct-v0.1.Q4_K_M.gguf","object":"model","owned_by":"llamacpp"}],"object":"list"
But as @dzlab mentions, there is an assertion failure during the /v1/chat/completions POST that causes the server to crash (core dump).
llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed
(see llamafile/llama.cpp/server/json.h, lines 21305 to 21318 at commit 73ee0b1)
The request reported in the issue seems to work too:
(base) ➜ ~ curl -i http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
"role": "user",
"content": "Write a limerick about python exceptions"
}
]
}'
HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 470
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702027652,"id":"chatcmpl-PajeeqdFmAP5VNrzZztEJwKi9bF4czMj","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%
@dave1010 glad to hear it's working for you!
@jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?
@dave1010 Thank you! This helped me narrow in on the issue. I am able to run this model with all the API curl examples with no issue on my Mac (M2). The assertion error only shows up on my Linux Ubuntu 22.04 box (both CPU-only and with an RTX 3090 GPU).
@jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?
Will do! I'll open it up focused on Linux.