Comments (12)
I've just published a llamafile 0.2 release: https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.2. The downloads on Hugging Face will be updated in a couple of hours.
It would be great if we didn't need to re-download the whole 4 GB file.
You don't have to redownload. Here's what you can try:
1. Download llamafile-server-0.2 (https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/llamafile-server-0.2) and chmod +x it.
2. Download zipalign-0.2 (https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/zipalign-0.2) and chmod +x it.
3. Run
   unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile mistral-7b-instruct-v0.1.Q4_K_M.gguf .args
   on the old 0.1 llamafile you downloaded earlier, to extract the GGUF weights and the arguments file.
4. Run
   ./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args
   to put the weights and arguments file inside your latest and greatest llamafile executable.
5. Run
   ./llamafile-server-0.2
   and enjoy! You've just recreated on your own what should be a bit-identical copy of the latest mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile that I'm uploading to Hugging Face presently.
So it takes a bit more effort than redownloading. But it's a great option if you don't have gigabit Internet.
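If you'd rather not type the steps by hand, here's the same procedure as a small shell script. This is a minimal sketch, assuming the old 0.1 llamafile, the new llamafile-server-0.2, and zipalign-0.2 all sit in the current directory; the filenames are the ones from the steps above, so adjust them if yours differ.

#!/bin/sh
set -e

# Filenames from the steps above.
old=mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
new=llamafile-server-0.2
gguf=mistral-7b-instruct-v0.1.Q4_K_M.gguf

chmod +x "$new" zipalign-0.2

# Extract the GGUF weights and the .args file from the old llamafile.
unzip "$old" "$gguf" .args

# Repack them into the new engine, using the same -0j flags as above.
./zipalign-0.2 -0j "$new" "$gguf" .args

echo "Done: ./$new now contains the weights"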
I cherry-picked OpenAI compatibility yesterday in 401dd08. It hasn't been incorporated into a release yet. I'll update this issue when the next release goes out. The llamafiles on Hugging Face will be updated too.
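Once that release is out, a quick sanity check that your binary has the OpenAI-compatible endpoints is to list the models (the same endpoint exercised later in this thread); a 200 response with a model list means the new code is in:

curl -i http://localhost:8080/v1/models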
For completeness, in case it helps, this curl command from llama.cpp/server/README.md works fine for me too:
(base) ➜ ~ curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
"role": "user",
"content": "Write a limerick about python exceptions"
}
]
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702042664,"id":"chatcmpl-LBodkSXWGkmxLu7pH39Lv2zF8jE6cxny","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%
The server logs:
slot 0 released (155 tokens in cache)
slot 0 is processing [task id: 11]
slot 0 : kv cache rm - [0, end)
print_timings: prompt eval time = 5827.25 ms / 77 tokens ( 75.68 ms per token, 13.21 tokens per second)
print_timings: eval time = 1809.26 ms / 43 runs ( 42.08 ms per token, 23.77 tokens per second)
print_timings: total time = 7636.51 ms
slot 0 released (121 tokens in cache)
{"timestamp":1702042664,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":57680,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
Some debugging info in case it's helpful:
(base) ➜ ~ system_profiler SPHardwareDataType|grep -v UUID
Hardware:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro18,3
Model Number: Z15J000PGB/A
Chip: Apple M1 Pro
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 16 GB
System Firmware Version: 10151.1.1
OS Loader Version: 10151.1.1
Serial Number (system): PL2C3FY765
Provisioning UDID: 00006000-000861892206801E
Activation Lock Status: Enabled
OK I've uploaded all the new .llamafiles to Hugging Face, for anyone who'd rather just re-download.
- https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main
- https://huggingface.co/jartine/mistral-7b.llamafile/tree/main
- https://huggingface.co/jartine/wizardcoder-13b-python/tree/main
Enjoy!
This one is working for me: https://huggingface.co/jartine/mistral-7b.llamafile/blob/main/mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
I'm using https://github.com/simonw/llm to connect to it, so not sure of the exact requests it's making.
(base) ➜ ~ llm --version
llm, version 0.1
(base) ➜ ~ cat '/Users/dave/Library/Application Support/io.datasette.llm/extra-openai-models.yaml'
- model_id: llamafile
model_name: llamafile
api_base: "http://localhost:8080/v1"
(base) ➜ ~ llm -m llamafile "what llm are you"
I am Mistral, a large language model trained by Mistral AI. How can I assist you today?
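For reference, llm is presumably just issuing a standard OpenAI-style chat completions request under the hood; something like the following curl should be roughly equivalent (a sketch, not the exact payload llm constructs):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llamafile",
    "messages": [{"role": "user", "content": "what llm are you"}]
  }'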
#412 is now merged, which gives you the option of using llamafile-upgrade-engine
to upgrade the engine more conveniently once llamafile is installed on your system.
Simply call llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
in the folder containing the llamafile.
Usage Example / Expected Console Output
$ llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
== Engine Version Check ==
Engine version from mistral-7b-instruct-v0.1-Q4_K_M-server: llamafile v0.4.1
Engine version from /usr/local/bin/llamafile: llamafile v0.8.4
== Repackaging / Upgrading ==
extracting...
Archive: mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
inflating: /tmp/tmp.FtvmAfSWty/.symtab.amd64
inflating: /tmp/tmp.FtvmAfSWty/.symtab.arm64
inflating: /tmp/tmp.FtvmAfSWty/llamafile/compcap.cu
inflating: /tmp/tmp.FtvmAfSWty/llamafile/llamafile.h
inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.cu
inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-alloc.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend-impl.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.cu
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-impl.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.m
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.metal
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-quants.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml.h
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/completion.js
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.html
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.js
inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/json-schema-to-grammar.mjs
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Anchorage
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Beijing
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Berlin
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Boulder
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Chicago
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GMT
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GST
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Honolulu
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Israel
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Japan
extracting: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/London
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Melbourne
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/New_York
inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/UTC
extracting: /tmp/tmp.FtvmAfSWty/.cosmo
extracting: /tmp/tmp.FtvmAfSWty/.args
extracting: /tmp/tmp.FtvmAfSWty/mistral-7b-instruct-v0.1.Q4_K_M.gguf
extracting: /tmp/tmp.FtvmAfSWty/ggml-cuda.dll
repackaging...
== Completed ==
Original File: mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
Upgraded File: mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile
Will there be new server binaries, or can we use the already-downloaded ones like mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile? It would be great if we didn't need to re-download the whole 4 GB file.
@jart thanks, I followed the instructions you provided and got a v0.2 llamafile server binary. Now when I start the server (on a Mac M1) and try the curl command from llama.cpp/server/README.md, the server crashes consistently with this error:
llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed (cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0)
[1] 34103 abort ./llamafile-server-0.2
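If it helps with triage: the parenthesized part of that assert message looks like a ready-to-run command for symbolizing the backtrace. Assuming you have the cosmoaddr2line tool from Cosmopolitan available (an assumption on my part), running it should map the hex addresses to source locations:

cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0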
First of all, @jart, thank you!!! We are getting close:
curl -i http://localhost:8080/v1/models
HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 132
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp
{"data":[{"created":1701489258,"id":"mistral-7b-instruct-v0.1.Q4_K_M.gguf","object":"model","owned_by":"llamacpp"}],"object":"list"
But as @dzlab mentions, there is an assertion failure during the /v1/chat/completions POST that causes the server to crash (core dump).
llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed
(see llamafile/llama.cpp/server/json.h, lines 21305 to 21318 at commit 73ee0b1)
The request reported in the issue seems to work too:
(base) ➜ ~ curl -i http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
"role": "user",
"content": "Write a limerick about python exceptions"
}
]
}'
HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 470
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702027652,"id":"chatcmpl-PajeeqdFmAP5VNrzZztEJwKi9bF4czMj","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%
@dave1010 glad to hear it's working for you!
@jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?
@dave1010 Thank you! This helped me narrow in on the issue. I am able to run this model with all the API curl examples with no issue on my Mac (M2). The assertion error only shows up on my Linux Ubuntu 22.04 box (both CPU-only and with an RTX 3090 GPU).
@jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?
Will do! I'll open it up focused on Linux.