
Comments (12)

jart commented on July 30, 2024

I've just published a llamafile 0.2 release (https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.2). The downloads on Hugging Face will be updated in a couple of hours.

> It would be great if we didn't need to re-download the whole 4GB file.

You don't have to redownload. Here's what you can try:

  1. Download llamafile-server-0.2 and chmod +x it
  2. Download zipalign-0.2 and chmod +x it
  3. Run `unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` on the old 0.1 llamafile you downloaded earlier, to extract the GGUF weights and arguments files.
  4. Run `./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` to put the weights and arguments file inside your latest and greatest llamafile executable.
  5. Run `./llamafile-server-0.2` and enjoy! You've just recreated on your own what should be a bit-identical copy of the latest `mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile` that I'm uploading to Hugging Face presently.

So it takes a bit more effort than redownloading. But it's a great option if you don't have gigabit Internet.
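
If you'd rather script the whole procedure, here is the same sequence as one shell session. This is a minimal sketch: it assumes the old 0.1 llamafile sits in the current directory, and fetching the release binaries with curl -LO is my own substitution for downloading them by hand (the URLs are the ones from the 0.2 release page):

# download the new engine and the zipalign tool from the 0.2 release
curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/llamafile-server-0.2
curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/zipalign-0.2
chmod +x llamafile-server-0.2 zipalign-0.2
# extract the GGUF weights and the arguments file from the old 0.1 llamafile
unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile mistral-7b-instruct-v0.1.Q4_K_M.gguf .args
# pack them into the new 0.2 executable, then run it
./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args
./llamafile-server-0.2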

jart commented on July 30, 2024

I cherry-picked OpenAI compatibility yesterday in 401dd08. It hasn't been incorporated into a release yet. I'll update this issue when the next release goes out. The llamafiles on Hugging Face will be updated too.

dave1010 commented on July 30, 2024

For completeness, in case it helps: the curl command from llama.cpp/server/README.md works fine for me too.

(base) ➜  ~ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
    {
        "role": "system",
        "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
    },
    {
        "role": "user",
        "content": "Write a limerick about python exceptions"
    }
    ]
    }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702042664,"id":"chatcmpl-LBodkSXWGkmxLu7pH39Lv2zF8jE6cxny","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%

The server logs:

slot 0 released (155 tokens in cache)
slot 0 is processing [task id: 11]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    5827.25 ms /    77 tokens (   75.68 ms per token,    13.21 tokens per second)
print_timings:        eval time =    1809.26 ms /    43 runs   (   42.08 ms per token,    23.77 tokens per second)
print_timings:       total time =    7636.51 ms
slot 0 released (121 tokens in cache)
{"timestamp":1702042664,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":57680,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

Some debugging info in case it's helpful:

(base) ➜  ~ system_profiler SPHardwareDataType|grep -v UUID
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro18,3
      Model Number: Z15J000PGB/A
      Chip: Apple M1 Pro
      Total Number of Cores: 10 (8 performance and 2 efficiency)
      Memory: 16 GB
      System Firmware Version: 10151.1.1
      OS Loader Version: 10151.1.1
      Serial Number (system): PL2C3FY765
      Provisioning UDID: 00006000-000861892206801E
      Activation Lock Status: Enabled

jart commented on July 30, 2024

OK I've uploaded all the new .llamafiles to Hugging Face, for anyone who'd rather just re-download.

Enjoy!

dave1010 commented on July 30, 2024

This one is working for me: https://huggingface.co/jartine/mistral-7b.llamafile/blob/main/mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile

I'm using https://github.com/simonw/llm to connect to it, so I'm not sure of the exact requests it's making.

(base) ➜  ~ llm --version
llm, version 0.1
(base) ➜  ~ cat '/Users/dave/Library/Application Support/io.datasette.llm/extra-openai-models.yaml'
- model_id: llamafile
  model_name: llamafile
  api_base: "http://localhost:8080/v1"
(base) ➜  ~ llm -m llamafile "what llm are you"
I am Mistral, a large language model trained by Mistral AI. How can I assist you today?
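
For reference, the request llm sends is presumably just an OpenAI-style chat completion against that api_base; a hand-rolled equivalent (a sketch, not captured from llm itself) would look like:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "llamafile",
    "messages": [{"role": "user", "content": "what llm are you"}]
    }'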

mofosyne commented on July 30, 2024

> I've just published a llamafile 0.2 release (https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.2). The downloads on Hugging Face will be updated in a couple of hours.
>
> > It would be great if we didn't need to re-download the whole 4GB file.
>
> You don't have to redownload. Here's what you can try:
>
> 1. Download [llamafile-server-0.2](https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/llamafile-server-0.2) and chmod +x it
>
> 2. Download [zipalign-0.2](https://github.com/Mozilla-Ocho/llamafile/releases/download/0.2/zipalign-0.2) and chmod +x it
>
> 3. Run `unzip mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` on the old 0.1 llamafile you downloaded earlier, to extract the GGUF weights and arguments files.
>
> 4. Run `./zipalign-0.2 -0j llamafile-server-0.2 mistral-7b-instruct-v0.1.Q4_K_M.gguf .args` to put the weights and arguments file inside your latest and greatest llamafile executable.
>
> 5. Run `./llamafile-server-0.2` and enjoy! You've just recreated on your own what should be a bit-identical copy of the latest `mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile` that I'm uploading to Hugging Face presently.
>
> So it takes a bit more effort than redownloading. But it's a great option if you don't have gigabit Internet.

#412 is now merged, which gives you the option of using `llamafile-upgrade-engine` to upgrade the engine more conveniently once llamafile is installed on your system.

This is done simply by running `llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile` in the folder containing the llamafile.

Usage Example / Expected Console Output
$ llamafile-upgrade-engine mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile 
== Engine Version Check ==
Engine version from mistral-7b-instruct-v0.1-Q4_K_M-server: llamafile v0.4.1
Engine version from /usr/local/bin/llamafile: llamafile v0.8.4
== Repackaging / Upgrading ==
extracting...
Archive:  mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
  inflating: /tmp/tmp.FtvmAfSWty/.symtab.amd64  
  inflating: /tmp/tmp.FtvmAfSWty/.symtab.arm64  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/compcap.cu  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/llamafile.h  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.cu  
  inflating: /tmp/tmp.FtvmAfSWty/llamafile/tinyblas.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-alloc.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend-impl.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-backend.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.cu  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-cuda.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-impl.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.m  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-metal.metal  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml-quants.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/ggml.h  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/completion.js  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.html  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/index.js  
  inflating: /tmp/tmp.FtvmAfSWty/llama.cpp/server/public/json-schema-to-grammar.mjs  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Anchorage  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Beijing  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Berlin  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Boulder  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Chicago  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GMT  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/GST  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Honolulu  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Israel  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Japan  
 extracting: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/London  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/Melbourne  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/New_York  
  inflating: /tmp/tmp.FtvmAfSWty/usr/share/zoneinfo/UTC  
 extracting: /tmp/tmp.FtvmAfSWty/.cosmo  
 extracting: /tmp/tmp.FtvmAfSWty/.args  
 extracting: /tmp/tmp.FtvmAfSWty/mistral-7b-instruct-v0.1.Q4_K_M.gguf  
 extracting: /tmp/tmp.FtvmAfSWty/ggml-cuda.dll  
repackaging...
== Completed ==
Original File: mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
Upgraded File: mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile
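
A quick way to sanity-check the result (a sketch; it assumes the upgraded file runs the same way as the original and that the embedded engine reports itself via --version, as the version-check step above suggests):

chmod +x mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile
./mistral-7b-instruct-v0.1-Q4_K_M-server.updated.llamafile --version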

dzlab commented on July 30, 2024

Will there be new server binaries, or can we use the already-downloaded ones like mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile? It would be great if we didn't need to re-download the whole 4GB file.

dzlab commented on July 30, 2024

@jart thanks, I followed the instructions you provided and got a v0.2 llamafile server binary. Now when I start the server (on a Mac M1) and try the curl command from llama.cpp/server/README.md, the server consistently crashes with this error:

llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed (cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0)
[1]    34103 abort      ./llamafile-server-0.2
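
If it helps with debugging, the error message embeds the exact command for symbolizing those crash addresses; copied verbatim from the output above:

cosmoaddr2line /Applications/HOME/Tools/llamafile/llamafile-server-0.2 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0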

jasonacox commented on July 30, 2024

First of all, @jart, thank you!!! We are getting close:

curl -i http://localhost:8080/v1/models
HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 132
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp

{"data":[{"created":1701489258,"id":"mistral-7b-instruct-v0.1.Q4_K_M.gguf","object":"model","owned_by":"llamacpp"}],"object":"list"

But as @dzlab mentions, there is an assertion failure during the /v1/chat/completions POST that causes the server to crash (core dump).

llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed 

llamafile/llama.cpp/server/json.h, lines 21305 to 21318 at 73ee0b1 (the assertion fires when const operator[] is asked for a key that the JSON object doesn't contain):

/// @brief access specified object element
/// @sa https://json.nlohmann.me/api/basic_json/operator%5B%5D/
const_reference operator[](const typename object_t::key_type& key) const
{
    // const operator[] only works for objects
    if (JSON_HEDLEY_LIKELY(is_object()))
    {
        auto it = m_value.object->find(key);
        JSON_ASSERT(it != m_value.object->end());
        return it->second;
    }
    JSON_THROW(type_error::create(305, detail::concat("cannot use operator[] with a string argument with ", type_name()), this));
}

dave1010 commented on July 30, 2024

The request reported in the issue seems to work too:

(base) ➜  ~ curl -i http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "system",
    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
    "role": "user",
    "content": "Write a limerick about python exceptions"
}
]
}'

HTTP/1.1 200 OK
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 470
Content-Type: application/json
Keep-Alive: timeout=5, max=5
Server: llama.cpp

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"There once was a programmer named Mike\nWho wrote code that would often choke\nOn exceptions he'd throw\nAnd debugging would slow\nBut with Python, he learned to take the high road.","role":"assistant"}}],"created":1702027652,"id":"chatcmpl-PajeeqdFmAP5VNrzZztEJwKi9bF4czMj","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":43,"prompt_tokens":77,"total_tokens":120}}%

jart commented on July 30, 2024

@dave1010 glad to hear it's working for you!

@jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?

jasonacox commented on July 30, 2024

@dave1010 Thank you! This helped me narrow in on the issue. I am able to get this model to run with all the API curl examples with no issue on my Mac (M2). The assertion error only shows up on my Linux Ubuntu 22.04 box (both CPU-only and with an RTX 3090 GPU).

> @jasonacox could you post a new issue sharing the curl command you used that caused the assertion to fail?

Will do! I'll open it up focused on Linux.
