mozilla-ocho / llamafile

Distribute and run LLMs with a single file.

Home Page: https://llamafile.ai

License: Other

Makefile 0.08% Shell 0.06% C++ 93.50% C 4.25% Python 0.08% JavaScript 0.03% HTML 0.11% Objective-C 0.49% Metal 0.98% Cuda 0.01% Roff 0.17% Batchfile 0.01% AGS Script 0.22%

llamafile's Introduction

llamafile



[line drawing of llama animal head in front of slightly open manila folder filled with files]

llamafile lets you distribute and run LLMs with a single file. (announcement blog post)

Our goal is to make open LLMs much more accessible to both developers and end users. We're doing that by combining llama.cpp with Cosmopolitan Libc into one framework that collapses all the complexity of LLMs down to a single-file executable (called a "llamafile") that runs locally on most computers, with no installation.


llamafile is a Mozilla Builders project.

Quickstart

The easiest way to try it for yourself is to download our example llamafile for the LLaVA model (license: LLaMA 2, OpenAI). LLaVA is a new LLM that can do more than just chat; you can also upload images and ask it questions about them. With llamafile, this all happens locally; no data ever leaves your computer.

  1. Download llava-v1.5-7b-q4.llamafile (4.29 GB).

  2. Open your computer's terminal.

  3. If you're using macOS, Linux, or BSD, you'll need to grant permission for your computer to execute this new file. (You only need to do this once.)

chmod +x llava-v1.5-7b-q4.llamafile
  4. If you're on Windows, rename the file by adding ".exe" to the end.

  5. Run the llamafile, e.g.:

./llava-v1.5-7b-q4.llamafile
  6. Your browser should open automatically and display a chat interface. (If it doesn't, just open your browser and point it at http://localhost:8080.)

  7. When you're done chatting, return to your terminal and hit Control-C to shut down llamafile.

Having trouble? See the "Gotchas" section below.

JSON API Quickstart

When llamafile is started, in addition to hosting a web UI chat server at http://127.0.0.1:8080/, an OpenAI API compatible chat completions endpoint is provided too. It's designed to support the most common OpenAI API use cases, in a way that runs entirely locally. We've also extended it to include llama.cpp specific features (e.g. mirostat) that may also be used. For further details on what fields and endpoints are available, refer to both the OpenAI documentation and the llamafile server README.

Curl API Client Example

The simplest way to get started using the API is to copy and paste the following curl command into your terminal.

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
  "model": "LLaMA_CPP",
  "messages": [
      {
          "role": "system",
          "content": "You are LLAMAfile, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
      },
      {
          "role": "user",
          "content": "Write a limerick about python exceptions"
      }
    ]
}' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'

The response that's printed should look like the following:

{
   "choices" : [
      {
         "finish_reason" : "stop",
         "index" : 0,
         "message" : {
            "content" : "There once was a programmer named Mike\nWho wrote code that would often choke\nHe used try and except\nTo handle each step\nAnd his program ran without any hike.",
            "role" : "assistant"
         }
      }
   ],
   "created" : 1704199256,
   "id" : "chatcmpl-Dt16ugf3vF8btUZj9psG7To5tc4murBU",
   "model" : "LLaMA_CPP",
   "object" : "chat.completion",
   "usage" : {
      "completion_tokens" : 38,
      "prompt_tokens" : 78,
      "total_tokens" : 116
   }
}
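
The endpoint should also honor the standard OpenAI "stream" field, returning tokens as server-sent events. Here's a hedged sketch (field support may vary between llamafile versions, so treat it as illustrative):

curl -N http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
  "model": "LLaMA_CPP",
  "stream": true,
  "messages": [
      {"role": "user", "content": "Write a haiku about local inference"}
    ]
}'
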
Python API Client Example

If you've already developed your software using the openai Python package (the one published by OpenAI), then you should be able to port your app to talk to llamafile instead by changing base_url and api_key. This example assumes you've run pip3 install openai to install OpenAI's client software. Their package is just a simple Python wrapper around the OpenAI API interface, which can be implemented by any server.

#!/usr/bin/env python3
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
    api_key = "sk-no-key-required"
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)
print(completion.choices[0].message)

The above code will return a Python object like this:

ChatCompletionMessage(content='There once was a programmer named Mike\nWho wrote code that would often strike\nAn error would occur\nAnd he\'d shout "Oh no!"\nBut Python\'s exceptions made it all right.', role='assistant', function_call=None, tool_calls=None)

Other example llamafiles

We also provide example llamafiles for other models, so you can easily try out llamafile with different kinds of LLMs.

Model | Size | License | llamafile | other quants
LLaVA 1.5 | 3.97 GB | LLaMA 2 | llava-v1.5-7b-q4.llamafile | See HF repo
TinyLlama-1.1B | 2.05 GB | Apache 2.0 | TinyLlama-1.1B-Chat-v1.0.F16.llamafile | See HF repo
Mistral-7B-Instruct | 3.85 GB | Apache 2.0 | mistral-7b-instruct-v0.2.Q4_0.llamafile | See HF repo
Phi-3-mini-4k-instruct | 7.67 GB | Apache 2.0 | Phi-3-mini-4k-instruct.F16.llamafile | See HF repo
Mixtral-8x7B-Instruct | 30.03 GB | Apache 2.0 | mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile | See HF repo
WizardCoder-Python-34B | 22.23 GB | LLaMA 2 | wizardcoder-python-34b-v1.0.Q5_K_M.llamafile | See HF repo
WizardCoder-Python-13B | 7.33 GB | LLaMA 2 | wizardcoder-python-13b.llamafile | See HF repo
LLaMA-3-Instruct-70B | 37.25 GB | llama3 | Meta-Llama-3-70B-Instruct.Q4_0.llamafile | See HF repo
LLaMA-3-Instruct-8B | 5.37 GB | llama3 | Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile | See HF repo
Rocket-3B | 1.89 GB | cc-by-sa-4.0 | rocket-3b.Q5_K_M.llamafile | See HF repo
OLMo-7B | 5.68 GB | Apache 2.0 | OLMo-7B-0424.Q6_K.llamafile | See HF repo
Text Embedding Models
E5-Mistral-7B-Instruct | 5.16 GB | MIT | e5-mistral-7b-instruct-Q5_K_M.llamafile | See HF repo
mxbai-embed-large-v1 | 0.7 GB | Apache 2.0 | mxbai-embed-large-v1-f16.llamafile | See HF repo

Here is an example for the Mistral command-line llamafile:

./mistral-7b-instruct-v0.2.Q5_K_M.llamafile --temp 0.7 -p '[INST]Write a story about llamas[/INST]'

And here is an example for WizardCoder-Python command-line llamafile:

./wizardcoder-python-13b.llamafile --temp 0 -e -r '```\n' -p '```c\nvoid *memcpy_sse2(char *dst, const char *src, size_t size) {\n'

And here's an example for the LLaVA command-line llamafile:

./llava-v1.5-7b-q4.llamafile --temp 0.2 --image lemurs.jpg -e -p '### User: What do you see?\n### Assistant:'

As before, macOS, Linux, and BSD users will need to use the "chmod" command to grant execution permissions to the file before running these llamafiles for the first time.

Unfortunately, Windows users cannot make use of many of these example llamafiles because Windows has a maximum executable file size of 4GB, and all of these examples exceed that size. (The LLaVA llamafile works on Windows because it is 30MB shy of the size limit.) But don't lose heart: llamafile allows you to use external weights; this is described later in this document.

Having trouble? See the "Gotchas" section below.

How llamafile works

A llamafile is an executable LLM that you can run on your own computer. It contains the weights for a given open LLM, as well as everything needed to actually run that model on your computer. There's nothing to install or configure (with a few caveats, discussed in subsequent sections of this document).

This is all accomplished by combining llama.cpp with Cosmopolitan Libc, which provides some useful capabilities:

  1. llamafiles can run on multiple CPU microarchitectures. We added runtime dispatching to llama.cpp that lets new Intel systems use modern CPU features without trading away support for older computers.

  2. llamafiles can run on multiple CPU architectures. We do that by concatenating AMD64 and ARM64 builds with a shell script that launches the appropriate one. Our file format is compatible with WIN32 and most UNIX shells. It can also easily be converted (by either you or your users) to the platform-native format, whenever required.

  3. llamafiles can run on six OSes (macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD). If you make your own llamafiles, you'll only need to build your code once, using a Linux-style toolchain. The GCC-based compiler we provide is itself an Actually Portable Executable, so you can build your software for all six OSes from the comfort of whichever one you prefer most for development.

  4. The weights for an LLM can be embedded within the llamafile. We added support for PKZIP to the GGML library. This lets uncompressed weights be mapped directly into memory, similar to a self-extracting archive. It enables quantized weights distributed online to be prefixed with a compatible version of the llama.cpp software, thereby ensuring its originally observed behaviors can be reproduced indefinitely.

  5. Finally, with the tools included in this project you can create your own llamafiles, using any compatible model weights you want. You can then distribute these llamafiles to other people, who can easily make use of them regardless of what kind of computer they have.

Using llamafile with external weights

Even though our example llamafiles have the weights built-in, you don't have to use llamafile that way. Instead, you can download just the llamafile software (without any weights included) from our releases page. You can then use it alongside any external weights you may have on hand. External weights are particularly useful for Windows users because they enable you to work around Windows' 4GB executable file size limit.

For Windows users, here's an example for the Mistral LLM:

curl -L -o llamafile.exe https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.11/llamafile-0.8.11
curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
./llamafile.exe -m mistral.gguf

Windows users may need to change ./llamafile.exe to .\llamafile.exe when running the above command.

Gotchas and troubleshooting

On any platform, if your llamafile process is immediately killed, check if you have CrowdStrike and then ask to be whitelisted.

Mac

On macOS with Apple Silicon you need to have Xcode Command Line Tools installed for llamafile to be able to bootstrap itself.

If you use zsh and have trouble running llamafile, try saying sh -c ./llamafile. This is due to a bug that was fixed in zsh 5.9+. The same is the case for Python subprocess, old versions of Fish, etc.

Mac error "... cannot be opened because the developer cannot be verified"

  1. Immediately launch System Settings, then go to Privacy & Security. llamafile should be listed at the bottom, with a button to Allow.
  2. If not, then change your command in the Terminal to be sudo spctl --master-disable; [llama launch command]; sudo spctl --master-enable. This is because --master-disable disables all checking, so you need to turn it back on after quitting llama.

Linux

On some Linux systems, you might get errors relating to run-detectors or WINE. This is due to binfmt_misc registrations. You can fix that by adding an additional registration for the APE file format llamafile uses:

sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"

Windows

As mentioned above, on Windows you may need to rename your llamafile by adding .exe to the filename.

Also as mentioned above, Windows also has a maximum file size limit of 4GB for executables. The LLaVA server executable above is just 30MB shy of that limit, so it'll work on Windows, but with larger models like WizardCoder 13B, you need to store the weights in a separate file. An example is provided above; see "Using llamafile with external weights."

On WSL, there are many possible gotchas. One thing that helps solve them completely is this:

[Unit]
Description=cosmopolitan APE binfmt service
After=wsl-binfmt.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"

[Install]
WantedBy=multi-user.target

Put that in /etc/systemd/system/cosmo-binfmt.service.

Then run sudo systemctl enable cosmo-binfmt.

Another thing that's helped WSL users who experience issues, is to disable the WIN32 interop feature:

sudo sh -c "echo -1 > /proc/sys/fs/binfmt_misc/WSLInterop"

If you get a Permission Denied error when disabling interop through the CLI, it can be permanently disabled by adding the following to /etc/wsl.conf:

[interop]
enabled=false

Raspberry Pi

On Raspberry Pi, if you get "mmap error 12" then it means your kernel is configured with fewer than 48 bits of address space. The simplest fix is to upgrade to a Raspberry Pi 5. You can still use a Raspberry Pi 4 if you either (1) rebuild your kernel, or (2) get your SD card OS image directly from Ubuntu (don't use Raspberry Pi OS).

Supported OSes

llamafile supports the following operating systems, each requiring only a minimum stock install:

  • Linux 2.6.18+ (i.e. every distro since RHEL5 c. 2007)
  • Darwin (macOS) 23.1.0+ [1] (GPU is only supported on ARM64)
  • Windows 8+ (AMD64 only)
  • FreeBSD 13+
  • NetBSD 9.2+ (AMD64 only)
  • OpenBSD 7+ (AMD64 only)

On Windows, llamafile runs as a native portable executable. On UNIX systems, llamafile extracts a small loader program named ape to $TMPDIR/.llamafile or ~/.ape-1.9 which is used to map your model into memory.

[1] Darwin kernel versions 15.6+ should be supported, but we currently have no way of testing that.

Supported CPUs

llamafile supports the following CPUs:

  • AMD64 microprocessors must have AVX. Otherwise llamafile will print an error and refuse to run. This means that if you have an Intel CPU, it needs to be Intel Sandybridge or newer (circa 2011+), and if you have an AMD CPU, then it needs to be Bulldozer or newer (circa 2011+). Support for AVX512, AVX2, FMA, F16C, and VNNI is conditionally enabled at runtime if you have a newer CPU. (A quick way to check what your CPU reports is shown after this list.)

  • ARM64 microprocessors must have ARMv8a+. This means everything from Apple Silicon to 64-bit Raspberry Pis will work, provided your weights fit into memory.
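
If you're not sure whether your x86-64 machine has AVX, here's a hedged sketch of a quick check on Linux (it assumes /proc/cpuinfo lists CPU feature flags, which mainstream distros provide; other OSes expose this information differently):

if grep -qw avx /proc/cpuinfo; then
  echo "AVX present: llamafile can run on this CPU"
  grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u   # e.g. avx, avx2, avx512f, ...
else
  echo "No AVX flag found: llamafile will print an error and refuse to run"
fi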

GPU support

llamafile supports the following kinds of GPUs:

  • Apple Metal
  • NVIDIA
  • AMD

GPU on macOS ARM64 is supported by compiling a small module using the Xcode Command Line Tools, which need to be installed. This is a one-time cost that happens the first time you run your llamafile. The DSO built by llamafile is stored in $TMPDIR/.llamafile or $HOME/.llamafile. Offloading to GPU is enabled by default when a Metal GPU is present. This can be disabled by passing -ngl 0 or --gpu disable to force llamafile to perform CPU inference.

Owners of NVIDIA and AMD graphics cards need to pass the -ngl 999 flag to enable maximum offloading. If multiple GPUs are present then the work will be divided evenly among them by default, so you can load larger models. Multiple GPU support may be broken on AMD Radeon systems. If that happens to you, then use export HIP_VISIBLE_DEVICES=0 which forces llamafile to only use the first GPU.
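
For example, a minimal invocation along these lines (reusing the Mistral llamafile from the table above; any llamafile accepts the same flags) might look like:

./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl 999
HIP_VISIBLE_DEVICES=0 ./mistral-7b-instruct-v0.2.Q4_0.llamafile -ngl 999   # AMD Radeon: use only the first GPU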

Windows users are encouraged to use our release binaries, because they contain prebuilt DLLs for both NVIDIA and AMD graphics cards, which only depend on the graphics driver being installed. If llamafile detects that NVIDIA's CUDA SDK or AMD's ROCm HIP SDK are installed, then llamafile will try to build a faster DLL that uses cuBLAS or rocBLAS. In order for llamafile to successfully build a cuBLAS module, it needs to be run on the x64 MSVC command prompt. You can use CUDA via WSL by enabling Nvidia CUDA on WSL and running your llamafiles inside of WSL. Using WSL has the added benefit of letting you run llamafiles greater than 4GB on Windows.

On Linux, NVIDIA users will need to install the CUDA SDK (ideally using the shell script installer) and ROCm users need to install the HIP SDK. They're detected by looking to see if nvcc or hipcc are on the PATH.

If you have both an AMD GPU and an NVIDIA GPU in your machine, then you may need to qualify which one you want used, by passing either --gpu amd or --gpu nvidia.

In the event that GPU support couldn't be compiled and dynamically linked on the fly for any reason, llamafile will fall back to CPU inference.

Source installation

Developing on llamafile requires a modern version of the GNU make command (called gmake on some systems), sha256sum (otherwise cc will be used to build it), wget (or curl), and unzip, all available at https://cosmo.zip/pub/cosmos/bin/. Windows users also need the cosmos bash shell.

make -j8
sudo make install PREFIX=/usr/local
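
After installing, a quick sanity check is to ask the installed command for its help text (the --help flag is described in the Documentation section below):

llamafile --help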

Here's an example of how to generate code for a libc function using the llama.cpp command line interface, utilizing WizardCoder-Python-13B weights:

llamafile \
  -m wizardcoder-python-13b-v1.0.Q8_0.gguf \
  --temp 0 -r '}\n' -r '```\n' \
  -e -p '```c\nvoid *memcpy(void *dst, const void *src, size_t size) {\n'

Here's a similar example that instead utilizes Mistral-7B-Instruct weights for prose composition:

llamafile -ngl 9999 \
  -m mistral-7b-instruct-v0.1.Q4_K_M.gguf \
  -p '[INST]Write a story about llamas[/INST]'

Here's an example of how llamafile can be used as an interactive chatbot that lets you query knowledge contained in training data:

llamafile -m llama-65b-Q5_K.gguf -p '
The following is a conversation between a Researcher and their helpful AI assistant Digital Athena which is a large language model trained on the sum of human knowledge.
Researcher: Good morning.
Digital Athena: How can I help you today?
Researcher:' --interactive --color --batch_size 1024 --ctx_size 4096 \
--keep -1 --temp 0 --mirostat 2 --in-prefix ' ' --interactive-first \
--in-suffix 'Digital Athena:' --reverse-prompt 'Researcher:'

Here's an example of how you can use llamafile to summarize HTML URLs:

(
  echo '[INST]Summarize the following text:'
  links -codepage utf-8 \
        -force-html \
        -width 500 \
        -dump https://www.poetryfoundation.org/poems/48860/the-raven |
    sed 's/   */ /g'
  echo '[/INST]'
) | llamafile -ngl 9999 \
      -m mistral-7b-instruct-v0.2.Q5_K_M.gguf \
      -f /dev/stdin \
      -c 0 \
      --temp 0 \
      -n 500 \
      --no-display-prompt 2>/dev/null

Here's how you can use llamafile to describe a jpg/png/gif/bmp image:

llamafile -ngl 9999 --temp 0 \
  --image ~/Pictures/lemurs.jpg \
  -m llava-v1.5-7b-Q4_K.gguf \
  --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
  -e -p '### User: What do you see?\n### Assistant: ' \
  --no-display-prompt 2>/dev/null

It's possible to use BNF grammar to enforce that the output is predictable and safe to use in your shell script. The simplest grammar would be --grammar 'root ::= "yes" | "no"', which forces the LLM to print to standard output only "yes\n" or "no\n". Another example: if you wanted to write a script to rename all your image files, you could say:

llamafile -ngl 9999 --temp 0 \
    --image lemurs.jpg \
    -m llava-v1.5-7b-Q4_K.gguf \
    --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
    --grammar 'root ::= [a-z]+ (" " [a-z]+)+' \
    -e -p '### User: What do you see?\n### Assistant: ' \
    --no-display-prompt 2>/dev/null |
  sed -e's/ /_/g' -e's/$/.jpg/'
a_baby_monkey_on_the_back_of_a_mother.jpg
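
And here's a hedged sketch of using the yes/no grammar mentioned above inside a shell conditional; the image filename and question are placeholders:

answer=$(llamafile -ngl 9999 --temp 0 \
    --image photo.jpg \
    -m llava-v1.5-7b-Q4_K.gguf \
    --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
    --grammar 'root ::= "yes" | "no"' \
    -e -p '### User: Is there an animal in this photo?\n### Assistant: ' \
    --no-display-prompt 2>/dev/null)
if [ "$answer" = "yes" ]; then
  echo "photo.jpg contains an animal"
fi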

Here's an example of how to run llama.cpp's built-in HTTP server. This example uses LLaVA v1.5-7B, a multimodal LLM that works with llama.cpp's recently-added support for image inputs.

llamafile -ngl 9999 \
  -m llava-v1.5-7b-Q8_0.gguf \
  --mmproj llava-v1.5-7b-mmproj-Q8_0.gguf \
  --host 0.0.0.0

The above command will launch a browser tab on your personal computer to display a web interface. It lets you chat with your LLM and upload images to it.

Creating llamafiles

If you want to be able to just say:

./llava.llamafile

...and have it run the web server without having to specify arguments, then you can embed both the weights and a special .args file inside, which specifies the default arguments. First, let's create a file named .args with this content:

-m
llava-v1.5-7b-Q8_0.gguf
--mmproj
llava-v1.5-7b-mmproj-Q8_0.gguf
--host
0.0.0.0
-ngl
9999
...

As we can see above, there's one argument per line. The ... argument optionally specifies where any additional CLI arguments passed by the user are to be inserted. Next, we'll add both the weights and the argument file to the executable:

cp /usr/local/bin/llamafile llava.llamafile

zipalign -j0 \
  llava.llamafile \
  llava-v1.5-7b-Q8_0.gguf \
  llava-v1.5-7b-mmproj-Q8_0.gguf \
  .args

./llava.llamafile

Congratulations. You've just made your own LLM executable that's easy to share with your friends.
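
Because the .args file above ends with ..., any extra flags you pass on the command line are inserted at that position. For example, to launch the same server with a 4096-token context window (using the -c flag seen in the earlier examples):

./llava.llamafile -c 4096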

Distribution

One good way to share a llamafile with your friends is by posting it on Hugging Face. If you do that, then it's recommended that you mention in your Hugging Face commit message what git revision or released version of llamafile you used when building your llamafile. That way everyone online will be able to verify the provenance of its executable content. If you've made changes to the llama.cpp or cosmopolitan source code, then the Apache 2.0 license requires you to explain what changed. One way you can do that is by embedding a notice in your llamafile using zipalign that describes the changes, and mentioning it in your Hugging Face commit.
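
Here's a hedged sketch of what embedding such a notice might look like; the notice filename, wording, and llamafile name are purely illustrative:

echo "Built with llamafile <revision>; local changes: <summary>" >notice.txt
zipalign -j0 my-model.llamafile notice.txt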

Documentation

There's a manual page for each of the llamafile programs installed when you run sudo make install. The command manuals are also typeset as PDF files that you can download from our GitHub releases page. Lastly, most commands will display that information when passing the --help flag.

Running llamafile with models downloaded by third-party applications

This section answers the question "I already have a model downloaded locally by application X, can I use it with llamafile?". The general answer is "yes, as long as those models are locally stored in GGUF format" but its implementation can be more or less hacky depending on the application. A few examples (tested on a Mac) follow.

LM Studio

LM Studio stores downloaded models in ~/.cache/lm-studio/models, in subdirectories named after the models (following Hugging Face's account_name/model_name format), with the same filename you saw when you chose to download the file.

So if you have downloaded e.g. the llama-2-7b.Q2_K.gguf file for TheBloke/Llama-2-7B-GGUF, you can run llamafile as follows:

cd ~/.cache/lm-studio/models/TheBloke/Llama-2-7B-GGUF
llamafile -m llama-2-7b.Q2_K.gguf

Ollama

When you download a new model with ollama, all its metadata will be stored in a manifest file under ~/.ollama/models/manifests/registry.ollama.ai/library/. The directory and manifest file name are the model name as returned by ollama list. For instance, for llama3:latest the manifest file will be named ~/.ollama/models/manifests/registry.ollama.ai/library/llama3/latest.

The manifest maps each file related to the model (e.g. GGUF weights, license, prompt template, etc) to a sha256 digest. The digest corresponding to the element whose mediaType is application/vnd.ollama.image.model is the one referring to the model's GGUF file.

Each sha256 digest is also used as a filename in the ~/.ollama/models/blobs directory (if you look into that directory you'll see only those sha256-* filenames). This means you can directly run llamafile by passing the sha256 digest as the model filename. So if e.g. the llama3:latest GGUF file digest is sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29, you can run llamafile as follows:

cd ~/.ollama/models/blobs
llamafile -m sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
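
If you'd rather not read the manifest by hand, here's a hedged sketch that extracts the digest with jq; it assumes the manifest is plain JSON with a layers array and that digests are written as sha256:<hex>, which may differ between Ollama versions:

cd ~/.ollama/models
digest=$(jq -r '.layers[] | select(.mediaType == "application/vnd.ollama.image.model") | .digest' \
  manifests/registry.ollama.ai/library/llama3/latest | tr ':' '-')
llamafile -m "blobs/$digest"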

Technical details

Here is a succinct overview of the tricks we used to create the fattest executable format ever. The long story short is that llamafile is a shell script that launches itself and runs inference on embedded weights in milliseconds without needing to be copied or installed. What makes that possible is mmap(). Both the llama.cpp executable and the weights are concatenated onto the shell script. A tiny loader program is then extracted by the shell script, which maps the executable into memory. The llama.cpp executable then opens the shell script again as a file, and calls mmap() again to pull the weights into memory and make them directly accessible to both the CPU and GPU.

ZIP weights embedding

The trick to embedding weights inside llama.cpp executables is to ensure the local file is aligned on a page size boundary. That way, assuming the zip file is uncompressed, once it's mmap()'d into memory we can pass pointers directly to GPUs like Apple Metal, which require that data be page size aligned. Since no existing ZIP archiving tool has an alignment flag, we had to write about 500 lines of code to insert the ZIP files ourselves. However, once there, every existing ZIP program should be able to read them, provided they support ZIP64. This makes the weights much more easily accessible than they otherwise would have been, had we invented our own file format for concatenated files.
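
For example, any ZIP64-capable unzip should be able to list a llamafile's contents, which is a handy way to see which weights and other assets it carries:

unzip -l llava-v1.5-7b-q4.llamafile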

Microarchitectural portability

On Intel and AMD microprocessors, llama.cpp spends most of its time in the matmul quants, which are usually written thrice for SSSE3, AVX, and AVX2. llamafile pulls each of these functions out into a separate file that can be #included multiple times, with varying __attribute__((__target__("arch"))) function attributes. Then, a wrapper function is added which uses Cosmopolitan's X86_HAVE(FOO) feature to dispatch at runtime to the appropriate implementation.

Architecture portability

llamafile solves architecture portability by building llama.cpp twice: once for AMD64 and again for ARM64. It then wraps them with a shell script which has an MZ prefix. On Windows, it'll run as a native binary. On Linux, it'll extract a small 8kb executable called APE Loader to ${TMPDIR:-${HOME:-.}}/.ape that'll map the binary portions of the shell script into memory. It's possible to avoid this process by running the assimilate program that comes included with the cosmocc compiler. What the assimilate program does is turn the shell script executable into the host platform's native executable format. This guarantees a fallback path exists for traditional release processes when it's needed.
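
As a hedged sketch of that conversion (assuming cosmocc's bin directory is on your PATH and that assimilate rewrites the file in place, so it's wise to work on a copy):

cp llava.llamafile llava-native      # keep the portable original
assimilate llava-native              # convert the copy to this host's native executable format
./llava-native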

GPU support

Cosmopolitan Libc uses static linking, since that's the only way to get the same executable to run on six OSes. This presents a challenge for llama.cpp, because it's not possible to statically link GPU support. The way we solve that is by checking if a compiler is installed on the host system. For Apple, that would be Xcode, and for other platforms, that would be nvcc. llama.cpp has a single-file implementation of each GPU module, named ggml-metal.m (Objective-C) and ggml-cuda.cu (Nvidia C). llamafile embeds those source files within the zip archive and asks the platform compiler to build them at runtime, targeting the native GPU microarchitecture. If that works, the module is then linked with the platform C library's dlopen() implementation. See llamafile/cuda.c and llamafile/metal.c.

In order to use the platform-specific dlopen() function, we need to ask the platform-specific compiler to build a small executable that exposes these interfaces. On ELF platforms, Cosmopolitan Libc maps this helper executable into memory along with the platform's ELF interpreter. The platform C library then takes care of linking all the GPU libraries, and then runs the helper program which longjmp()'s back into Cosmopolitan. The executable program is now in a weird hybrid state where two separate C libraries exist which have different ABIs. For example, thread local storage works differently on each operating system, and programs will crash if the TLS register doesn't point to the appropriate memory. The way Cosmopolitan Libc solves that on AMD64 is by using SSE to recompile the executable at runtime to change %fs register accesses into %gs, which takes a millisecond. On ARM, Cosmo uses the x28 register for TLS, which can be made safe by passing the -ffixed-x28 flag when compiling GPU modules. Lastly, llamafile uses the __ms_abi__ attribute so that function pointers passed between the application and GPU modules conform to the Windows calling convention. Amazingly enough, every compiler we tested, including nvcc on Linux and even Objective-C on macOS, supports compiling WIN32-style functions, thus ensuring your llamafile will be able to talk to Windows drivers, when it's run on Windows, without needing to be recompiled as a separate file for Windows. See cosmopolitan/dlopen.c for further details.

A note about models

The example llamafiles provided above should not be interpreted as endorsements or recommendations of specific models, licenses, or data sets on the part of Mozilla.

Security

llamafile adds pledge() and SECCOMP sandboxing to llama.cpp. This is enabled by default. It can be turned off by passing the --unsecure flag. Sandboxing is currently only supported on Linux and OpenBSD on systems without GPUs; on other platforms it'll simply log a warning.

Our approach to security has these benefits:

  1. After it starts up, your HTTP server isn't able to access the filesystem at all. This is good, since it means if someone discovers a bug in the llama.cpp server, then it's much less likely they'll be able to access sensitive information on your machine or make changes to its configuration. On Linux, we're able to sandbox things even further; the only networking-related system call the HTTP server will be allowed to use after starting up is accept(). That further limits an attacker's ability to exfiltrate information, in the event that your HTTP server is compromised.

  2. The main CLI command won't be able to access the network at all. This is enforced by the operating system kernel. It also won't be able to write to the file system. This keeps your computer safe in the event that a bug is ever discovered in the GGUF file format that lets an attacker craft malicious weights files and post them online. The only exception to this rule is if you pass the --prompt-cache flag without also specifying --prompt-cache-ro. In that case, security currently needs to be weakened to allow cpath and wpath access, but network access will remain forbidden.

Therefore your llamafile is able to protect itself against the outside world, but that doesn't mean you're protected from llamafile. Sandboxing is self-imposed. If you obtained your llamafile from an untrusted source then its author could have simply modified it to not do that. In that case, you can run the untrusted llamafile inside another sandbox, such as a virtual machine, to make sure it behaves how you expect.

Licensing

While the llamafile project is Apache 2.0-licensed, our changes to llama.cpp are licensed under MIT (just like the llama.cpp project itself) so as to remain compatible and upstreamable in the future, should that be desired.

The llamafile logo on this page was generated with the assistance of DALL·E 3.

[Star History Chart]

llamafile's People

Contributors

ahgamut, chand1012, eltociear, epicfilemcnulty, ggerganov, ikawrakow, ionline247, jart, jeromew, josher19, k8si, marcusdunn, mardak, matthewbcool, mhyrzt, mofosyne, monatis, mrdomino, mrlinks75, ngxson, phoboslab, rasmith, recursionbane, stlhood, tarcey, thepersonoffun, veekaybee, vwoloszyn, zibri, ziedbha


llamafile's Issues

Feature request: Information on required hardware

I am something of an LLM newbie, but I love this project. However, both the systems I own are very low end - a laptop running Linux on an AMD Ryzen 3 3250U with Radeon Graphics, and a Raspberry Pi 4B. I suspect both of these would be waaaay too low end to run llamafile.

But it would be great if there could be a few lines in the README saying what the minimum hardware might be?

v0.2 does not start on Windows 10, ResourceUnavailable

llamafile-server-0.2.exe does not start in Windows 10, powershell output:

PowerShell 7.3.10
PS C:\Users\***\Downloads> .\llamafile-server-0.2.exe
ResourceUnavailable: Program 'llamafile-server-0.2.exe' failed to run: An error occurred trying to start process 'C:\Users\***\Downloads\llamafile-server-0.2.exe' with working directory 'C:\Users\***\Downloads'. The specified executable is not a valid application for this OS platform.At line:1 char:1
+ .\llamafile-server-0.2.exe
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~.

Or, when just opening it directly, a blue error dialog appears:

[screenshot of the error dialog]

v0.1 opens correctly in PowerShell, and when executing it directly it opens/closes as expected without error.

llava not using correct system prompt and/or settings

When I launch the current llava-v1.5-7b-q4-server.llamafile, I see a system prompt and default settings that differ from what llava uses for training and inference.

[screenshot of the llava server web UI showing the default system prompt and settings]

Specifically, I believe the default prompt for llava-v1.5 is A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. with a user name of USER and a bot name of ASSISTANT.

Additionally I noticed the settings also seem to differ from what the llava official demo is using - for example temperature is 0.7 instead of 0.2, Top-P is 0.5 instead of 0.7, etc. These are probably not as important as updating the system prompt, but I just thought I would mention it as something to check.

GPU support crashing on Linux in 0.2 releases

When I run llamafile on my system, the model loads fine into my GPU VRAM, however whenever I try to send a prompt llamafile crashes with the following error:

slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 53 tokens
slot 0 : kv cache rm - [0, end)

CUDA error 304 at /home/garcia/.llamafile/ggml-cuda.cu:6006: OS call failed or operation not supported on this OS
current device: 0

This error happens even when -ngl is not set.

Here is some info about my system:

$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] (rev a1)
$ uname -r
6.6.2-101.fc38.x86_64
$ cat /sys/module/nvidia/version
545.29.06
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

Security

Llamafile is a great convenience, bundling the inference code with the weights. However, in its current form it offers less security to users than the use of untrusted safetensors/gguf weights plus a separately downloaded (trusted) model implementation. If llamafile takes off, users will be executing random executables downloaded from HF and generated by random people, presenting a security hole through people including backdoored llama.cpp implementations.

What would be the best way to address these security concerns?

  • Ultimately, if you bundle the inference code and weights, there is no way to verify anything (since the verifier must be bundled too..), so maybe just educating users is all one can do.
  • Huggingface could check if .llamafiles are generated correctly, by introspecting the binary to see it's just llama.cpp + the weights, but this would add overhead for keeping up with llama.cpp updates. It could also provide some autoconversion from gguf to llamafile, ensuring the security of the output.
  • One could add norms for verifying signatures of downloaded llamafiles (e.g making the most common copy-pasted commands verify signatures against some PKI), or include a package manager, but the former allows attacks via mis-authentication and the latter via not being the "most convenient route".
  • The most secure but least convenient outcome would be to have norms of llama.cpp distributed as an APE and have the .gguf downloaded separately, replacing instances of wget huggingface.com/.../xyz.llamafile && ./xyz.llamafile with curl llamaup.com/up | bash and llamafile huggingface.com/.../xyz.gguf, like the "separately downloaded llamafile server" example in the docs.

llamafile v0.2 on Windows 11: Edge browser detects it as a virus

I could not download it with my Microsoft Edge browser, as it was complaining about a virus.

So I downloaded it with wget.exe, and while running it from a PowerShell terminal it gives the following error:

"Program 'llamafile-server-0.2.exe' failed to run: Operation did not complete successfully because the file contains a
virus or potentially unwanted softwareAt line:1 char:1

  • .\llamafile-server-0.2.exe

At line:1 char:1

  • .\llamafile-server-0.2.exe
  •   + CategoryInfo          : ResourceUnavailable: (:) [], ApplicationFailedException
      + FullyQualifiedErrorId : NativeCommandFailed
    

"

Getting "illegal hardware instruction" error running on M1 MacBook Pro

When I download and run the llamafile from https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4-server.llamafile\?download\=true I get an error like below.

[1] 96405 illegal hardware instruction ./llamafile

Running the executable with the --help option seems to work fine

NixOS support

llamafile does not work on NixOS, as it seems to assume the location of the mkdir binary.

./llava-v1.5-7b-q4-server.llamafile: line 60: /bin/mkdir: No such file or directory

A workaround is to use steam-run

Support for whisper.cpp

Any chance of adding support for whisper.cpp? I know whisper.cpp is still stuck with the GGML format instead of GGUF, but it would be great to have portable whisper binaries that just work.

Failed to build CUDA

Getting a bunch of the following errors when it's compiling the CUDA kernel on Windows 11:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\include\cuda_fp16.hpp(1906): error: asm operand type size(8) does not match type/size implied by constraint 'r' 
    asm ("ld.global.nc.b32 %0, [%1];"  : "=r"(*(reinterpret_cast<unsigned int *>(&(ret)))) : "r"(ptr));

Any ideas?

Edit: Not sure if it matters, but also seeing some warning before that:

cl : Command line warning D9002 : ignoring unknown option '-fPIC'
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-mtune=native'

[question] Train my own models

Hi, is there a way to train my own model and include it with one of the base models?
Example llava-v1.5-7b-q4-server.llamafile --external-model my_model.model

Process gets killed immediately by CrowdStrike

Running on macOS 14.1.1; Xcode is installed:

$ xcode-select --install
xcode-select: note: Command line tools are already installed. Use "Software Update" in System Settings or the softwareupdate command line interface to install updates

I downloaded just the llamafile software, following https://github.com/Mozilla-Ocho/llamafile#binary-instructions. Running it results in process getting killed:

$ ./llamafile --help
[1]    4806 killed     ./llamafile --help

Any ideas on how to debug it?

Guide for getting models in the right format?

As a beginner in ML I would really appreciate some help in converting models to the correct format for llamafile, like a ready-made script or a guide for how to convert the common formats, or maybe a tool like llamafile that can perform the conversion on first use.

I think the idea behind the portable, stand alone application is great, but for beginners it's still limited by the fact that most models are not in the format needed.

Support broken on old Intel/AMD CPUs

Hi,

lscpu gives.....

> lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         36 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
    CPU family:          6
    Model:               42
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            7
    CPU(s) scaling MHz:  47%
    CPU max MHz:         3400.0000
    CPU min MHz:         1600.0000
    BogoMIPS:            6186.50
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
                         cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    6 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-3
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Unknown: No mitigations
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

after

chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4

Output is....

./llamafile-server-0.1-llava-v1.5-7b-q4
warning: couldn't find nvcc (nvidia c compiler) try setting $CUDA_PATH if it's installed
{"timestamp":1701413885,"level":"INFO","function":"main","line":2258,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1701413885,"level":"INFO","function":"main","line":2261,"message":"system info","n_threads":4,"n_threads_batch":-1,"total_threads":4,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
Multi Modal Mode Enabled
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    377
clip_model_load: n_kv:         19
clip_model_load: ftype:        q4_0

clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: model size:     169.31 MB
clip_model_load: metadata size:  0.14 MB
clip_model_load: total allocated memory: 201.77 MB
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llava-v1.5-7b-Q4_K.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   16:            blk.1.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   25:            blk.2.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   34:            blk.3.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   43:            blk.4.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   52:            blk.5.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   55:              blk.6.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   56:              blk.6.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   58:         blk.6.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   61:            blk.6.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   64:              blk.7.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   65:              blk.7.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   67:         blk.7.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   70:            blk.7.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   73:              blk.8.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   74:              blk.8.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   76:         blk.8.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   79:            blk.8.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   82:              blk.9.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   83:              blk.9.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   85:         blk.9.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   88:            blk.9.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   91:             blk.10.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   92:             blk.10.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   94:        blk.10.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   97:           blk.10.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  100:             blk.11.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  101:             blk.11.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  103:        blk.11.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  106:           blk.11.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  109:             blk.12.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  110:             blk.12.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  112:        blk.12.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  114:             blk.12.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  115:           blk.12.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  118:             blk.13.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  119:             blk.13.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  121:        blk.13.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  123:             blk.13.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  124:           blk.13.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  127:             blk.14.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  128:             blk.14.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  130:        blk.14.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  132:             blk.14.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  133:           blk.14.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  136:             blk.15.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  137:             blk.15.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  139:        blk.15.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  142:           blk.15.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  145:             blk.16.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  146:             blk.16.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  148:        blk.16.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  151:           blk.16.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  154:             blk.17.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  155:             blk.17.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  157:        blk.17.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  160:           blk.17.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  163:             blk.18.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  164:             blk.18.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  166:        blk.18.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  169:           blk.18.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  172:             blk.19.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  173:             blk.19.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  175:        blk.19.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  178:           blk.19.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  181:             blk.20.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  182:             blk.20.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  184:        blk.20.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  187:           blk.20.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  190:             blk.21.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  191:             blk.21.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  193:        blk.21.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  196:           blk.21.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  199:             blk.22.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  200:             blk.22.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  202:        blk.22.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  208:             blk.23.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  209:             blk.23.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  211:        blk.23.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  214:           blk.23.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  217:             blk.24.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  218:             blk.24.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  220:        blk.24.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  223:           blk.24.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  226:             blk.25.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  227:             blk.25.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  229:        blk.25.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  232:           blk.25.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  235:             blk.26.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  236:             blk.26.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_v.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  238:        blk.26.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  241:           blk.26.ffn_down.weight q4_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  244:             blk.27.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  245:             blk.27.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  247:        blk.27.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  250:           blk.27.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  253:             blk.28.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  254:             blk.28.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  256:        blk.28.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  259:           blk.28.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  262:             blk.29.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  263:             blk.29.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  265:        blk.29.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  268:           blk.29.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  271:             blk.30.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  277:           blk.30.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  280:             blk.31.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  281:             blk.31.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  283:        blk.31.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  286:           blk.31.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  290:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                          general.file_type u32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv  18:               general.quantization_version u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: mem required  = 3891.35 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1024.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 162.63 MB
zsh: illegal hardware instruction  ./llamafile-server-0.1-llava-v1.5-7b-q4

llama.log contains:

[1701413885] Multi Modal Mode Enabled
[1701413885] warming up the model with an empty run

GGML_ASSERT(ggml-metal.m:1645): false

I have Xcode installed, but I'm getting this on my M1 (8 GB):

llm_load_tensors: VRAM used: 0.00 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  400.00 MB
llama_build_graph: non-view tensors processed: 924/924
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: loading '/var/folders/_0/00tff1yd64lf3vxzhmldx8100000gn/T/.llamafile/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  5461.34 MB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 81.63 MB
llama_new_context_with_model: max tensor size =   128.18 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  4096.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3533.78 MB, offs =   4160536576, ( 7630.41 /  5461.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   400.02 MB, ( 8030.42 /  5461.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =    75.02 MB, ( 8105.44 /  5461.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_graph_compute: command buffer 2 failed with status 5
GGML_ASSERT: /var/folders/_0/00tff1yd64lf3vxzhmldx8100000gn/T//.llamafile/ggml-metal.m:1645: false
zsh: abort      sh -c ./wizardcoder13b --host 0.0.0.0

Maybe not enough RAM?

HaikuOS

How do I run this on Haiku?

Maybe on SerenityOS too?

Distribute Docker Images

  1. Dropping the built-in UI and concentrating instead on maintaining great OpenAI API compatibility? I ask because there are already quite a few GUI projects that talk to the OpenAI API, so it could perhaps save you some hassle. Examples include Text Gen Web UI, ChatbotUI, BigAGI-UI, etc.

  2. Offering a public Docker image so developers and admins running Kubernetes can more easily package up and deploy this excellent project for their platforms? Projects like LocalAI, TextGenWebUI and Ollama do this. (A rough sketch of what that could look like is below.)

If you're not into these ideas, no problem - you do you!

Thanks again for what you created here, it's great!
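For what it's worth, a community image along these lines seems feasible today just by wrapping an existing llamafile. A rough, untested sketch (base image, file name and port are placeholders; launching through sh avoids relying on the kernel recognizing the APE format inside the container):

cat > Dockerfile <<'EOF'
FROM debian:bookworm-slim
# copy an existing llamafile into the image (file name is a placeholder)
COPY llava-v1.5-7b-q4.llamafile /app/model.llamafile
RUN chmod +x /app/model.llamafile
EXPOSE 8080
# launch through sh so the polyglot APE header is handled portably
CMD ["/bin/sh", "-c", "/app/model.llamafile --host 0.0.0.0 --port 8080"]
EOF
docker build -t llamafile-llava .
docker run -p 8080:8080 llamafile-llava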

Silent mode

Hi,
Can I make my llamafile run silently and print nothing to the CLI except the generated text?
Thank you
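Not an official flag, but one workaround sketch: in CLI mode the generated text is written to stdout while most of the loader and log chatter goes to stderr (at least in the llama.cpp code that llamafile wraps), so redirecting stderr gets close to a silent run:

# sketch: suppress loader/log output, keep only the generation on stdout
./mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile -p "Write a haiku about silence" 2>/dev/null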

Running on Ubuntu with no GPU

Ubuntu 22.04, 32 GB RAM, i7 CPU.
I downloaded the mistral-7b server tarball and learned I needed to install the CUDA toolkit.
It fails to find a GPU - there is none.
I'm thinking there must be some command-line parameter to allow it to run in a default no-GPU mode.
The README says this:
In the event that GPU support couldn't be compiled and dynamically linked on the fly for any reason, llamafile will fall back to CPU inference.

It's not doing that. Perhaps the issue lies in booting the server rather than the CLI version?
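One thing worth trying (a sketch based on the underlying llama.cpp option, assuming the server passes it through): explicitly request zero GPU layers so inference stays entirely on the CPU, which also makes the missing CUDA toolkit irrelevant:

./mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile --n-gpu-layers 0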

Add history feature in chat

It would be good to add a history feature for LLM chat, specifically in server mode, backed by SQLite.
It is a simple feature to add, but very useful; all the other LLM chat software has it. (A rough sketch follows below.)
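Purely as an illustration of how small the storage side would be (table and column names are made up; this just drives the sqlite3 CLI):

sqlite3 chat_history.db <<'EOF'
CREATE TABLE IF NOT EXISTS messages (
  id         INTEGER PRIMARY KEY AUTOINCREMENT,
  session    TEXT NOT NULL,
  role       TEXT NOT NULL,   -- 'user' or 'assistant'
  content    TEXT NOT NULL,
  created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO messages (session, role, content)
  VALUES ('demo', 'user', 'Write a limerick about python exceptions');
SELECT role, content FROM messages WHERE session = 'demo' ORDER BY id;
EOF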

iPhone SDK

Downloaded llava-v1.5-7b-q4-server.llamafile on an Apple M1 MacBook Pro with Xcode and got a fatal error on execution:

➜  Downloads ./llava-v1.5-7b-q4-server.llamafile
/var/folders/x7/kf3hxzfx5ld0665pj3r1npk00000gn/T//.ape-1.9.c:32:10: fatal error: 'sys/random.h' file not found
#include <sys/random.h>
         ^~~~~~~~~~~~~~
1 error generated.

Not sure what might be odd about my system, other than maybe that I've been doing iOS development and xcrun shows this SDK path:

➜  Downloads xcrun --show-sdk-path                                                                                               
/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator17.0.sdk

I'm not a C / C++ daily developer, so this may not be the most helpful info. 😅

OpenAI /completions route fails

Hello,

Thank you for the new release including the OpenAI routes, but when I try it, it always returns the following error with the raw request from the README.md:

llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed (cosmoaddr2line /Users/maxime-georide/Downloads/llamafile-server-0.2 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "system",
    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
    "role": "user",
    "content": "Write a limerick about python exceptions"
}
]
}'

I'm on M1 with mistral-7b-instruct-v0.1.Q5_K_M.gguf and llama-2-7b-chat.Q5_K_S.gguf models.

I didn't try with the OpenAI SDK.

library not found: failed to load library - Windows 11, GPU, CUDA 12.3

Hi, thank you for the great project!

I managed to run various models using the CPU, but when using the GPU I hit a "library not found" issue. Not sure if I am doing something wrong, but after installing nvcc and adding the path to cl.exe, the libraries seem to have compiled, yet then the error happens. Here is a longer log:

C:\work\llamafile>llamafile.exe -m starling-lm-7b-alpha.Q4_K_M.gguf
extracting /zip/llama.cpp/ggml.h to ./.llamafile/ggml.h
extracting /zip/llamafile/compcap.cu to ./.llamafile/compcap.cu
extracting /zip/llamafile/llamafile.h to ./.llamafile/llamafile.h
extracting /zip/llama.cpp/ggml-cuda.h to ./.llamafile/ggml-cuda.h
extracting /zip/llama.cpp/ggml-cuda.cu to ./.llamafile/ggml-cuda.cu
building ggml-cuda with nvcc -arch=native...
ggml-cuda.cu
cl : Command line warning D9002 : ignoring unknown option '-fPIC'
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-mtune=native'
ggml-cuda.cu
cl : Command line warning D9002 : ignoring unknown option '-fPIC'
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-mtune=native'
ggml-cuda.cu
cl : Command line warning D9002 : ignoring unknown option '-fPIC'
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-mtune=native'
tmpxft_00000970_00000000-11_ggml-cuda.cudafe1.cpp
cl : Command line warning D9002 : ignoring unknown option '-fPIC'
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-mtune=native'
cl : Command line warning D9002 : ignoring unknown option '-fPIC'
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-mtune=native'
   Creating library .\.llamafile\ggml-cuda.dll.lib and object .\.llamafile\ggml-cuda.dll.exp
**library not found: failed to load library**
{"timestamp":1701310901,"level":"INFO","function":"main","line":2258,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1701310901,"level":"INFO","function":"main","line":2261,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":16,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from starling-lm-7b-alpha.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32002,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32 ...

Any thoughts on what could be wrong?

Thank you!!

Draws 100% of one CPU even without input

Executing e.g. ./llamafile-server-0.1-llava-v1.5-7b-q4 or ./llamafile-server-0.1 --model mistral-7b-instruct-v0.1.Q4_K_M.gguf brings one CPU thread to 100% in both cases.

Please tell me what information would be helpful to collect.

System: Debian GNU/Linux 12
Linux xyz 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
CPU:
*-cpu
product: AMD Ryzen 7 PRO 6850U with Radeon Graphics
vendor: Advanced Micro Devices [AMD]
physical id: 1
bus info: cpu@0
version: 25.68.1
size: 2096MHz
capacity: 4767MHz
width: 64 bits
capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2
ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq
monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misa
lignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba
ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1
xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc
_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rd
pid overflow_recov succor smca fsrm cpufreq
configuration: microcode=171983106

Startup message:
warning: couldn't find nvcc (nvidia c compiler) try setting $CUDA_PATH if it's installed
{"timestamp":1701331750,"level":"INFO","function":"main","line":2258,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1701331750,"level":"INFO","function":"main","line":2261,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":16,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}

Easiest way to convert safetensors (or do you even have to?)

This project looks very exciting! It looks like it would make experimentation extremely easy.

How would I wire up one of the more recent models (like https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/tree/main) - do I download the three safetensor files and then... convert them? Pass them to the "-m" command line?

(Sorry about the potentially basic question, but making the LLM runner this friendly is going to bring us out of the woodwork!)
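You don't pass safetensors to -m directly; -m expects a GGUF file. The usual route (a sketch only; script and binary names vary across llama.cpp versions, and you need to build llama.cpp first to get the quantize tool) is to convert the Hugging Face checkpoint to GGUF, quantize it, and then hand the result to llamafile:

git clone https://github.com/ggerganov/llama.cpp
python3 llama.cpp/convert.py ./Starling-LM-7B-alpha --outfile starling-lm-7b-alpha-f16.gguf
# the quantize binary is built from the llama.cpp sources (e.g. via make)
./llama.cpp/quantize starling-lm-7b-alpha-f16.gguf starling-lm-7b-alpha.Q4_K_M.gguf Q4_K_M
./llamafile -m starling-lm-7b-alpha.Q4_K_M.gguf

Many models are also published pre-quantized as .gguf files on Hugging Face, in which case you can skip the conversion entirely and pass the downloaded file to -m.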

How to use Mistral for news summarization

Hi,

First of all, thanks for the project, I love it. The LLaVA model works well for me, but I think I'm doing something wrong with Mistral when trying to use GGUF models.

I'm trying to get it to summarise a sports article. I tried it with Ollama as a baseline:

$ cat Modelfile
FROM ./mistral-7b-instruct-v0.1.Q4_K_M.gguf
$ ollama create mistral-gguf -f Modelfile
parsing modelfile
looking for model
creating model layer
creating config layer
using already created layer sha256:14466f9d658bf4a79f96c3f3f22759707c291cac4e62fea625e80c7d32169991
using already created layer sha256:38424ae5ffc5c41a9df1e3e23bfc5307d8cb540b1a4b29f01bbcaba1b380aad8
writing manifest
removing any unused layers
success
$ ollama run mistral-gguf
>>>
>>> The user will provide text that you need to summarize in 3 bullet points. Only use the context provided, don't make things up: Serbia's Hamad Medjedo
... vic became the sixth winner of the Next Gen ATP Finals after a five-set win over France's Arthur Fils.
...
... The 20-year-old world number 111 recovered from letting two match points slip in the fourth set before winning 3-4(6-8) 4-1 4-2 3-4(9-11) 4-1.
...
... The tournament is the season-ending event for the top-ranked male players aged 21 and under.
...
... The event has a best-of-five format with four games winning a set.
...
... Former champions include current top 10 players Carlos Alcaraz, Jannik Sinner and Stefanos Tsitsipas.
...
... This season, Medjedovic reached tour-level semi-finals in Gstaad and Astana and won three ATP Challenger Tour events.
...
... He is the lowest-ranked champion in tournament history but converted his third match point to get past his 19-year-old opponent, ranked 36 in the wor
... ld.
...
... "I can't believe I have won this title but it's going to give me a lot of confidence for 2024," he said.
...
... "Arthur is an amazing player - he's top 40 for a reason - so I'm really happy.
...
... "It was tough after the first set. I changed my clothes and recovered and started to play good again. I didn't play good when I had match points in t
... he fourth set. I wasn't relaxed, I was very stiff.
...
... "Thank God I recovered and I was just trying to stay relaxed as much as I could and I managed to do it in the end."
...
... Medjedovic received messages of support during the week from compatriot Novak Djokovic, who clinched a record seventh ATP Finals title in Turin last
... month.
...
... "Two of us from Serbia. He won the big Masters, the real one, and I won the Next Gen. Obviously, it's a huge thing and I'm happy to follow in his foo
... tsteps in some way," he added.
...
...


- Hamad Medjedovic becomes the sixth winner of the Next Gen ATP Finals after a five-set win over France’s Arthur Fils.
- Medjedovic recovered from letting two match points slip in the fourth set before winning the tournament.
- The 20-year-old world number 111 is the lowest-ranked champion in tournament history.

This is what happens when I try with llamafile:

$ ./mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile

warning: couldn't find nvcc (nvidia c compiler) try setting $CUDA_PATH if it's installed
{"timestamp":1701639852,"level":"INFO","function":"main","line":2650,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1701639852,"level":"INFO","function":"main","line":2653,"message":"system info","n_threads":5,"n_threads_batch":-1,"total_threads":10,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
Apple Metal GPU support successfully loaded
...
llama server listening at http://127.0.0.1:8080

I didn't change any of the defaults in the web UI; I only pasted the same prompt as above.

[screenshot of the web UI output, 2023-12-03 21:52]

Any idea what I'm doing wrong?

Cheers, Mark

GPU acceleration: "--n-gpu-layers 35" - is 35 good? Should I try different numbers?

I missed the specific commands to enable GPU acceleration, and when I noticed and tried them, it made a HUGE improvement! (words... slowly... typing... to a very-reasonable-reading-speed generation on a 2017 GPU)

... what is that 35? Should I try larger numbers until it breaks?

Or, alternatively, a "noob" mode feature request: "If I don't specify anything, try some reasonable defaults depending on what hardware is available."
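For context (a hedged reading based on the loader logs earlier in these issues): --n-gpu-layers caps how many of the model's layers are offloaded to the GPU. A 7B model has 32 repeating layers plus a few non-repeating tensors, which is why the logs report "offloaded N/35 layers"; 35 therefore already means "offload everything", and larger numbers don't help. Smaller numbers only matter when VRAM runs out, e.g.:

# full offload (everything the model has)
./llamafile-server-0.1 --model mistral-7b-instruct-v0.1.Q4_K_M.gguf --n-gpu-layers 35
# partial offload if VRAM is tight
./llamafile-server-0.1 --model mistral-7b-instruct-v0.1.Q4_K_M.gguf --n-gpu-layers 20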

Trying to host llamafile behind an Apache reverse proxy, no response at all

Hi.
I am trying to host a llamafile on a Debian 11 server with Apache as a reverse proxy. I supplied the flags "--host 0.0.0.0 --port 57575", the port flag because port 8080 is already taken on my host. However, whenever I try to access the llamafile server, whether through the Apache reverse proxy or directly on port 57575 from the server itself, I get no response at all and the request just waits forever. Could it be that the server rejects all traffic from non-localhost by default, or is there something else I am missing?
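One way to narrow this down (a debugging sketch, not a fix): from a shell on the server itself, check whether the backend answers on loopback and on the bound address before involving Apache at all:

curl -v http://127.0.0.1:57575/
curl -v http://$(hostname -I | awk '{print $1}'):57575/   # the machine's own LAN address

If the first works but the second hangs, the bind or a firewall is the problem rather than the proxy; if both work, the Apache ProxyPass configuration is the next place to look.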

GPU numbering on Windows possibly in wrong order

I have multiple NVIDIA GPUs and originally thought it was reporting usage of the wrong one. Now I'm not sure it's using either of them. Is there a way to check for sure, or to pass in a preferred device?

Windows 11 session here, x64 native tools command prompt

https://gist.github.com/danbri/d8a387321642b14336701dedf166527f (excerpts only below)

It correctly finds 2 NVIDIA CUDA GPU devices:

ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6

[...]

Later it reports:

ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 8801.76 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB

In the Web UI on :8082 when I start a task, I see the supposedly "main device" GPU (a 3090, external usb box; not the most efficient use of it but hey) at 0% utilization in Task Manager. The built-in NVIDIA appears to be in low level use (4% max) but that seems to be background Window Manager usage. CPU usage goes to 45 or 50% while generating response tokens. Given the "offloading nothing to GPU" log messages, I guess it isn't actually using either NVIDIA GPU, despite noticing them?

If we disconnect the external 3090 NVIDIA GPU, and re-run llamafile, it recognises the remaining internal NVIDIA, and things seem similar except the log now says just

llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 8801.76 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/43 layers to GPU

The only processes Task Manager reports for what it calls GPU 1 (the NVIDIA) are Desktop Window Manager and Client Server Runtime Process.

I started out thinking it was using the wrong GPU; now I'm not convinced that either GPU is being used.
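A hedged reading of those logs: "offloaded 0/43 layers to GPU" means neither card is doing any work; the devices are detected but nothing was offloaded because no --n-gpu-layers value was given. Something like the following should load a GPU, assuming the underlying llama.cpp --main-gpu option is passed through for choosing which detected device to use:

# your-model.gguf is a placeholder for whichever GGUF you are loading
llamafile.exe -m your-model.gguf --n-gpu-layers 43 --main-gpu 0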

Issue to solve?

Hi there.
I’m quite a noob in AI world, so I don’t see the main problem you try to solve with the current project.
I’m used to get gguf files from huggingface.
What is the issue I have now ?
What will be better with your approach ?

Server Missing OpenAI API Support?

The server presents the UI but seems to be missing the APIs?

The example test:

curl -i http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
    "role": "system",
    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
},
{
    "role": "user",
    "content": "Write a limerick about python exceptions"
}
]
}'

Results in a 404 error:

HTTP/1.1 404 Not Found
Access-Control-Allow-Headers: content-type
Access-Control-Allow-Origin: *
Content-Length: 14
Content-Type: text/plain
Keep-Alive: timeout=5, max=5
Server: llama.cpp

File Not Found

Windows console usability - tab completion cluttered with ll*-named files

Minor usability thing

llamafile creates a .llamafile directory with compilation artifacts, and a llama.log log file.

These seem to interact awkwardly with tab completion, given that many llamafiles currently have names beginning with lla*, e.g. llava-v1.5-… or llamafile-server-0.1-….

I'm not really a Windows person, but in the Developer Command Prompt, if I type 'll' and hit Tab, I see ".llamafile" and "llama.log" before it gets to the actual llamafiles. Windows PowerShell seems a bit different: it ignores the dotfile ".llamafile" but still shows "llama.log" first.

/usr/bin/nvcc: returned nonzero exit status

extracting /zip/llama.cpp/ggml-cuda.cu to /home/hwu/.llamafile/ggml-cuda.cu
building ggml-cuda with nvcc -arch=native...
nvcc fatal   : Value 'native' is not defined for option 'gpu-architecture'
/usr/bin/nvcc: returned nonzero exit status
building nvidia compute capability detector...
building ggml-cuda with nvcc -arch=compute_61...
nvcc fatal   : Unknown option '-forward-unknown-to-host-compiler'
/usr/bin/nvcc: returned nonzero exit status

nvidia-smi:

Sun Dec  3 12:34:40 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    On  | 00000000:2D:00.0 Off |                  N/A |
|  0%   33C    P8               5W / 120W |   1572MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1393      G   /usr/lib/xorg/Xorg                            9MiB |
|    0   N/A  N/A      1584      G   /usr/bin/gnome-shell                          1MiB |
|    0   N/A  N/A      2923      C   ...yenv/versions/3.11.6/bin/python3.11     1556MiB |
+---------------------------------------------------------------------------------------+

Linux hwu-5950 5.15.0-89-generic #99~20.04.1-Ubuntu SMP Thu Nov 2 15:16:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Launch failing on NixOS due to missing /bin/mkdir

Awesome work! Should the following be supported?

$ sh -c ./mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile
./mistral-7b-instruct-v0.1-Q4_K_M-server.llamafile: line 60: /bin/mkdir: No such file or directory

$ which mkdir
/run/current-system/sw/bin/mkdir

This is with the server binary linked in the README, on aarch64 NixOS.

How to tell llamafile to *not* use system wine on a Linux box with wine installed

Today I tried simonwillison's llamafile example on 2 machines:

  • On my work machine (Debian 12.2, Gnome, Kitty/Fish, no wine installed), it runs flawlessly
  • On my home machine (Arch, Gnome, Kitty/Fish, wine installed), it seems to be automatically detected as a Windows binary, tries to run via wine, and crashes:
20:41:24 ~ ./llamafile-server-0.1-llava-v1.5-7b-q4
0088:err:ntoskrnl:ZwLoadDriver failed to create driver L"\\Registry\\Machine\\System\\CurrentControlSet\\Services\\wineusb": c0000142
003c:fixme:service:scmdatabase_autostart_services Auto-start service L"wineusb" failed to start: 1114
00b0:fixme:hid:handle_IRP_MN_QUERY_ID Unhandled type 00000005
00a8:err:sync:RtlpWaitForCriticalSection section 7BD9A2C0 "../wine/dlls/ntdll/loader.c: loader_section" wait timed out in thread 00a8, blocked by 00b0, retrying (60 sec)
003c:err:service:process_send_command receiving command result timed out
003c:fixme:service:scmdatabase_autostart_services Auto-start service L"WineBus" failed to start: 1053
003c:err:service:process_send_start_message service L"PlugPlay" failed to start
003c:fixme:service:scmdatabase_autostart_services Auto-start service L"PlugPlay" failed to start: 1053
00b0:fixme:hid:handle_IRP_MN_QUERY_ID Unhandled type 00000005
00b0:fixme:hid:handle_IRP_MN_QUERY_ID Unhandled type 00000005
00b0:fixme:hid:handle_IRP_MN_QUERY_ID Unhandled type 00000005
0154:fixme:ntdll:NtQuerySystemInformation info_class SYSTEM_PERFORMANCE_INFORMATION
wine: configuration in L"/home/ronj/.wine" has been updated.
Application could not be started, or no application associated with the specified file.
ShellExecuteEx failed: Bad EXE format for Z:\home\ronj\llamafile-server-0.1-llava-v1.5-7b-q4.

If, instead of running ./llamafile-server-0.1-llava-v1.5-7b-q4 directly, I run sh ./llamafile-server-0.1-llava-v1.5-7b-q4, things work well.

Is this a known glitch, or a problem with my machine? Can something be done in llamafile's packaging / binary build to "de-prioritize" running via wine and prefer running natively on Linux? Should the README usage suggestion be updated to tell users to run sh ./llamafile instead of just ./llamafile?
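The usual suggestion for this (hedged; double-check the exact registration line against the llamafile Gotchas and Cosmopolitan docs before relying on it) is to register the APE loader with binfmt_misc, so the kernel dispatches these binaries to it instead of letting the MZ prefix match wine's handler:

sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
# register a prefix match on the APE header so /usr/bin/ape is used as the interpreter
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' > /proc/sys/fs/binfmt_misc/register"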

MacOS 13.0.1 `exec format error`

Issue

$  chmod +x llamafile-server-0.1-llava-v1.5-7b-q4
$  ./llamafile-server-0.1-llava-v1.5-7b-q4
zsh: exec format error: ./llamafile-server-0.1-llava-v1.5-7b-q4

$ ./llamafile
zsh: exec format error: ./llamafile

Troubleshooting:

  • downloaded twice to see if the file was corrupted
  • tried from a different shell
  • ensured Xcode is installed

xcode-select --install
xcode-select: error: command line tools are already installed, use "Software Update" in System Settings to install updates

Env

$uname -a
Darwin Laptop.local 22.1.0 Darwin Kernel Version 22.1.0: Sun Oct  9 20:15:09 PDT 2022; root:xnu-8792.41.9~2/RELEASE_ARM64_T6000 arm64

macOS 13.0.1

M1 Pro
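One thing worth trying, based on the wine report above where the same binary runs fine when launched through a shell: a llamafile also doubles as a shell script, so invoking it via sh sidesteps the shell's executable-format check:

sh ./llamafile-server-0.1-llava-v1.5-7b-q4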

[NVIDIA cuBLAS GPU] Bad system call

Using --n-gpu-layers, it compiled successfully, but I got a Bad system call error:

sh-4.2$ ./mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile --temp 0.7 -r '\n' --n-gpu-layers 35 -p "what is a housekeeper?"
extracting /zip/llama.cpp/ggml.h to /home/ec2-user/.llamafile/ggml.h
extracting /zip/llamafile/compcap.cu to /home/ec2-user/.llamafile/compcap.cu
extracting /zip/llamafile/llamafile.h to /home/ec2-user/.llamafile/llamafile.h
extracting /zip/llama.cpp/ggml-cuda.h to /home/ec2-user/.llamafile/ggml-cuda.h
extracting /zip/llama.cpp/ggml-cuda.cu to /home/ec2-user/.llamafile/ggml-cuda.cu
building ggml-cuda with nvcc -arch=native...
NVIDIA cuBLAS GPU support successfully loaded
Log start
main: build = 1500 (a30b324)
main: built with cosmocc (GCC) 11.2.0 for x86_64-linux-cosmo
main: seed  = 1701700519
Bad system call

Here it is on the second call:

sh-4.2$ ./mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile --temp 0.7 -r '\n' --n-gpu-layers 35 -p "what is a housekeeper?"
NVIDIA cuBLAS GPU support successfully loaded
Log start
main: build = 1500 (a30b324)
main: built with cosmocc (GCC) 11.2.0 for x86_64-linux-cosmo
main: seed  = 1701700544
Bad system call

I have nvcc installed in the system PATH:

sh-4.2$ nvcc
nvcc fatal   : No input files specified; use option --help for more information
sh-4.2$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
sh-4.2$ 

and nvidia-smi shows:

Every 2.0s: nvidia-smi                                                                                                  Mon Dec  4 14:37:54 2023

Mon Dec  4 14:37:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   16C    P8              16W / 300W |      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Cannot find compiler 'cl.exe' in PATH

Windows 11.

I have cl.exe and CUDA in PATH, Visual Studio installed, and the latest CUDA. I get this; it's probably because I am not all that bright...

llamafile -m mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile
building ggml-cuda with nvcc -arch=native...
nvcc fatal : Cannot find compiler 'cl.exe' in PATH
/C/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/bin/nvcc.exe: returned nonzero exit status
building nvidia compute capability detector...
nvcc fatal : Cannot find compiler 'cl.exe' in PATH
/C/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.3/bin/nvcc.exe: returned nonzero exit status
{"timestamp":1701366622,"level":"INFO","function":"main","line":2258,"message":"build info","build":1500,"commit":"a30b324"}
{"timestamp":1701366622,"level":"INFO","function":"main","line":2261,"message":"system info","n_threads":16,"n_threads_batch":-1,"total_threads":32,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
gguf_init_from_file: invalid magic characters: MZqF.
error: llama_model_loader: failed to load model from mistral-7b-instruct-v0.1-Q4_K_M-main.llamafile
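Two tentative observations on that log: the nvcc failures just mean cl.exe isn't on PATH in that particular shell (launching from the "x64 Native Tools Command Prompt for VS" is the usual fix), and the final "invalid magic characters: MZqF" error is a separate problem: -m was given another llamafile executable (whose header starts with the APE magic MZqF) rather than a plain .gguf weights file. Pointing -m at an actual GGUF should clear that part, e.g.:

llamafile -m mistral-7b-instruct-v0.1.Q4_K_M.gguf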
