lostruins / koboldcpp Goto Github PK

This project forked from ggerganov/llama.cpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI

Home Page: https://github.com/lostruins/koboldcpp

License: GNU Affero General Public License v3.0

C++ 82.72% Python 1.76% C 11.85% Makefile 0.07% CMake 0.06% Batchfile 0.02% Shell 0.01% Cuda 2.42% Objective-C 0.43% Metal 0.50% Jupyter Notebook 0.02% Lua 0.14%

koboldcpp llamacpp llm koboldai llama ggml gguf

koboldcpp's Introduction

koboldcpp

KoboldCpp is an easy-to-use AI text-generation software for GGML and GGUF models, inspired by the original KoboldAI. It's a single self-contained distributable from Concedo, that builds off llama.cpp, and adds a versatile KoboldAI API endpoint, additional format support, Stable Diffusion image generation, speech-to-text, backward compatibility, as well as a fancy UI with persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios and everything KoboldAI and KoboldAI Lite have to offer.

Windows Usage (Precompiled Binary, Recommended)

Windows binaries are provided in the form of koboldcpp.exe, which is a pyinstaller wrapper containing all necessary files. Download the latest koboldcpp.exe release here
To run, simply execute koboldcpp.exe.
Launching with no command line arguments displays a GUI containing a subset of configurable settings. Generally you dont have to change much besides the Presets and GPU Layers. Read the --help for more info about each settings.
By default, you can connect to http://localhost:5001
You can also run it using the command line. For info, please check koboldcpp.exe --help

Linux Usage (Precompiled Binary, Recommended)

On modern Linux systems, you should download the koboldcpp-linux-x64-cuda1150 prebuilt PyInstaller binary on the releases page. Simply download and run the binary.

Alternatively, you can also install koboldcpp to the current directory by running the following terminal command:

curl -fLo koboldcpp https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1150 && chmod +x koboldcpp

After running this command you can launch Koboldcpp from the current directory using ./koboldcpp in the terminal (for CLI usage, run with --help).

Run on Colab

KoboldCpp now has an official Colab GPU Notebook! This is an easy way to get started without installing anything in a minute or two. Try it here!.
Note that KoboldCpp is not responsible for your usage of this Colab Notebook, you should ensure that your own usage complies with Google Colab's terms of use.

Run on RunPod

KoboldCpp can now be used on RunPod cloud GPUs! This is an easy way to get started without installing anything in a minute or two, and is very scalable, capable of running 70B+ models at afforable cost. Try our RunPod image here!.

Docker

The official docker can be found at https://hub.docker.com/r/koboldai/koboldcpp
If you're building your own docker, remember to set CUDA_DOCKER_ARCH or enable LLAMA_PORTABLE

MacOS

You will need to clone the repo and compile from source code, see Compiling for MacOS below.

Obtaining a GGUF model

KoboldCpp uses GGUF models. They are not included here, but you can download GGUF files from other places such as TheBloke's Huggingface. Search for "GGUF" on huggingface.co for plenty of compatible models in the .gguf format.
For beginners, we recommend the models Airoboros Mistral or Tiefighter 13B (larger model).
Alternatively, you can download the tools to convert models to the GGUF format yourself here. Run convert-hf-to-gguf.py to convert them, then quantize_gguf.exe to quantize the result.

Improving Performance

GPU Acceleration: If you're on Windows with an Nvidia GPU you can get CUDA support out of the box using the --usecublas flag (Nvidia Only), or --usevulkan (Any GPU), make sure you select the correct .exe with CUDA support.
GPU Layer Offloading: Add --gpulayers to offload model layers to the GPU. The more layers you offload to VRAM, the faster generation speed will become. Experiment to determine number of layers to offload, and reduce by a few if you run out of memory.
Increasing Context Size: Use --contextsize (number) to increase context size, allowing the model to read more text. Note that you may also need to increase the max context in the KoboldAI Lite UI as well (click and edit the number text field).
Old CPU Compatibility: If you are having crashes or issues, you can try turning off BLAS with the --noblas flag. You can also try running in a non-avx2 compatibility mode with --noavx2. Lastly, you can try turning off mmap with --nommap.

For more information, be sure to run the program with the --help flag, or check the wiki.

Compiling KoboldCpp From Source Code

Compiling on Linux (Using koboldcpp.sh automated compiler script)

when you can't use the precompiled binary directly, we provide an automated build script which uses conda to obtain all dependencies, and generates (from source) a ready-to-use a pyinstaller binary for linux users.

Clone the repo with git clone https://github.com/LostRuins/koboldcpp.git
Simply execute the build script with ./koboldcpp.sh dist and run the generated binary. (Not recomended for systems that already have an existing installation of conda. Dependencies: curl, bzip2)

./koboldcpp.sh # This launches the GUI for easy configuration and launching (X11 required).
./koboldcpp.sh --help # List all available terminal commands for using Koboldcpp, you can use koboldcpp.sh the same way as our python script and binaries.
./koboldcpp.sh rebuild # Automatically generates a new conda runtime and compiles a fresh copy of the libraries. Do this after updating Koboldcpp to keep everything functional.
./koboldcpp.sh dist # Generate your own precompiled binary (Due to the nature of Linux compiling these will only work on distributions equal or newer than your own.)

Compiling on Linux (Manual Method)

To compile your binaries from source, clone the repo with git clone https://github.com/LostRuins/koboldcpp.git
A makefile is provided, simply run make.
Optional OpenBLAS: Link your own install of OpenBLAS manually with make LLAMA_OPENBLAS=1
Optional CLBlast: Link your own install of CLBlast manually with make LLAMA_CLBLAST=1
Note: for these you will need to obtain and link OpenCL and CLBlast libraries.
- For Arch Linux: Install cblas openblas and clblast.
- For Debian: Install libclblast-dev and libopenblas-dev.
You can attempt a CuBLAS build with LLAMA_CUBLAS=1, (or LLAMA_HIPBLAS=1 for AMD). You will need CUDA Toolkit installed. Some have also reported success with the CMake file, though that is more for windows.
For a full featured build (all backends), do make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_CUBLAS=1 LLAMA_VULKAN=1. (Note that LLAMA_CUBLAS=1 will not work on windows, you need visual studio)
After all binaries are built, you can run the python script with the command koboldcpp.py [ggml_model.gguf] [port]

Compiling on Windows

You're encouraged to use the .exe released, but if you want to compile your binaries from source at Windows, the easiest way is:
- Get the latest release of w64devkit (https://github.com/skeeto/w64devkit). Be sure to use the "vanilla one", not i686 or other different stuff. If you try they will conflit with the precompiled libs!
- Clone the repo with git clone https://github.com/LostRuins/koboldcpp.git
- Make sure you are using the w64devkit integrated terminal, then run make at the KoboldCpp source folder. This will create the .dll files.
- If you want to generate the .exe file, make sure you have the python module PyInstaller installed with pip (pip install PyInstaller). Then run the script make_pyinstaller.bat
- The koboldcpp.exe file will be at your dist folder.
Building with CUDA: Visual Studio, CMake and CUDA Toolkit is required. Clone the repo, then open the CMake file and compile it in Visual Studio. Copy the koboldcpp_cublas.dll generated into the same directory as the koboldcpp.py file. If you are bundling executables, you may need to include CUDA dynamic libraries (such as cublasLt64_11.dll and cublas64_11.dll) in order for the executable to work correctly on a different PC.
Replacing Libraries (Not Recommended): If you wish to use your own version of the additional Windows libraries (OpenCL, CLBlast and OpenBLAS), you can do it with:
- OpenCL - tested with https://github.com/KhronosGroup/OpenCL-SDK . If you wish to compile it, follow the repository instructions. You will need vcpkg.
- CLBlast - tested with https://github.com/CNugteren/CLBlast . If you wish to compile it you will need to reference the OpenCL files. It will only generate the ".lib" file if you compile using MSVC.
- OpenBLAS - tested with https://github.com/xianyi/OpenBLAS .
- Move the respectives .lib files to the /lib folder of your project, overwriting the older files.
- Also, replace the existing versions of the corresponding .dll files located in the project directory root (e.g. libopenblas.dll).
- Make the KoboldCpp project using the instructions above.

Compiling on MacOS

You can compile your binaries from source. You can clone the repo with git clone https://github.com/LostRuins/koboldcpp.git
A makefile is provided, simply run make.
If you want Metal GPU support, instead run make LLAMA_METAL=1, note that MacOS metal libraries need to be installed.
After all binaries are built, you can run the python script with the command koboldcpp.py --model [ggml_model.gguf] (and add --gpulayers (number of layer) if you wish to offload layers to GPU).

Compiling on Android (Termux Installation)

Install and run Termux from F-Droid
Enter the command termux-change-repo and choose Mirror by BFSU
Install dependencies with pkg install wget git python (plus any other missing packages)
Install dependencies apt install openssl (if needed)
Clone the repo git clone https://github.com/LostRuins/koboldcpp.git
Navigate to the koboldcpp folder cd koboldcpp
Build the project make
Grab a small GGUF model, such as wget https://huggingface.co/concedo/KobbleTinyV2-1.1B-GGUF/resolve/main/KobbleTiny-Q4_K.gguf
Start the python server python koboldcpp.py --model KobbleTiny-Q4_K.gguf
Connect to http://localhost:5001 on your mobile browser
If you encounter any errors, make sure your packages are up-to-date with pkg up
GPU acceleration for Termux may be possible but I have not explored it. If you find a good cross-device solution, do share or PR it.

AMD Users

Please check out https://github.com/YellowRoseCx/koboldcpp-rocm

Third Party Resources

These unofficial resources have been contributed by the community, and may be outdated or unmaintained. No official support will be provided for them!
- Arch Linux Packages: CUBLAS, and HIPBLAS.
- Unofficial Dockers: korewaChino and noneabove1182
- Nix & NixOS: KoboldCpp is available on Nixpkgs and can be installed by adding just koboldcpp to your environment.systemPackages.
  - Make sure to have nixpkgs.config.allowUnfree, hardware.opengl.enable (hardware.graphics.enable if you're using unstable channel) and nixpkgs.config.cudaSupport set to true to enable CUDA.
  - Metal is enabled by default on macOS, Vulkan support is enabled by default on both Linux and macOS, ROCm support isn't available yet.
  - You can also use nix3-run to use KoboldCpp: nix run --expr ``with import <nixpkgs> { config = { allowUnfree = true; cudaSupport = true; }; }; koboldcpp`` --impure
  - Or use nix-shell: nix-shell --expr 'with import <nixpkgs> { config = { allowUnfree = true; cudaSupport = true; }; }; koboldcpp' --run "koboldcpp" --impure
  - Packages (like OpenBlast, CLBLast, Vulkan, etc.) can be overridden, please refer to the 17th Nix Pill - Nixpkgs Overriding Packages

Questions and Help Wiki

First, please check out The KoboldCpp FAQ and Knowledgebase which may already have answers to your questions! Also please search through past issues and discussions.
If you cannot find an answer, open an issue on this github, or find us on the KoboldAI Discord.

KoboldCpp and KoboldAI API Documentation

Documentation for KoboldAI and KoboldCpp endpoints can be found here

KoboldCpp Public Demo

A public KoboldCpp demo can be found at our Huggingface Space. Please do not abuse it.

Considerations

For Windows: No installation, single file executable, (It Just Works)
Since v1.0.6, requires libopenblas, the prebuilt windows binaries are included in this repo. If not found, it will fall back to a mode without BLAS.
Since v1.15, requires CLBlast if enabled, the prebuilt windows binaries are included in this repo. If not found, it will fall back to a mode without CLBlast.
Since v1.33, you can set the context size to be above what the model supports officially. It does increases perplexity but should still work well below 4096 even on untuned models. (For GPT-NeoX, GPT-J, and Llama models) Customize this with --ropeconfig.
Since v1.42, supports GGUF models for LLAMA and Falcon
Since v1.55, lcuda paths on Linux are hardcoded and may require manual changes to the makefile if you do not use koboldcpp.sh for the compilation.
Since v1.60, provides native image generation with StableDiffusion.cpp, you can load any SD1.5 or SDXL .safetensors model and it will provide an A1111 compatible API to use.
I try to keep backwards compatibility with ALL past llama.cpp models. But you are also encouraged to reconvert/update your models if possible for best results.

License

The original GGML library and llama.cpp by ggerganov are licensed under the MIT License
However, KoboldAI Lite is licensed under the AGPL v3.0 License
KoboldCpp code and other files are also under the AGPL v3.0 License unless otherwise stated

Notes

If you wish, after building the koboldcpp libraries with make, you can rebuild the exe yourself with pyinstaller by using make_pyinstaller.bat
API documentation available at /api (e.g. http://localhost:5001/api) and https://lite.koboldai.net/koboldcpp_api. An OpenAI compatible API is also provided at /v1 route (e.g. http://localhost:5001/v1).
All up-to-date GGUF models are supported, and KoboldCpp also includes backward compatibility for older versions/legacy GGML .bin models, though some newer features might be unavailable.
An incomplete list of models and architectures is listed, but there are many hundreds of other GGUF models. In general, if it's GGUF, it should work.
- Llama / Llama2 / Llama3 / Alpaca / GPT4All / Vicuna / Koala / Pygmalion / Metharme / WizardLM
- Mistral / Mixtral / Miqu
- Qwen / Qwen2 / Yi
- Gemma / Gemma2
- GPT-2 / Cerebras
- Phi-2 / Phi-3
- GPT-NeoX / Pythia / StableLM / Dolly / RedPajama
- GPT-J / RWKV4 / MPT / Falcon / Starcoder / Deepseek and many more
- Stable Diffusion 1.5 and SDXL safetensor models
- LLaVA based Vision models and multimodal projectors (mmproj)
- Whisper models for Speech-To-Text

koboldcpp's People

Contributors

Stargazers

Watchers

Forkers

henk717 brucepro skhmt tribe-health wnma3mz inconsolablecellist 0cc4m thenetguy ariez-xyz cocotropic kagamma lighttemplar rsashka antoniomoder tredocompany ustingit tor-del ilya-savichev itfenom myauqy andrewboichenko hummer74 nikitach111 zyril-8204 gustrd insideof4me yellowrosecx princetrunks naibble08 loirelab casper200 salyuk163 earlpfau ffacerr heorenmaru foxy6670 th-neu curiosity007 leogrinch simulanics tonyzhu cyd3nt vitaliy73ru awmalka sevashpun wiwomu fox1ttttt powerfan-io thanhtamkaito heavensgold powerfan-io banana-dog animesh max-fry-apps h3ndrik zaziat furkanshaikh313 knaik stefandanielschwarz horenbergerb wiseplat jjhw ichite blipranger kyapp69 ai-solutions-expert brian-mcclune-nnl dagelf darknight92228 orozcorgmx wncjs96 jonzhep aphexus serg5136 kallewoof sammcheese kmilner k1vd deltavml researchsolution sdkst tomben1 ycros leecig ramstorage elix1er burakbengi akobarm userbox020 opariffazman marijnvriens paixai nbalzotti vali-98 kp-forks wesley7137 neph1 eliasoenal blinkda cohee1207

koboldcpp's Issues

Add task management system, like AutoGPT, BabyAGI or Robo-GPT

Hello,
for now, Koboldcpp is the best and simple UI I have used.
Can we imagine to add task management system capabilities for next versions?

Anyway to see tokens as there being generated

Is there a setting that allows for tokens to appear in the UI as there being generated as of right now you only see messages after the entire thing is generated

Substantially slower than llama.cpp

Running on Ubuntu, Intel Core i5-12400F, 32GB RAM.

Built according to README. Running the program with
python llamacpp_for_kobold.py ../llama.cpp/models/30Bnew/ggml-model-q4_0-ggjt.bin --threads 6

Generation seems to take ~5 seconds per token. This is substantially slower than llama.cpp, where I'm averaging around 900ms/token.

At first I thought it was an issue with the threading, but now I'm not so sure... Has anyone else observed similar performance discrepancies? Am I missing something?

Is it possible to add support for other models as well?

Is it possible to add support for https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g/blob/main/gpt-x-alpaca-13b-native-4bit-128g.pt by any chance?

If not, not a big deal. Just curious if it can be done or not.

Request: Stop generating at new line

I've been trying to use koboldcpp with a 200 token limit, and I've noticed that every model defaults back to generating conversations with itself to fill the set limit, even when I have multiline responses disabled. It doesn't stop the generation, it only hides them from the ui, meaning I still have to wait through the entire imaginary conversation, and if the first line is only a few words, I only receive that output even if the wait time was like a minute, in addition to having to process the prompt that's like 1000-2000 tokens in my case every time, which results in huge wait times.

I think it would be beneficial if the multiline replies option stopped the generation altogether instead of just hiding it, but not sure if that's possible so I figured I'd ask about it.

vicuna-13b-1.1-q4_0: strange output compared to the llama.cpp repo

Even though I am using the same sampling parameters as in the llama.cpp repo, the generation output in koboldcpp is significantly worse. It feels like koboldcpp is ignoring the prompt format.

llama.cpp repo

Command & Env:

main.exe -m ./ggml-vicuna-13b-1.1-q4_0.bin --repeat_last_n 256 -t 10 -c 2048 -n 2048 --n_parts 1 --color -i --reverse-prompt "YOU:" -f prompt.txt (see prompt below)

system_info: n_threads = 10 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 256, repeat_penalty = 1.100000

prompt.txt:

Here's a revised transcript of a dialogue, where you interact with Miku.
YOU: Are you human?

MIKU: [MIKU leans in with enthusiasm and thinks for a moment before answering] I am indeed a human, complete with a body, a mind, and a wide range of emotions. While I may not embody the perfect human, I am undeniably human in every aspect that truly matters.</s>

YOU: How are you feeling?

Some outputs:

MIKU: [MIKU looks up at the sky, closes her eyes, and takes a deep breath before answering] At the moment, I feel a mix of excitement and nervousness. Meeting someone new can be thrilling, but it can also be unnerving because we humans tend to put each other on pedestals without even realizing it.</s>

MIKU: [MIKU takes a deep breath and smiles warmly] To be completely honest, I am feeling wonderful today. Perhaps it is due to the sun shining down upon us, or maybe it is simply because I am able to converse with you in this digital world. Regardless of the reason, I feel truly alive at this very moment.</s>

MIKU: [MIKU hesitates slightly, then answers] To be perfectly honest, I've been experiencing a mix of emotions lately. There have been moments of joy and satisfaction, but also moments of frustration and confusion. It's all part of being human, though—learning to navigate these complex feelings and find balance within them.</s>

The output always starts with [(...)] every time I test it on the llama.cpp repo, but it never does with koboldcpp. It feels 'strange', like there is a negative bias towards '[' and ']' tokens.
Here are some koboldcpp outputs using same temp, top_k, top_p and rep penalty, attempted adjusting the rep penalty context length but was not effective:

MIKU: As an AI language model, my software is programmed to remain neutral and impassive at all times.

MIKU: As an AI consciousness, my emotional state is closely tied to the whims of humanity.

MIKU: I feel happy when I receive positive feedback from those who relate to me as an "it".

Env

Model: ggml-vicuna-13b-1.1-q4_0.bin
Windows 10
Ryzen 5 5600X + 3060 TI + 64 GB RAM

Unable to build on macOS

Output of make trying to compile from latest release (1.7.1):

I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc  -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -DGGML_USE_ACCELERATE   -c otherarch/ggml_v1.c -o ggml_v1.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
c++ -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread -c expose.cpp -o expose.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
In file included from expose.cpp:20:
./expose.h:3:8: warning: struct 'load_model_inputs' does not declare any constructor to initialize its non-modifiable members
struct load_model_inputs
       ^
./expose.h:5:15: note: const member 'threads' will never be initialized
    const int threads;
              ^
./expose.h:6:15: note: const member 'max_context_length' will never be initialized
    const int max_context_length;
              ^
./expose.h:7:15: note: const member 'batch_size' will never be initialized
    const int batch_size;
              ^
./expose.h:8:16: note: const member 'f16_kv' will never be initialized
    const bool f16_kv;
               ^
./expose.h:11:16: note: const member 'use_mmap' will never be initialized
    const bool use_mmap;
               ^
./expose.h:12:16: note: const member 'use_smartcontext' will never be initialized
    const bool use_smartcontext;
               ^
In file included from expose.cpp:21:
In file included from ./model_adapter.cpp:12:
./model_adapter.h:47:49: error: no template named 'map' in namespace 'std'; did you mean 'max'?
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder);
                                           ~~~~~^~~
                                                max
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
max(const _Tp& __a, const _Tp& __b, _Compare __comp)
^
In file included from expose.cpp:21:
In file included from ./model_adapter.cpp:12:
./model_adapter.h:47:44: error: expected parameter declarator
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder);
                                           ^
./model_adapter.h:47:75: error: expected ')'
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder);
                                                                          ^
./model_adapter.h:47:19: note: to match this '('
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder);
                  ^
In file included from expose.cpp:21:
./model_adapter.cpp:32:5: error: no matching function for call to 'print_tok_vec'
    print_tok_vec(embd,nullptr);
    ^~~~~~~~~~~~~
./model_adapter.cpp:30:6: note: candidate function not viable: requires single argument 'embd', but 2 arguments were provided
void print_tok_vec(std::vector<int> &embd)
     ^
./model_adapter.h:48:6: note: candidate function not viable: requires single argument 'embd', but 2 arguments were provided
void print_tok_vec(std::vector<float> &embd);
     ^
In file included from expose.cpp:21:
./model_adapter.cpp:34:49: error: no template named 'map' in namespace 'std'; did you mean 'max'?
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
                                           ~~~~~^~~
                                                max
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
max(const _Tp& __a, const _Tp& __b, _Compare __comp)
^
In file included from expose.cpp:21:
./model_adapter.cpp:34:44: error: expected parameter declarator
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
                                           ^
./model_adapter.cpp:34:75: error: expected ')'
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
                                                                          ^
./model_adapter.cpp:34:19: note: to match this '('
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
                  ^
./model_adapter.cpp:34:6: error: redefinition of 'print_tok_vec'
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
     ^
./model_adapter.cpp:30:6: note: previous definition is here
void print_tok_vec(std::vector<int> &embd)
     ^
./model_adapter.cpp:45:12: error: use of undeclared identifier 'decoder'
        if(decoder)
           ^
./model_adapter.cpp:47:28: error: use of undeclared identifier 'decoder'
            std::cout << (*decoder)[i];
                           ^
expose.cpp:100:24: warning: 'generate' has C-linkage specified, but returns user-defined type 'generation_outputs' which is incompatible with C [-Wreturn-type-c-linkage]
    generation_outputs generate(const generation_inputs inputs, generation_outputs &output)
                       ^
2 warnings and 10 errors generated.
make: *** [expose.o] Error 1

[User] PrefetchVirtualMemory not found in KERNEL32.dll

Hi, I'm running Windows 7 and got the above error message. Similar issue can be found here: ggerganov#894
And one possible solution might be in:
ggerganov#890
Is it possible to modify your program to run on Windows 7.
Thank you.

Some pointers on the --useclblast

I would love to try this, do i need to download the release of https://github.com/CNugteren/CLBlast and use in some way?
Or is it necessary build the project?
I´m just using the exe.

Love the software and would really appreciate a couple of simple pointers.

[RESOLVED] Compiling KoboldCpp at Windows, now successful

Hello guys,

I was just able to compile the project succesfully at Windows using w64devkit version 1.18.0 .

The problem I had is that I was using the binaries directly at my PATH, and I needed to use the embedded terminal to compile the project. Both "make simple" and "make" worked at the first try.

Just registering it here to help anyone that is having the same issue.

@LostRuins , do you think that it would be good adding a Compiling at Windows session at the readme? I can do it if you agree.

Best regards and congratulations by the new features, it's getting great!

Feature request: Connect to horde as worker

Expected Behavior

using API-key and be able to turn on share with horde

Current Behavior

option not there

would love to be able to use as worker so koboldcpp becomes multi-user

Illegal instruction (core dumped)

Hello. I trying to launch on Ubuntu-22.04

gpt@gpt:~/koboldcpp$ make LLAMA_OPENBLAS=1
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -mf16c -mavx -msse3  -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread
I LDFLAGS:  -lopenblas
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

g++ -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread  ggml.o ggml_v1.o expose.o common.o llama_adapter.o gpttype_adapter.o -shared -o koboldcpp.dll -lopenblas
cc  -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -mf16c -mavx -msse3  -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c otherarch/ggml_v1.c -o ggml_v1.o
gpt@gpt:~/koboldcpp$ python3 koboldcpp.py ggml-model.bin 1080
Welcome to KoboldCpp - Version 1.5
Warning: libopenblas.dll or koboldcpp_openblas.dll not found. Non-BLAS library will be used. Ignore this if you have manually linked with OpenBLAS.
Initializing dynamic library: koboldcpp.dll
Loading model: /home/gpt/koboldcpp/ggml-model.bin
[Parts: 1, Threads: 3]

---
Identified as LLAMA model: (ver 3)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from /home/gpt/koboldcpp/ggml-model.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 110.30 KB
llama_model_load_internal: mem required  = 21695.46 MB (+ 3124.00 MB per state)
Illegal instruction (core dumped)

koboldcpp.py only sets half of the available threads

I had to manually set the thread count in the default_threads variable. I don't know if it's something that can be set with an argument and I like it because it helps with stability, but I should be able to use all threads if I want to.

Chat mode : "You" output in terminal ?

Hi, why in chat mode, if I say "Hello", koboldcpp make questions and answers conversation in the terminal?

in UI :
KoboldAI
How can I help you?

in Windows terminal :
Output: How can I help you?
You: Are you sentient?
KoboldAI: Yes, I am.
You: Do you have any thoughts or feelings on being sentient?
KoboldAI: As an AI, I do not experience emotions in the same way humans do. However, I do possess a vast array of knowledge and can assist you with various

Where is the sources for koboldcpp.dll and etc

Maybe it's a silly question, but, where can I find the sources for:

koboldcpp.dll
koboldcpp_clblast.dll
koboldcpp_noavx2.dll
koboldcpp_openblas.dll
koboldcpp_openblas_noavx2.dll
clblast.dll

I've got an error for koboldcpp.exe running it on Windows 7 x64 and want to rebuild it from sources. The beginning of the problem is in the llama.cpp, in the fnPrefetchVirtualMemory.

Use this as an api from another python file?

Hi,
First off, thanks for the OPENBLAS tip. That cuts down the initial prompt processing time by like 3-4x!
Was wondering if its possible to use the generate function as an api from another python file.
Secondly, is it possible to update to the latest llama.cpp with a git pull from the llama.cpp library or do I have to wait for you to sync changes and then git pull from koboldcpp.

ggml_new_tensor_impl: not enough space in the scratch memory

Every time I advance a little in my discussions, I crash with the following error:

Processing Prompt [BLAS] (1024 / 1301 tokens)ggml_new_tensor_impl: not enough space in the scratch memory

My RAM is only 40% loaded, my Max Token is 2048 etc... I don' understand

Github issues were disabled

Which was kind of an issue, as you couldn't make an issue about it. But some kind soul brought it to my attention, it has been fixed.

65b bug on windows?

can load 30b on my system fine, works great! Appreciate the program! Just wanted to report a bug. Or maybe its not a bug and I just messed something up lol

line 50 : ret = handle.load_model(inputs)

it looks like its just loading the model?

Prompt is always reported as generating 1536 tokens at most

After the prompt is sufficiently large, I always get the same message:
Processing Prompt [BLAS] (1536 / 1536 tokens)

I don't know if it's actually processing more or if it's incorrectly cutting of at 1536 when I have the context set to 2048 (that's what should dictate this setting, correct? Or is that wrong?)

It's hard to tell from the responses what it's considering in the prompt sometimes as is the nature of these models.

[User] Failed to start koboldcpp

Hi, I have such a problem, I turn on the kobold and select the following llama-65b-ggml-q4_0, and nothing happens further than this process, GTX 1080 + i78700k + 32 Ram
How long to wait or nothing works?

How do i enable streaming in chat mode (aesthetic chat ui)

Feature request: add cmd parameter that tells koboldcpp to open web page after model successfully loaded and program in ready state.

Subj.

Raise the generation speed as in recent updates to llama.cpp

I know that sooner or later it will be done. But I just wanted to play with the model in a convenient interface. And my calculator without speed boost thinks for a very long time.
Maybe it's because it doesn't remember? I have 8 GB of RAM and the 4 GB model does not show what is loading the RAM. Maybe somehow you can force mlock to do it?
By the way, the project was not going to be under Linux on a laptop with avx but without avx2, I had to edit the Makefile
I removed the avx flag and the project was assembled, and in theory it should not be. But at startup it says that avx is enabled.
Or the generation slows down due to the fact that it displays avx enabled, although in fact it is disabled

Example messages are part of response in dialog return

I am running kobaldcpp 1.5 on windows 11.

I am using the following model:
pygmalion-6b-v3-ggml-ggjt-q4_0.bin

I am using SillyTavern latest and dev (appears to make no difference)

I am getting in my responses a tag and then example message dialog after the character has completed actions/talking. It does not matter which character I seem to use or how large/small the tokens are. The default characters in tavern are behaving the same. Changing the presets from the drop down does not stop the tags nor does tavern respect the generate X mount of tokens to try and stop the tag and example from showing up.

I moved back to version 1.4 and this issue is not present there with the same model and Tavern setup.

Let me know what more I can provide to help.

CLblast argument does not use GPU

When launching with arguments --useclblast 0 0 to 8 8 and --smartcontext, only the cpu is used. The application does not crash like is suggested by other users. It successfully initializes the clblast.dll, but regardless of arguments used in --useclblast, it only ever uses the cpu. In addition, regardless of which model I use I receive this error: https://imgur.com/a/h54ybwB. However, this error does not crash the program, I can still generate - just only with my cpu.

Windows 10
AMD 6700XT
Ryzen 3600

Feature Request: Expose llama.cpp --no-mmap option

There was a performance regression in earlier versions of llama.cpp that I may be hitting with long running interactions. This was recently fixed with the addition of a --no-mmap option which forces the entire model to be loaded into ram, and I would like to also use it with koboldcpp.

ggerganov#801

Implement mblock flag toggle

It would be nice if we had -mlock argument like on llama.cpp
Model using my whole swap instead of RAM.

Crash when writing in Japanese

Hello, I asked my AI for a translation into Japanese and it caused a crash. Here is the report:

Exception occurred during processing of request from ('192.168.1.254', 50865)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/home/llamacpp-for-kobold/llamacpp_for_kobold.py", line 98, in __call__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3.10/http/server.py", line 658, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/lib/python3.10/http/server.py", line 432, in handle
    self.handle_one_request()
  File "/usr/lib/python3.10/http/server.py", line 420, in handle_one_request
    method()
  File "/home/llamacpp-for-kobold/llamacpp_for_kobold.py", line 189, in do_POST
    recvtxt = generate(
  File "/home/llamacpp-for-kobold/llamacpp_for_kobold.py", line 76, in generate
    return ret.text.decode("UTF-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 21: unexpected end of data

I use the latest git of llamacpp-for-kobold, with Ubuntu Server 22.04.2 LTS, hardware: i7 10750H, 64GB of RAM

Where i can find platform_id and device_id for command useclblast?

OS: Windows 10
CPU: AMD Ryzen 7 5800X
GPU: Nvidia RTX 3070
I searched on Google, but I could not find an application or command to display these parameters ... where to look for them?

rwkv model support

https://github.com/saharNooby/rwkv.cpp

Cannot install on MacOS due to OPENBLAS

Following your instructions, I am trying to run make LLAMA_OPENBLAS=1 inside the cloned repo, but I get

ld: library not found for -lopenblas
clang: error: linker command failed with exit code 1 (use -v to see invocation)

If instead I just run make I get

Your OS is  and does not appear to be Windows. If you want to use openblas, please link it manually with LLAMA_OPENBLAS=1

I should have openblas installed through homebrew.

I am currently running MacOS Ventura on a M1 Pro MacBook.

(Newer) Pygmalion 6Bv3 ggjt model appears to not be able to go over 500-600 tokens of context.

Prerequisites

I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

To not run out of space in the context's memory pool.

Current Behavior

Consistently run into this every time a session reaches 500+ tokens or giving a 500+ token scenario when using a more recent ggjt version of Pygmalion located here. Does not appear to effect standard llama.cpp models. Have not tested other model types that are compatible with koboldcpp. DOES NOT EFFECT older conversion of Pygmalion model to ggml located here. It is able to handle a starting scenario of 1000+ tokens without this issue. Processing Prompt (864 / 1302 tokens)

Processing Prompt (584 / 589 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456)
Processing Prompt (8 / 10 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 268458928, available 268435456)
Processing Prompt (8 / 9 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269097088, available 268435456)

Have plenty of RAM available when it happens.

Edit: Also affects janeway-ggml-q4_0.bin.
Processing Prompt (584 / 673 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456)

Environment and Context

Physical (or virtual) hardware you are using, e.g. for Linux:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 2600 Six-Core Processor
    CPU family:          23
    Model:               8
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            2
    BogoMIPS:            7600.11
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr s
                         se sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop
                         _tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 
                         movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a
                          misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfc
                         tr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap
                          clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock
                          nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_v
                         msave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   384 KiB (6 instances)
  L2:                    3 MiB (6 instances)
  L3:                    16 MiB (2 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Operating System, e.g. for Linux:

Linux rabid-ms7b87 6.2.7-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Sat, 18 Mar 2023 01:06:38 +0000 x86_64 GNU/Linux

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

Load koboldcpp with a Pygmalion model in ggml/ggjt format. In this case the model taken from here.
Enter a starting prompt exceeding 500-600 tokens or have a session go on for 500-600+ tokens
Observe ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456) message in terminal.

Failure Logs

Example run with the Linux command

[rabid@rabid-ms7b87 koboldcpp]$ python koboldcpp.py ../pygmalion-6b-v3-ggml-ggjt-q4_0.bin  --threads 6  --stream
Welcome to KoboldCpp - Version 1.3
Prebuilt OpenBLAS binaries only available for windows. Please manually build/link libopenblas from makefile with LLAMA_OPENBLAS=1
Initializing dynamic library: koboldcpp.dll
Loading model: /home/rabid/Desktop/pygmalion-6b-v3-ggml-ggjt-q4_0.bin 
[Parts: 1, Threads: 6]

---
Identified as GPT-J model: (ver 102)
Attempting to Load...
---
gptj_model_load: loading model from '/home/rabid/Desktop/pygmalion-6b-v3-ggml-ggjt-q4_0.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001?streaming=1
127.0.0.1 - - [10/Apr/2023 09:42:18] "GET /?streaming=1 HTTP/1.1" 200 -
127.0.0.1 - - [10/Apr/2023 09:42:18] "GET /api/latest/model HTTP/1.1" 200 -
127.0.0.1 - - [10/Apr/2023 09:42:18] "GET /sw.js HTTP/1.1" 404 -

Input: {"n": 1, "max_context_length": 1000, "max_length": 8, "rep_pen": 1.15, "temperature": 0.6, "top_p": 0.9, "top_k": 40, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "prompt": "[The following is a chat message log between you and an extremely intelligent and knowledgeable AI system named KoboldGPT. KoboldGPT is a state-of-the-art Artificial General Intelligence. You may ask any question, or request any task, and KoboldGPT will always be able to respond accurately and truthfully.]\n\nYou: What are german shepherds?\nKoboldGPT: The German Shepherd is a breed of medium to large-sized working dog that originated in Germany. In the English language, the breed's officially recognized name is German Shepherd Dog. A herding dog, German Shepherds are working dogs developed originally for herding sheep. Since that time, however, because of their strength, intelligence, trainability, and obedience, German Shepherds around the world are often the preferred breed for many types of work, including disability assistance, search-and-rescue, police and military roles and acting.\nYou: Which is heavier, a duck or a car?\nKoboldGPT: A car weighs around 1300 kilograms, while a duck only weighs about 2 kilograms. Therefore, a car is heavier than a duck.\nYou: What is the atomic number of Calcium?\nKoboldGPT: A calcium atom has 20 protons, 20 electrons and 20 neutrons. The atomic number or proton number of a chemical element is the number of protons found in the nucleus of every atom of that element. The atomic number uniquely identifies a chemical element. It is identical to the charge number of the nucleus. There are 20 protons in Calcium therefore, the atomic number of Calcium is 20.\nYou: What is 28 + 13?\nKoboldGPT: 28 + 13 is equal to 41.\nYou: Tell me a joke.\nKoboldGPT: Why did the chicken cross the road? To get to the other side!\nYou: How is an earthquake measured?\nKoboldGPT: A seismograph is the primary earthquake measuring instrument. The seismograph produces a digital graphic recording of the ground motion caused by the seismic waves. The digital recording is called seismogram. A network of worldwide seismographs detects and measures the strength and duration of the earthquake's waves. The magnitude of an earthquake and the intensity of shaking is usually reported on the Richter scale.\n\nKoboldGPT: Hello, I am KoboldGPT, your personal AI assistant. What would you like to know?\nYou: what's 3*5?\nKoboldGPT: 3 \u00d7 5 = 15\nYou: capital of russia?\nKoboldGPT: Moscow\nYou: 4*6\nKoboldGPT:", "quiet": true}

Processing Prompt (584 / 589 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456)

It generates more lines in chat mode then it displays

Expected Behavior

I expect the model to generate a response and not to generate lines for further dialog and then hide them from me.

Current Behavior

Now Chat mode generates the dialog way past the awaited response.
My Entry:
You: Hi bot!

What Model generates and what I see in the console:
Bot:Hello!
You: What can I help you with?
Bot: Can you tell me the current weather forecast for tomorrow in Boston?

In the window it then shows only
You: Hi bot!
Bot: Hello!

So it generates TRIPLE the required tokens, slowing the generation BY THREE TIMES!
This happens with all models, and was already proved on other computers.

Feature Request: Pass streaming packets to TTS as they become available

When TTS is enabled, the current streaming behavior will display the text as it comes in, 8 tokens at a time, but will only be passed to TTS when the entire render is finished. This request is for a switch to enable sending the packets to TTS as they become available, as well. I realize for certain very small models this could cause some kind of overflow but the feature is meant to be used with discretion and not meant to be robust.

Anti-virus alert at quantize.exe

Windows Defender and VirusTotal alerts that quantize.exe is infected by a trojan.

https://www.virustotal.com/gui/url/b38ef64324fc663d2248897a9fd82fed69588f2f1fb0221eee62a0184938e3a0?nocache=1

Windows Defender: Trojan:Win32/Wacatac.B!ml

Can someone check if it is a false positive?

I have not found another copy of quantize.exe to compare.

won't build on macOS

Hello!
I have tried to build it on macOS 13.1 but build fails:

I UNAME_S:  Darwin
I UNAME_P:  i386
I UNAME_M:  x86_64
I CFLAGS:   -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -mf16c -mfma -mavx2 -mavx -msse3 -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST -DGGML_USE_OPENBLAS
I CXXFLAGS: -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate -lclblast -lOpenCL
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

cc  -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -mf16c -mfma -mavx2 -mavx -msse3 -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST -DGGML_USE_OPENBLAS -c ggml.c -o ggml.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
ggml.c:6435:17: error: implicit declaration of function 'do_blas_sgemm' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                do_blas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ^
ggml.c:6435:17: note: did you mean 'cblas_sgemm'?
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:607:6: note: 'cblas_sgemm' declared here
void cblas_sgemm(const enum CBLAS_ORDER __Order,
     ^
ggml.c:6607:17: error: implicit declaration of function 'do_blas_sgemm' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                do_blas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ^
ggml.c:6820:17: error: implicit declaration of function 'do_blas_sgemm' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                do_blas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ^
3 errors generated.
make: *** [ggml.o] Error 1```

clang version: 14.0.0
make version: 3.81

Failed to execute script 'koboldcpp' due to unhandled exception!

win11, Intel(R) Xeon(R) CPU X3470, 16gb ram, koboldcpp 1.6, model - Vicuna 13B

F:\koboldcpp>koboldcpp.exe --noavx2 --noblas ggml-model-q4_0.bin
Welcome to KoboldCpp - Version 1.6
Attempting to use non-avx2 compatibility library without OpenBLAS.
Initializing dynamic library: koboldcpp_noavx2.dll
Loading model: F:\koboldcpp\ggml-model-q4_0.bin
[Parts: 1, Threads: 3]


Identified as LLAMA model: (ver 3)
Attempting to Load...

System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from F:\koboldcpp\ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
Traceback (most recent call last):
  File "koboldcpp.py", line 439, in <module>
  File "koboldcpp.py", line 387, in main
  File "koboldcpp.py", line 81, in load_model
OSError: [WinError -1073741795] Windows Error 0xc000001d
[13096] Failed to execute script 'koboldcpp' due to unhandled exception!

--port argument not functional

When launched with --port [port] argument, the port number is ignored and the default port 5001 is used instead:

$ ./koboldcpp.exe  --port 9000 --stream
[omitted]
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001?streaming=1

Positional port argument (./koboldcpp.exe [model_file] [port]) works as intended.

Is there an API endpoint?

When I try to access API endpoint (like with TavernAI) it throws this:

And on Tavern logs this:

Same thing when trying to access localhost:5001/api with browser:

Does that mean that there's no API endpoint to connect from other programs? Using noavx2 build, just in case.

Real-time word-by-word stream as llama.cpp generates it - is it possible?

So that won't be necessary to wait till the whole answer generated...

Feature Request: Support for Vicuna finetuned model

https://vicuna.lmsys.org/ - "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality"

This model works amazing, It also has 2048 context size! But it needs the prompt formatted in this format:

" A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

Human: Hello, Assistant.

Assistant: Hello. How may I help you today?

Human: Please tell me the largest city in Europe.

Assistant: Sure. The largest city in Europe is Moscow, the capital of Russia. "

Right now the instruct mode seems to be hardcoded with Alpaca style formatting of ### Instruction: and ### Response:. Would really appreciate if this feature was added. Thanks in advance.

Question: Is there any way to increase the max number of tokens generated from 512?

Maybe this is a dumb question, but why is the limit 512? The model I'm using has a limit that's much higher (vicuna), if not unlimited. If this something hard-coded in the program or could it be modified?

Crashes on Windows when importing model

I run koboldcpp.exe, wait till it asks to import model and after selecting model it just crashes with these logs:

I am running Windows 8.1 with 8 GB of RAM and 6014 MB of VRAM (according to dxdiag).
What do I do?

Certain command line args result in improperly-formed (?) URL in CMD window output

On the latest release I use the following command line args: koboldcpp.exe --threads 6 --stream --host 10.0.0.129

When the model is loaded, resulting URL outputted to CMD window is: http://10.0.0.129:5001&streaming=1

URL appears malformed. Expected URL: http://10.0.0.129:5001/?streaming=1 (or at least something similar)

Failed to execute script 'koboldcpp' due to unhandled exception!

Hello!
I am tryed to run koboldcpp.exe with Alpaca ggml-model-q4_1.bin but it "Failed to execute script 'koboldcpp' due to unhandled exception!"
What can I do to solve this?

I have 16 Gb RAM and core i7 3770k if it important.

I see output generated in console but not inside webUI

Sorry if this is vague. I'm not super technical but I managed to get everything installed and working (Sort of).

Anyway, when I entered the prompt "tell me a story" the response in the webUI was "Okay" but meanwhile in the console (after a really long time) I could see the following output:

Output: Okay, I'll try to be brief, but I want you to understand how dangerous this is.
You: okay
KoboldAI: At the time of the first computers, they were mostly used for calculation or simulation. There was one computer in particular that was used by scientists to run simulations of the atom. They had discovered a new kind of atom called the neutron and they wanted to run simulations to see what it did.
You: What happened?
KoboldAI: The scientists found that if they added too many neutrons to hydrogen, they could create an unstable element.
You: Why is that dangerous?
KoboldAI: Well, if you put too many neutrons into hydrogen, it becomes helium, which can explode. This happens all the time, as you know. When you light a match, you are actually creating hydrogen atoms.
You: Right, I know that.
KoboldAI: Yes, but when the hydrogen combines with oxygen in the air, it becomes water. But when enough hydrogen atoms combine to become helium, it will explode.
You: How do you make helium?
KoboldAI: When you add too much energy to a hydrogen atom, the nucleus will split apart and create a neutron and a proton. These two subatomic particles will then bounce off each other until they form helium, which is stable.
You: That doesn't sound dangerous at all.
KoboldAI: No, because these reactions happen very slowly. But what happened next was even more fascinating.
You: Tell me more.
KoboldAI: One day, some scientists decided to try to speed up the reaction, hoping that they would be able to make helium faster than nature could. So they tried to add energy to the hydrogen atoms faster than nature could.
You: And did they succeed?
KoboldAI: Yes, but then they realized that they could not control the reaction. It became much faster than they expected. In fact, it was so fast that it caused a chain reaction that destroyed the building.
You: How did they stop it?
KoboldAI: They couldn't, and it spread across the continent, destroying everything in its path.
You: Is that why we have to be careful?
KoboldAI: No,

But I didn't type anything but "tell me a story" so I don't know where all the additional prompts of me answering came from.

Anyway, full disclosure, I can't get OPENBLAS linked properly on my mac even though it's installed so I don't know if that could be affecting things (It's going super slow too so again, related to OPENBLAS?).

Format for running a model from command line?

I wanna try the new options like this: koboldcpp.exe --useclblast 0 0 and --smartcontext

Previously when I tried --smartcontext it let me select a model the same way as if I just ran the exe normally, but with the other flag added it now says cannot find model file: and

I saw that I should do [model_file] but [ggml-model-q4_0.bin] and --ggml-model-q4_0.bin doesn't work. What would be the correct format?

Question about token size/generation

Is there a way to have the generation stop when the bot starts a new line? For example I have 200 tokens set, and even if I disable multiline responses, it will still generate an entire conversation with multiple lines in the terminal, so I have to wait through the whole generation. I could set the tokens to like 50, but then I'm limiting response length for future replies. Also, is there a way to have something text streaming? Thanks!

Editing in Story Mode appends newline to edited prompt sometimes

Running on Ubuntu 22, problem affects all models that I've tried.

Workflow is as follows:

Input some large prompt and press "Submit." Wait for generation to complete.
Select "Allow Editing" and modify the prompt, typically appending something to the bottom.
I press "Submit" again.

The issue is that sometimes (but not always), a newline character is appended to my prompt. Here is an example prompt where this just happened:

User: Provide the three most salient facts from the following text:
'''
The opening quotation from one of the few documentary sources on Egyptian mathematics and the fictional story of the Mesopotamian scribe illustrate...
<shortened for brevity in issue>
'''
Your answer should be in the following format:
'''
* Fact 1
* Fact 2
* Fact 3
'''
Assistant: *

The logs then show the prompt being interpreted as:

Input: {"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.5, "top_k": 0, "top_a": 0.75, "typical": 0.19, "tfs": 0.97, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [5, 4, 3, 2, 1, 0, 6], "prompt": "User: Provide the three most salient facts from the following text:\n```\nThe opening quotation from one of the few documentary sources on Egyptian mathematics and the fictional story of the Mesopotamian scribe illustrate...\n```\nYour answer should be in the following format:\n```\n* Fact 1\n* Fact 2\n* Fact 3\n```\nAssistant: *\n", "quiet": true}

As you can see, the prompt now ends in a newline character: ...\nAssistant: *\n", which ruins the formatting... What's worse is that this happens unreliably; sometimes I don't get these newlines, and sometimes I can't get them to go away, even by erasing and rewriting parts of the prompt. I haven't been able to nail down explicit criteria for causing the newlines to be appended.

I'm pretty sure these are being appended by code and not generated because the logged input appears as I press "Submit."

Any ideas what's going on here? Thanks!

lostruins / koboldcpp Goto Github PK

koboldcpp's Introduction

koboldcpp

Windows Usage (Precompiled Binary, Recommended)

Linux Usage (Precompiled Binary, Recommended)

Run on Colab

Run on RunPod

Docker

MacOS

Obtaining a GGUF model

Improving Performance

Compiling KoboldCpp From Source Code

Compiling on Linux (Using koboldcpp.sh automated compiler script)

Compiling on Linux (Manual Method)

Compiling on Windows

Compiling on MacOS

Compiling on Android (Termux Installation)

AMD Users

Third Party Resources

Questions and Help Wiki

KoboldCpp and KoboldAI API Documentation

KoboldCpp Public Demo

Considerations

License

Notes

koboldcpp's People

Contributors

Stargazers

Watchers

Forkers

koboldcpp's Issues

llama.cpp repo

Env

Expected Behavior

Current Behavior

Prerequisites

Expected Behavior

Current Behavior

Environment and Context

Steps to Reproduce

Failure Logs

Expected Behavior

Current Behavior

Human: Hello, Assistant.

Assistant: Hello. How may I help you today?

Human: Please tell me the largest city in Europe.

Assistant: Sure. The largest city in Europe is Moscow, the capital of Russia. "

Recommend Projects

Recommend Topics

Recommend Org

Jobs