daskol / llama.py

Python bindings to llama.cpp

License: MIT License

Languages: Dockerfile 0.12%, Shell 0.45%, CMake 1.20%, Python 3.65%, C 73.84%, C++ 20.74%

Topics: alpaca, language-model, llama, antihype, llama-cpp

llama.py

llama.py is a fork of llama.cpp that provides Python bindings to its inference runtime for the LLaMA model in pure C/C++.

Description

The main goal is to run the model using 4-bit quantization on a laptop.

  • Plain C/C++ implementation without dependencies.
  • Apple silicon as a first-class citizen, optimized via ARM NEON.
  • AVX2 support for x86 architectures.
  • Mixed F16 / F32 precision.
  • 4-bit quantization support.
  • Runs on the CPU.
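The 4-bit quantization works block-wise: weights are split into small blocks, and each block is stored as 4-bit integers plus one floating-point scale. The sketch below illustrates the idea in plain Python; it is not ggml's exact q4_0 layout, and the block size and rounding scheme here are assumptions for illustration only.

```python
def quantize_q4(xs, block=32):
    """Symmetric 4-bit quantization with one float scale per block.

    Illustrative only: real ggml packs two 4-bit values per byte and
    uses a specific on-disk layout.
    """
    out = []
    for i in range(0, len(xs), block):
        chunk = xs[i:i + block]
        amax = max(abs(x) for x in chunk) or 1.0
        scale = amax / 7.0  # map the block into the integer range [-7, 7]
        qs = [max(-8, min(7, round(x / scale))) for x in chunk]
        out.append((scale, qs))
    return out

def dequantize_q4(blocks):
    """Reconstruct approximate floats from (scale, ints) blocks."""
    return [scale * q for scale, qs in blocks for q in qs]
```

The round trip loses at most half a scale step per weight, which is why per-block scales (rather than one global scale) keep the error tolerable.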

Usage

Build instructions follow.

cmake -S . -B build/release
cmake --build build/release
ln -s build/release/llama/cc/_llama.cpython-310-x86_64-linux-gnu.so llama

Obtain the original LLaMA model weights and place them in the data/model directory.

python -m llama pull -m data/model/7B -s 7B

Once the model weights are fetched, the directory structure should look like this.

data/model
├── 7B
│   ├── checklist.chk
│   ├── consolidated.00.pth
│   └── params.json
├── tokenizer_checklist.chk
└── tokenizer.model
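After fetching, one can sanity-check the layout before converting. The helper below is hypothetical (not part of llama.py) and simply reports which of the expected files are missing under the model root:

```python
from pathlib import Path

# Expected files relative to the model root, per the tree above.
EXPECTED = [
    '7B/checklist.chk',
    '7B/consolidated.00.pth',
    '7B/params.json',
    'tokenizer_checklist.chk',
    'tokenizer.model',
]

def missing_files(root):
    """Return the expected model files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

`missing_files('data/model')` returning an empty list means the layout matches the tree above.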

Then one should convert the 7B model to the ggml FP16 format.

python -m llama convert data/model/7B

Then quantize the model to 4 bits.

python -m llama quantize data/model/7B

Then one can start a Python interpreter and play with the raw bindings.

from llama._llama import *

nothreads = 8
model = LLaMA.load('./data/model/7B/ggml-model-q4_0.bin', 512, GGMLType.F32)
mem_per_token = model.estimate_mem_per_token(nothreads)

context = [...]  # token ids of the prompt
logits = model.apply(context, len(context), mem_per_token, nothreads)
token_id = sample_next_token(context, logits)

tokenizer = model.get_tokenizer()
tokenizer.decode(token_id)
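The sample_next_token call above hides the sampling step. For intuition, here is a self-contained sketch of temperature-based softmax sampling over raw logits; it is a plain-Python illustration, not the binding's actual implementation:

```python
import math
import random

def softmax_sample(logits, temperature=0.8, rng=random):
    """Sample a token id from logits after temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw from the categorical distribution via the inverse CDF.
    r = rng.random()
    acc = 0.0
    for token_id, p in enumerate(probs):
        acc += p
        if r < acc:
            return token_id
    return len(probs) - 1  # guard against floating-point rounding
```

Lower temperatures sharpen the distribution toward the arg-max token; higher temperatures flatten it and increase diversity.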

Alternatively, run the CLI interface.

Memory/Disk Requirements

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

model original size quantized size (4-bit)
7B 13 GB 3.9 GB
13B 24 GB 7.8 GB
30B 60 GB 19.5 GB
65B 120 GB 38.5 GB
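The 4-bit figures follow roughly from the parameter counts: about half a byte per weight plus per-block scale overhead. A back-of-the-envelope estimate, where the 4.5 bits-per-weight factor is an assumption for illustration rather than ggml's exact layout:

```python
def q4_size_gb(n_params, bits_per_weight=4.5):
    """Rough quantized model size: 4 payload bits plus scale overhead."""
    return n_params * bits_per_weight / 8 / 1e9

print(round(q4_size_gb(7e9), 1))  # 7B -> 3.9, matching the table above
```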

Contributors

ameobea, antimatter15, anzz1, beiller, bengarney, bernatvadell, bitrake, blackhole89, daskol, eiz, etra0, fabio3rs, ggerganov, glinscott, green-sky, hoangmit, j3k0, jcelerier, jooray, kallisti5, kevlo, marckohlbrugge, mqy, prusnak, razelighter777, ronsor, siraben, sw, tiendung, tjohnman

