
qeternity / aphrodite-engine

This project forked from pygmalionai/aphrodite-engine


PygmalionAI's large-scale inference engine

Home Page: https://pygmalion.chat

License: GNU Affero General Public License v3.0

Languages: Python 69.40%, Cuda 28.62%, C++ 1.01%, Shell 0.59%, C 0.30%, Dockerfile 0.07%

aphrodite-engine's Introduction

Breathing Life into Language


Aphrodite is the official backend engine for PygmalionAI. It is designed to serve as the inference endpoint for the PygmalionAI website, and to allow serving the Pygmalion models to a large number of users with blazing fast speeds (thanks to FasterTransformer and vLLM).

Aphrodite builds upon and integrates the exceptional work from various projects.

The compute necessary for Aphrodite's development is provided by Arc Compute.

Features

  • Continuous Batching
  • Efficient K/V management with PagedAttention
  • Optimized CUDA kernels for improved inference
  • Quantization support via GPTQ, AWQ, and SqueezeLLM
  • Distributed inference
  • Variety of sampling methods (Mirostat, Locally Typical Sampling, Tail-Free Sampling, etc.)
  • 8-bit KV Cache for higher context lengths and throughput

Quickstart

pip install aphrodite-engine

python -m aphrodite.endpoints.openai.api_server --model PygmalionAI/pygmalion-2-7b

This will create an OpenAI-compatible API server that can be accessed on port 2242 of localhost. You can plug the API into a UI that supports Kobold, such as SillyTavern.
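Since the server is OpenAI-compatible, you can also query it directly over HTTP. Below is a minimal sketch using Python's requests library; the port (2242) and model name come from the Quickstart above, while the /v1/completions route and response fields are assumed to follow the standard OpenAI completions format.

# Minimal sketch of querying the OpenAI-compatible server started above.
# Assumes the default port 2242 and the standard /v1/completions route.
import requests

response = requests.post(
    "http://localhost:2242/v1/completions",
    json={
        "model": "PygmalionAI/pygmalion-2-7b",
        "prompt": "Aphrodite is",
        "max_tokens": 64,
        "temperature": 0.8,
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])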

Performance

Speeds vary with different GPUs, model sizes, quantization schemes, batch sizes, etc. Here are some baseline benchmarks, conducted by requesting as many completions as possible from the API server (a minimal sketch of this benchmarking approach follows the table below). Keep in mind that these are the theoretical peak throughputs with parallel decoding, with as high a batch size as possible. Per-request generation speed is a fraction of this, at 30-40 t/s.

Note

16-bit models can achieve much higher throughput if they have access to more VRAM, either by using larger GPUs or via tensor parallelism over multiple GPUs. The numbers below are purely for output tokens.

Model       Quantization  GPU       Throughput (output t/s)
Llama-2 7B  None          RTX 4090  2576.2
Llama-2 7B  AWQ           RTX 4090  3551.3
Llama-2 7B  GPTQ          RTX 4090  2919.1
Llama-2 7B  SqueezeLLM    RTX 4090  580.3
Mistral 7B  None          RTX 4090  5489.3
Mistral 7B  AWQ           RTX 4090  4078.8
Mistral 7B  GPTQ          RTX 4090  4516.2
Mistral 7B  SqueezeLLM    RTX 4090  549.5
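
The benchmarking approach described above can be sketched as follows: fire many completion requests at a running server in parallel and divide the total number of generated tokens by the wall-clock time. The endpoint, port, and usage fields below are assumptions based on the OpenAI-compatible API from the Quickstart; this is an illustrative sketch, not the project's official benchmark script.

# Rough throughput measurement against a running Aphrodite OpenAI-compatible server.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:2242/v1/completions"  # default port from the Quickstart
MODEL = "PygmalionAI/pygmalion-2-7b"
PROMPT = "Once upon a time"
MAX_TOKENS = 256
NUM_REQUESTS = 64  # saturate the server to approach peak batched throughput


def one_completion(_):
    resp = requests.post(
        API_URL,
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": MAX_TOKENS},
        timeout=600,
    )
    resp.raise_for_status()
    # Assumes the server reports generated token counts in usage.completion_tokens,
    # as OpenAI-compatible servers typically do.
    return resp.json()["usage"]["completion_tokens"]


start = time.time()
with ThreadPoolExecutor(max_workers=NUM_REQUESTS) as pool:
    token_counts = list(pool.map(one_completion, range(NUM_REQUESTS)))
elapsed = time.time() - start

total_tokens = sum(token_counts)
print(f"Generated {total_tokens} output tokens in {elapsed:.1f}s "
      f"({total_tokens / elapsed:.1f} t/s)")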

Requirements

  • Operating System: Linux (or WSL for Windows)
  • Python: at least 3.8

Build Requirements:

  • CUDA >=12

For supported GPUs, see here.

Installation

Usage

For usage, please refer to the wiki page for detailed instructions. Aphrodite provides many different options for LLM inference, so please read through the list of options here.

Notes

  1. By design, Aphrodite takes up 90% of your GPU's VRAM. If you're not serving an LLM at scale, you may want to limit the amount of memory it takes up. You can do this in the API example by launching the server with the flag --gpu-memory-utilization 0.6 (0.6 means 60%); see the example after this list.

  2. You can view the full list of commands by running python -m aphrodite.endpoints.openai.api_server --help.

  3. Context Length extension via the RoPE method is supported for most models. Use the command-line flag --max-model-len to specify a desired context length and the engine will adjust the RoPE scaling accordingly.

  4. Please refer to the FAQ & Issues if you run into problems. If you don't find an answer there, please make an issue.
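
For example, the flags from notes 1 and 3 can be combined in a single launch command (the context length of 8192 here is only an illustrative value):

python -m aphrodite.endpoints.openai.api_server --model PygmalionAI/pygmalion-2-7b --gpu-memory-utilization 0.6 --max-model-len 8192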

Acknowledgements

Aphrodite Engine would not have been possible without the phenomenal work of other open-source projects. Credits go to:

Contributing

Everyone is welcome to contribute. You can support the project by opening Pull Requests for new features, fixes, or general UX improvements.

aphrodite-engine's People

Contributors

alpindale, stefangliga, g4rg, 50h100a, city-unit, teargosling, karakarawitch, krisseck, recoveredapparatus, lostruins, n-galrion, henk717, miku448, official-elinas, sandwichdoge
