This project forked from hjchen-thu/codebear

Codebear: A fast and memory efficient code completion system based on CodeLlama

Codebear

This repository combines GPTQ 4-bit quantization and speculative decoding to accelerate Large Language Model (LLM) inference for code completion in personal-use scenarios, where GPU resources are limited but users still want the better quality of larger models at a usable speed.

GPTQ is a one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. Speculative decoding is a sampling strategy in which a small approximation model proposes sequences of tokens that a larger model then checks.
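The propose-then-check idea can be illustrated with a minimal sketch of the accept/reject rule from speculative sampling, applied to a single token position with toy probability lists. This is not the repository's implementation: the names `speculative_step`, `draft_probs`, and `target_probs` are hypothetical, and real models would supply the two distributions.

```python
import random

def speculative_step(draft_probs, target_probs, rng):
    """One accept/reject step for a single position.

    draft_probs (q) comes from the small model, target_probs (p) from the
    large model; both are lists over the same toy vocabulary.
    Returns (token, accepted).
    """
    vocab = range(len(draft_probs))
    # 1. The small model proposes a token x ~ q.
    x = rng.choices(vocab, weights=draft_probs, k=1)[0]
    # 2. The large model accepts it with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, target_probs[x] / draft_probs[x]):
        return x, True
    # 3. On rejection, resample from the residual max(0, p - q);
    #    rng.choices normalises the weights itself.
    residual = [max(0.0, p - q) for p, q in zip(target_probs, draft_probs)]
    x = rng.choices(vocab, weights=residual, k=1)[0]
    return x, False

# Toy check: the emitted tokens should follow the target distribution p,
# even though every proposal is drawn from q.
rng = random.Random(0)
q = [0.5, 0.3, 0.2]
p = [0.2, 0.3, 0.5]
trials = 20000
counts = [0] * len(p)
for _ in range(trials):
    token, _accepted = speculative_step(q, p, rng)
    counts[token] += 1
freqs = [c / trials for c in counts]
```

The point of the rule is that the output distribution matches the large model's, so the small model only affects speed, not what gets sampled.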

By combining these two techniques, one can deploy multiple LLMs on a single GPU with limited HBM. The larger model brings better completion quality, and speculative decoding recovers part of the speed it costs (9.4 vs. 7.9 tokens/sec for the 4-bit 34B model in the measurements below).
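A back-of-the-envelope sketch of why two 4-bit models can share one GPU is given here. It is an estimate under stated assumptions, not a measurement: the function name `quantized_weight_gb` is hypothetical, the parameter counts are the nominal 7e9 and 34e9, and each group of 128 weights is assumed to store one 16-bit scale and one 4-bit zero-point. It ignores layers kept in FP16, the KV cache, activations, and kernel workspace, all of which add to the real footprint.

```python
def quantized_weight_gb(n_params, bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Approximate size of group-quantized weights in GB (1 GB = 1e9 bytes)."""
    weight_bits = n_params * bits
    # Assumed per-group metadata: one scale and one zero-point per group.
    meta_bits = (n_params / group_size) * (scale_bits + zero_bits)
    return (weight_bits + meta_bits) / 8 / 1e9

small_4bit = quantized_weight_gb(7e9)    # nominal 7B model
large_4bit = quantized_weight_gb(34e9)   # nominal 34B model
small_fp16 = 7e9 * 16 / 8 / 1e9          # same 7B model in FP16, for comparison
both_4bit = small_4bit + large_4bit

print(f"7B  4-bit weights: {small_4bit:.2f} GB")
print(f"34B 4-bit weights: {large_4bit:.2f} GB")
print(f"7B  FP16 weights:  {small_fp16:.2f} GB")
print(f"7B + 34B 4-bit:    {both_4bit:.2f} GB")
```

Under these assumptions the weights of both 4-bit models together come to roughly 21 GB, which leaves headroom on a 32 GB card for the items the estimate leaves out.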

The following figures were measured on a single V100 (32GB) running the CodeLlama-34B and CodeLlama-7B models with the Triton-based QuantLinear backend.

Setting: 3 prefill + 200th decoding

Memory Usage (GB)            27.7

Model                        Inference Speed (Tokens/sec)
CodeLlama 7B (FP16)          14.3
CodeLlama 7B (4Bit)          34.1
CodeLlama 34B (4Bit)         7.9
Speculative 7B+34B (4Bit)    9.4
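For a quick reading of the throughput figures, the short sketch below turns them into relative speed-ups. The numbers are copied from the table; the dictionary layout and the helper name `speedup` are only illustrative.

```python
# Tokens/sec as reported in the table above.
speeds = {
    "CodeLlama 7B (FP16)": 14.3,
    "CodeLlama 7B (4Bit)": 34.1,
    "CodeLlama 34B (4Bit)": 7.9,
    "Speculative 7B+34B (4Bit)": 9.4,
}

def speedup(candidate, baseline):
    """Ratio of candidate throughput to baseline throughput."""
    return speeds[candidate] / speeds[baseline]

spec_vs_34b = speedup("Speculative 7B+34B (4Bit)", "CodeLlama 34B (4Bit)")
quant_vs_fp16 = speedup("CodeLlama 7B (4Bit)", "CodeLlama 7B (FP16)")

print(f"speculative vs 34B 4-bit: {spec_vs_34b:.2f}x")
print(f"7B 4-bit vs 7B FP16:      {quant_vs_fp16:.2f}x")
```

On these measurements, speculative decoding is about 1.19x faster than running the 4-bit 34B model alone, and 4-bit quantization makes the 7B model about 2.38x faster than FP16.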

Update

Date          Content
2024-03-10    Fused MLP Triton kernel

Acknowledgement

  • Special thanks to feifeibear for releasing an implementation of speculative decoding covering both Google's and DeepMind's versions (LLMSpeculativeSampling).
  • Special thanks to the AutoGPTQ team for implementing the GPTQ algorithm and open-sourcing the code.

Quick Tour

Requirements

triton==2.1.0
auto_gptq==0.7.0
transformers==4.37.2

Step1: Quantize

Download the float models from the official release (CodeLlama-7B and CodeLlama-34B), then quantize them.

# quantize the 7B model
./make_quant.sh -f /PATH/TO/7B/FLOAT/MODEL -q /PATH/TO/7B/QUANT/MODEL
# quantize the 34B model
./make_quant.sh -f /PATH/TO/34B/FLOAT/MODEL -q /PATH/TO/34B/QUANT/MODEL

Alternatively, download the 4-bit quantized models from my Hugging Face page (CodeLlama-7B-4bit and CodeLlama-34B-4bit).

The default quantization config is bits = 4, group_num = 128 (can be changed in ./scripts/quantize.py).
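To make the two settings concrete, here is a minimal sketch of group-wise 4-bit quantization. It assumes that 128 is the number of weights sharing one scale and zero-point, and it uses plain round-to-nearest rather than GPTQ's second-order error compensation, so it shows only the storage format and not the algorithm behind `make_quant.sh`. All function names are hypothetical.

```python
import math

def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = 2 ** bits - 1                    # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0         # fall back to 1.0 when the group is constant
    zero = round(-lo / scale)               # integer code that represents 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]

def quantize_row(row, bits=4, group_size=128):
    """Split a weight row into groups and quantize each one independently."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

# Toy weight row of 256 values, giving two groups of 128.
row = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_row(row)
restored = [w for codes, scale, zero in groups
            for w in dequantize_group(codes, scale, zero)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Smaller groups track the local weight range more closely at the cost of storing more scales and zero-points, which is the trade-off the group setting controls.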

Step2: Serving

Start serving

./start_server.sh -s /PATH/TO/7B/QUANT/MODEL -l /PATH/TO/34B/QUANT/MODEL -t /PATH/TO/7B/FLOAT/MODEL

Default sampling params are set to max_tokens = 200, top_k = 10, top_p = 0.9 (can be changed in ./scripts/serving.py).
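As an illustration of what top_k and top_p do, the sketch below filters a toy next-token distribution before sampling. It assumes top-k is applied first and top-p (nucleus) filtering second; the order and details in ./scripts/serving.py may differ, and the function names are hypothetical.

```python
import random

def top_k_top_p_filter(probs, top_k=10, top_p=0.9):
    """Return {token_id: renormalised prob} after top-k, then top-p filtering."""
    # Keep the top_k most likely tokens, most likely first.
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:   # smallest prefix whose mass reaches top_p
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}

def sample(probs, rng, top_k=10, top_p=0.9):
    """Draw one token id from the filtered distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

# Toy distribution over five tokens.
probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
nucleus = top_k_top_p_filter(probs)             # defaults: top_k=10, top_p=0.9
narrow = top_k_top_p_filter(probs, top_k=2)     # top-k cuts first
token = sample(probs, random.Random(0))
```

With the defaults, the toy distribution keeps the four most likely tokens, because the first three only cover 0.875 of the probability mass.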

Send a request (the model is trained specifically for Python code completion)

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "def quicksort("}' http://127.0.0.1:5000/codebear
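The same request can be sent from Python with the standard library. This is a sketch mirroring the curl command above: the URL, header, and JSON body are taken from it, while the helper names `build_request` and `complete` and the 60-second timeout are hypothetical. The response format is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request

def build_request(prompt, url="http://127.0.0.1:5000/codebear"):
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, timeout=60):
    """Send the prompt to a running server and return the raw response text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")

# Building the request needs no server; complete() needs one started in Step2.
request = build_request("def quicksort(")
```

Calling `complete("def quicksort(")` only works while the server from Step2 is running on port 5000.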

Future Plans

Item                                          Progress
fused_flash_attn_MHA Triton implementation    todo
fused_flash_attn_GQA Triton implementation    todo
INT8 KV cache                                 todo

References

@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers}, 
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2022},
  journal={arXiv preprint arXiv:2210.17323}
}

@inproceedings{leviathan2023fast,
  title={Fast inference from transformers via speculative decoding},
  author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},
  booktitle={International Conference on Machine Learning},
  pages={19274--19286},
  year={2023},
  organization={PMLR}
}

Contributors

hjchen-thu, runner-2019
