Distributed Llama



Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage. This project proves that it's possible to split the workload of LLMs across multiple devices and achieve a significant speedup. Distributed Llama allows you to run huge LLMs in-house. The project uses TCP sockets to synchronize the state, so you can easily configure your AI cluster with a home router.

Distributed Llama running on 8 Raspberry Pi 4B devices

This project was initiated based on the llama2.c repository. Big thanks to @karpathy and other contributors. Most ARM optimizations come from the llama.cpp project.

πŸ“ƒ Read the report

Known limitations

  • This project is a proof of concept; it is not optimized for production use.
  • You can run Distributed Llama only on 1, 2, 4... 2^n devices (see the sketch after this list).
  • The project supports only inference mode; chat mode is not supported.
  • Optimized for (weights format Γ— buffer format):
    • ARM CPUs
      • βœ… F32 Γ— F32
      • ❌ F16 Γ— F32
      • ❌ Q40 Γ— F32
      • βœ… Q40 Γ— Q80
    • x86_64 AVX2 CPUs
      • ❌ F32 Γ— F32
      • ❌ F16 Γ— F32
      • ❌ Q40 Γ— F32
      • ⚠️ Q40 Γ— Q80 (partial optimization)
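
The device-count constraint from the list above is easy to check programmatically. A minimal standalone helper (not part of the project):

# Distributed Llama requires the total device count (root + workers)
# to be a power of two: 1, 2, 4, 8, ...
def is_valid_device_count(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

print([n for n in range(1, 17) if is_valid_device_count(n)])  # [1, 2, 4, 8, 16]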

Supported models

  • Llama 2 7B
  • Llama 2 13B
  • Llama 2 70B
  • Llama 2 compatible models

Architecture
The project is split into two parts:

  • Root node - responsible for loading the model and weights and forwarding them to the workers. It also synchronizes the state of the neural network. The root node is itself a worker and processes its own slice of the neural network.
  • Worker node - processes its own slice of the neural network. It doesn't require any model-related configuration.

You always need the root node, and you can add 2^n - 1 worker nodes to speed up inference. The RAM usage of the neural network is split across all nodes; the root node requires a bit more RAM than a worker node.
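
As a rough illustration of the split, here is a back-of-the-envelope sketch (the Q40 model sizes come from the conversion table later in this README; the estimate ignores the root node's extra overhead and per-node working buffers):

# Rough per-node RAM estimate: the weights are divided across all nodes.
# Q40 sizes are taken from the conversion table in this README.
model_sizes_gb = {"Llama 2 7B": 3.95, "Llama 2 13B": 7.35, "Llama 2 70B": 36.98}

for model, size_gb in model_sizes_gb.items():
    for nodes in (1, 2, 4, 8):
        print(f"{model}, {nodes} node(s): ~{size_gb / nodes:.2f} GB of weights per node")

This lines up with the measurements below: Llama 2 13B (7.35 GB of Q40 weights) doesn't fit on a single 8 GB Raspberry Pi, but runs on two.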

πŸ“Š Measurements

Average Single Token Generation Time

All tests below utilized Q40 weights and a Q80 buffer. The generation time encompasses the inference time, network transfer time, sampling time, and multi-thread synchronization time. Number of samples: 16.

Raspberry Pi 4B 8 GB

8 x Raspberry Pi 4B 8GB

All Raspberry Pi units were connected via Gigabit Ethernet to the TP-Link LS1008G Switch.

| Model | 1 x RasPi 4B 8 GB | 2 x RasPi 4B 8 GB | 4 x RasPi 4B 8 GB | 8 x RasPi 4B 8 GB |
|---|---|---|---|---|
| Llama 2 7B | 1312.50 ms (I: 1307.94 ms, T: 1.81 ms) | 793.69 ms (I: 739.00 ms, T: 52.50 ms) | 494.00 ms 🔥 (I: 458.81 ms, T: 34.06 ms) | 588.19 ms (I: 296.69 ms, T: 289.75 ms) |
| Llama 2 13B | Not enough RAM | 1497.19 ms (I: 1465.06 ms, T: 30.88 ms) | 848.19 ms 🔥 (I: 746.88 ms, T: 99.50 ms) | 1114.88 ms (I: 460.8 ms, T: 652.88 ms) |
| Llama 2 70B | Not enough RAM | Not enough RAM | Not enough RAM | 4842.81 ms 🔥 (I: 2121.94 ms, T: 2719.62 ms) |

I - inference time of the root node, T - network transfer time

x86_64 CPU Cloud Server

All tests below were conducted on c3d-highcpu-30 (30 vCPU, 15 core, 59 GB memory) VMs in Google Cloud. More details.

| Model | 1 x VM | 2 x VM | 4 x VM |
|---|---|---|---|
| Llama 2 7B | 101.81 ms (I: 101.06 ms, T: 0.19 ms) | 69.69 ms (I: 61.50 ms, T: 7.62 ms) | 53.69 ms 🔥 (I: 40.25 ms, T: 12.81 ms) |
| Llama 2 13B | 184.19 ms (I: 182.88 ms, T: 0.69 ms) | 115.38 ms (I: 107.12 ms, T: 7.81 ms) | 86.81 ms 🔥 (I: 66.25 ms, T: 19.94 ms) |
| Llama 2 70B | 909.69 ms (I: 907.25 ms, T: 1.75 ms) | 501.38 ms (I: 475.50 ms, T: 25.00 ms) | 293.06 ms 🔥 (I: 264.00 ms, T: 28.50 ms) |

I - inference time of the root node, T - network transfer time

Network Transfer for Generating Single Token

F32 Buffer

| Model | 2 devices | 4 devices | 8 devices |
|---|---|---|---|
| Llama 2 7B | 4192 kB (S: 2224 kB, R: 1968 kB) | 10656 kB (S: 7704 kB, R: 2952 kB) | 22624 kB (S: 19180 kB, R: 3444 kB) |
| Llama 2 13B | 6560 kB (S: 3480 kB, R: 3080 kB) | 16680 kB (S: 12060 kB, R: 4620 kB) | 35420 kB (S: 30030 kB, R: 5390 kB) |
| Llama 2 70B | | | |

S - sent data from the root node to workers, R - received data by the root node from workers

Q80 Buffer

| Model | 2 devices | 4 devices | 8 devices |
|---|---|---|---|
| Llama 2 7B | 1112 kB (S: 590 kB, R: 522 kB) | 2830 kB (S: 2046 kB, R: 784 kB) | 6008 kB (S: 5094 kB, R: 914 kB) |
| Llama 2 13B | 1742 kB (S: 924 kB, R: 818 kB) | 4430 kB (S: 3203 kB, R: 1227 kB) | 9407 kB (S: 7976 kB, R: 1431 kB) |
| Llama 2 70B | 5525 kB (S: 3230 kB, R: 2295 kB) | 14917 kB (S: 11475 kB, R: 3442 kB) | 32873 kB (S: 28857 kB, R: 4016 kB) |

S - sent data from the root node to workers, R - received data by the root node from workers

πŸ”¨ How to Convert Llama 2 Weights

  1. Download the Llama 2 weights from Meta. This project supports the 7B, 13B and 70B models. It doesn't support chat models.
  2. Open the llama-2-7b/params.json file and replace "vocab_size": -1 with "vocab_size": 32000 (or script it; see the sketch after these steps).
  3. Install the converter's dependencies:
cd converter && pip install -r requirements.txt
  4. Convert the weights to the Distributed Llama format. This will take some time. The script requires Python 3.
python convert-llama2.py /path/to/meta/llama-2-7b q40
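
Step 2 can also be scripted. A minimal sketch, assuming the weights live in llama-2-7b/ (adjust the path to your download):

import json

# Example path to the params.json that ships with the Meta weights.
params_path = "llama-2-7b/params.json"

with open(params_path) as f:
    params = json.load(f)

# Meta ships "vocab_size": -1; Distributed Llama expects the real value, 32000.
params["vocab_size"] = 32000

with open(params_path, "w") as f:
    json.dump(params, f, indent=2)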

In the table below, you can find the expected size of the converted weights with different floating-point types.

| Model | Original size | Float32 | Float16 | Q40 |
|---|---|---|---|---|
| Llama 2 7B | 13.48 GB | 25.10 GB | | 3.95 GB |
| Llama 2 13B | 26.03 GB | | | 7.35 GB |
| Llama 2 70B | 137.97 GB | | | 36.98 GB |
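
As a rough sanity check on the Q40 column, assuming Q40 packs 32 weights into 18 bytes (a 2-byte scale plus 16 bytes of 4-bit values, i.e. 4.5 bits per weight, as in llama.cpp's Q4_0; this block layout is an assumption), the estimates land near the table's figures, though not exactly, since the converter may keep some tensors in higher precision:

# Back-of-the-envelope Q40 size estimate: 18 bytes per 32-weight block.
# Parameter counts are approximate.
for name, n_params in (("7B", 6.74e9), ("13B", 13.02e9), ("70B", 68.98e9)):
    q40_gb = n_params * 18 / 32 / 1e9
    print(f"Llama 2 {name}: ~{q40_gb:.2f} GB")  # ~3.79, ~7.32, ~38.80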

πŸ”¨ How to Convert .bin Weights

You can convert weights compatible with llama2.c to the Distributed Llama format. The legacy converter converts weights only to Float32 format.

  1. Download the weights.
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
  2. Install the converter's dependencies:
cd converter && pip install -r requirements.txt
  3. Convert the weights to the Distributed Llama format.
python convert-legacy.py stories42M.bin true

πŸ“Ÿ How to Run on Raspberry Pi Devices

  1. Install Raspberry Pi OS Lite (64 bit) on your Raspberry Pi devices. This OS doesn't have a desktop environment.
  2. Connect all devices to the Gigabit switch.
  3. Connect to all devices via SSH.
  4. Install Git:
sudo apt install git
  5. Clone this repository:
git clone https://github.com/b4rtaz/distributed-llama.git
  6. Compile Distributed Llama:
make main
  7. Download the tokenizer.bin file from the llama2.c repository to the root device.
wget https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
  8. Transfer the converted weights to the root device.
  9. Optional: assign static IP addresses.
sudo ip addr add 10.0.0.1/24 dev eth0 # 1st device
sudo ip addr add 10.0.0.2/24 dev eth0 # 2nd device
  10. Run worker nodes on the worker devices:
sudo nice -n -20 ./main worker --port 9998 --nthreads 4
  11. Run the root node on the root device:
sudo nice -n -20 ./main inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998

To add more worker nodes, just add more addresses to the --workers argument.

./main inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998
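
Before starting the root node, it can help to verify that each worker is listening. A minimal sketch, assuming workers on the example addresses below (the probe opens and immediately closes a TCP connection, which the worker may log or reject; this script is not part of the project):

import socket

# Example worker addresses; replace with your own cluster's.
workers = [("10.0.0.2", 9998), ("10.0.0.3", 9998), ("10.0.0.4", 9998)]

for host, port in workers:
    try:
        with socket.create_connection((host, port), timeout=2.0):
            print(f"{host}:{port} reachable")
    except OSError as err:
        print(f"{host}:{port} unreachable: {err}")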

Share your results!

πŸ’» How to Run on MacOS or Linux

You need an x86_64 CPU with AVX2 or an ARM CPU. The instructions below are for Debian-based distributions, but you can easily adapt them to your distribution or macOS.
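
On Linux you can quickly confirm AVX2 support before compiling. A minimal sketch (it reads /proc/cpuinfo, so it doesn't apply to macOS):

# Check the CPU flags exposed by the Linux kernel for AVX2 support.
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()
print("AVX2:", "yes" if "avx2" in cpuinfo else "not found")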

  1. Install Git and G++:
sudo apt install git build-essential
  2. Clone this repository:
git clone https://github.com/b4rtaz/distributed-llama.git
  3. Compile Distributed Llama:
make main
  4. Download the tokenizer.bin file from the llama2.c repository.
wget https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
  5. Download the converted weights from your Google Drive. To get the file ID, share the file ("Anyone with the link") and copy the ID from the URL.
sudo apt install python3 python3-pip
pip install gdown
gdown https://drive.google.com/uc?id=<FILE_ID>
  6. Run worker nodes on the worker devices:
sudo nice -n -20 ./main worker --port 9998 --nthreads 4
  7. Run the root node on the root device:
sudo nice -n -20 ./main inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998

Share your results!

πŸ’‘ License

This project is released under the MIT license.

πŸ“– Citation

@misc{dllama,
  author = {BartΕ‚omiej Tadych},
  title = {Distributed Llama},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/b4rtaz/distributed-llama}},
  commit = {7eb77ca93ec0d502e28d36b6fb20039b449cbea4}
}
