
Distributed Llama


Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage. This project proves that it's possible to split the workload of LLMs across multiple devices and achieve a significant speedup. Distributed Llama allows you to run huge LLMs in-house. The project uses TCP sockets to synchronize the state. You can easily configure your AI cluster by using a home router.

Distributed Llama running Llama 2 70B on 8 Raspberry Pi 4B devices

Supported models:

  • Llama 2 (7B, 13B, 70B) chat and non-chat versions,
  • Llama 3,
  • Grok-1 (314B).

Known limitations:

  • You can run Distributed Llama only on 1, 2, 4... 2^n devices.
  • Optimized for (weights format × buffer format):
    • ARM CPUs
      • ✅ F32 × F32
      • ❌ F16 × F32
      • ❌ Q40 × F32
      • ✅ Q40 × Q80
    • x86_64 AVX2 CPUs
      • ❌ F32 × F32
      • ❌ F16 × F32
      • ❌ Q40 × F32
      • ⚠️ Q40 × Q80 (partial optimization)
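
In practice this means that Q40 weights with a Q80 buffer is the combination to use on both ARM and x86_64 AVX2 CPUs; it is selected with the --weights-float-type q40 --buffer-float-type q80 arguments shown in the run commands below.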

Architecture
The project is split up into two parts:

  • Root node - it's responsible for loading the model and weights and forwarding them to workers. It also synchronizes the state of the neural network. The root node is also a worker; it processes its own slice of the neural network.
  • Worker node - it processes its own slice of the neural network. It doesn't require any configuration related to the model.

You always need the root node and you can add 2^n - 1 worker nodes to speed up the inference. The RAM usage of the neural network is split up across all nodes. The root node requires a bit more RAM than worker nodes.
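
To illustrate what a "slice of the neural network" means, here is a minimal sketch assuming a simple row-wise split of each weight matrix; `matmulSlice` and its signature are hypothetical, not the project's actual API:

```cpp
// Illustrative sketch only, not code from this repository. Each node owns
// d / nSlices rows of a d x n weight matrix (in the real project a worker
// stores only its slice, which is what divides the RAM usage) and computes
// its part of the output; the root node gathers the slices over TCP.
#include <cassert>
#include <vector>

std::vector<float> matmulSlice(
    const std::vector<float>& weights, // full d x n matrix, row-major (for simplicity)
    const std::vector<float>& input,   // n input activations
    int n, int d, int nSlices, int sliceIndex) {
    assert(d % nSlices == 0); // why the node count must divide the model dimensions
    int rowsPerSlice = d / nSlices;
    std::vector<float> out(rowsPerSlice, 0.0f);
    for (int i = 0; i < rowsPerSlice; i++) {
        int row = sliceIndex * rowsPerSlice + i;
        for (int j = 0; j < n; j++)
            out[i] += weights[row * n + j] * input[j];
    }
    return out; // concatenating all nodes' slices yields the full d-sized output
}
```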

📊 Measurements

Average Single Token Generation Time

All tests below utilized Q40 weights and a Q80 buffer. The generation time encompasses the inference time, network transfer time, sampling time, and multi-thread synchronization time. Number of samples: 16.

Raspberry Pi 4B 8 GB

8 x Raspberry Pi 4B 8GB

All Raspberry Pi units were connected via Gigabit Ethernet to the TP-Link LS1008G Switch.

| Model | 1 x RasPi 4B 8 GB | 2 x RasPi 4B 8 GB | 4 x RasPi 4B 8 GB | 8 x RasPi 4B 8 GB |
|---|---|---|---|---|
| Llama 2 7B | 1312.50 ms (I: 1307.94 ms, T: 1.81 ms) | 793.69 ms (I: 739.00 ms, T: 52.50 ms) | 494.00 ms 🔥 (I: 458.81 ms, T: 34.06 ms) | 588.19 ms (I: 296.69 ms, T: 289.75 ms) |
| Llama 2 13B | Not enough RAM | 1497.19 ms (I: 1465.06 ms, T: 30.88 ms) | 848.19 ms 🔥 (I: 746.88 ms, T: 99.50 ms) | 1114.88 ms (I: 460.8 ms, T: 652.88 ms) |
| Llama 2 70B | Not enough RAM | Not enough RAM | Not enough RAM | 4842.81 ms 🔥 (I: 2121.94 ms, T: 2719.62 ms) |

I - inference time of the root node, T - network transfer time

Raspberry Pi 5 8GB

| Model | 1 x RasPi 5 8 GB |
|---|---|
| Llama 2 7B | 436.25 ms (I: 433.31 ms, T: 2.19 ms) by @segabor |

I - inference time of the root node, T - network transfer time

x86_64 CPU Cloud Server

All tests below were conducted on c3d-highcpu-30 (30 vCPU, 15 core, 59 GB memory) VMs in Google Cloud. More details.

| Model | 1 x VM | 2 x VM | 4 x VM |
|---|---|---|---|
| Llama 2 7B | 101.81 ms (I: 101.06 ms, T: 0.19 ms) | 69.69 ms (I: 61.50 ms, T: 7.62 ms) | 53.69 ms 🔥 (I: 40.25 ms, T: 12.81 ms) |
| Llama 2 13B | 184.19 ms (I: 182.88 ms, T: 0.69 ms) | 115.38 ms (I: 107.12 ms, T: 7.81 ms) | 86.81 ms 🔥 (I: 66.25 ms, T: 19.94 ms) |
| Llama 2 70B | 909.69 ms (I: 907.25 ms, T: 1.75 ms) | 501.38 ms (I: 475.50 ms, T: 25.00 ms) | 293.06 ms 🔥 (I: 264.00 ms, T: 28.50 ms) |

I - inference time of the root node, T - network transfer time

Network Transfer for Generating a Single Token

F32 Buffer

| Model | 2 devices | 4 devices | 8 devices |
|---|---|---|---|
| Llama 2 7B | 4192 kB (S: 2224 kB, R: 1968 kB) | 10656 kB (S: 7704 kB, R: 2952 kB) | 22624 kB (S: 19180 kB, R: 3444 kB) |
| Llama 2 13B | 6560 kB (S: 3480 kB, R: 3080 kB) | 16680 kB (S: 12060 kB, R: 4620 kB) | 35420 kB (S: 30030 kB, R: 5390 kB) |
| Llama 2 70B | | | |

S - sent data from the root node to workers, R - received data by the root node from workers

Q80 Buffer

| Model | 2 devices | 4 devices | 8 devices |
|---|---|---|---|
| Llama 2 7B | 1112 kB (S: 590 kB, R: 522 kB) | 2830 kB (S: 2046 kB, R: 784 kB) | 6008 kB (S: 5094 kB, R: 914 kB) |
| Llama 2 13B | 1742 kB (S: 924 kB, R: 818 kB) | 4430 kB (S: 3203 kB, R: 1227 kB) | 9407 kB (S: 7976 kB, R: 1431 kB) |
| Llama 2 70B | 5525 kB (S: 3230 kB, R: 2295 kB) | 14917 kB (S: 11475 kB, R: 3442 kB) | 32873 kB (S: 28857 kB, R: 4016 kB) |

S - sent data from the root node to workers, R - received data by the root node from workers
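
As a rough sanity check on the difference between the two buffer types: assuming Q80 follows the common Q8_0 layout of 32 int8 values plus a 16-bit scale per block (about 1.06 bytes per value) versus 4 bytes per value for F32, the expected reduction is roughly 4 / 1.06 ≈ 3.8x. This matches the tables above, e.g. Llama 2 7B on 2 devices drops from 4192 kB to 1112 kB (≈ 3.8x).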

Download Model and Run

📟 How to Run on Raspberry Pi Devices

  1. Install Raspberry Pi OS Lite (64 bit) on your Raspberry Pi devices. This OS doesn't have a desktop environment.
  2. Connect all devices to the Gigabit switch.
  3. Connect to all devices via SSH.
  4. Install Git:
sudo apt install git
  5. Clone this repository:
git clone https://github.com/b4rtaz/distributed-llama.git
  6. Compile Distributed Llama:
make main
  7. Transfer weights and the tokenizer file to the root device.
  8. Optional: assign static IP addresses.
sudo ip addr add 10.0.0.1/24 dev eth0 # 1st device
sudo ip addr add 10.0.0.2/24 dev eth0 # 2nd device
  9. Run worker nodes on the worker devices:
sudo nice -n -20 ./main worker --port 9998 --nthreads 4
  10. Run the root node on the root device:
sudo nice -n -20 ./main inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998

To add more worker nodes, just add more addresses to the --workers argument.

./main inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998

Share your results!

💻 How to Run on MacOS or Linux

You need to have an x86_64 AVX2 CPU or an ARM CPU. Different devices may have different CPUs. The instructions below are for Debian-based distributions, but you can easily adapt them to your distribution or macOS.

  1. Install Git and G++:
sudo apt install git build-essential
  2. Clone this repository:
git clone https://github.com/b4rtaz/distributed-llama.git
  3. Compile Distributed Llama:
make main
  4. Transfer weights and the tokenizer file to the root node.
  5. Run worker nodes on the worker devices:
sudo nice -n -20 ./main worker --port 9998 --nthreads 4
  6. Run the root node on the root device:
sudo nice -n -20 ./main inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
  7. To run the root node in chat mode:
sudo nice -n -20 ./main chat --model ../dllama_llama-2-7b-chat_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.0.1:9998

Share your results!

💡 License

This project is released under the MIT license.

📖 Citation

@misc{dllama,
  author = {Bartłomiej Tadych},
  title = {Distributed Llama},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/b4rtaz/distributed-llama}},
  commit = {7eb77ca93ec0d502e28d36b6fb20039b449cbea4}
}


distributed-llama's Issues

Can I use an Ollama model?

Sorry if I'm asking a stupid question. I'm a newbie LLM user and have been using Ollama, and I love its ease of use. However, I'm also limited by my hardware, so distributed-llama seems a promising solution for me. But I don't know how to use the models provided by Ollama. Is it feasible at all?

[Feature Suggestion] Tensor Parallelism for Accelerating LLM

Dear Author,

Your contribution is critical for the open-source community. The distributed-llama repo has implemented tensor parallelism from scratch, and the results are amazingly significant. However, there are still improvements that could be made. Because of my poor coding ability I am not able to make them myself, so I hope you can look at my suggestions below.

Challenge: root node's special task and synchronization

When I run version '0.1.0' of the repo, I find that the softmax operations in MultiHead are conducted on the root node only. This operation costs a significant portion of the total time. Second, the synFfnA and synFfn2 functions also cost a lot of time.

Mature solutions

In fact, these challenges have already been addressed in this paper: https://arxiv.org/abs/1909.08053. Its solution is shown in the image below:

[image: the paper's tensor-parallel scheme]

It performs the attention mechanism (softmax) on every worker. Second, two consecutive matrices are split along columns and then along rows, which reduces the cost to one synchronization operation instead of two.
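
A minimal sketch of this column/row split, assuming a simplified two-matmul feed-forward block (Llama's real FFN also has a gate projection); this is hypothetical code, not taken from the repository or the paper:

```cpp
// Hypothetical sketch of the Megatron-style split. W1 is sliced by rows and
// W2 by columns, so the activation can be applied locally on each node and
// only the final partial outputs need one all-reduce (sum) instead of two
// synchronization points.
#include <cmath>
#include <vector>

std::vector<float> ffnSlice(
    const std::vector<float>& x,  // dim inputs, replicated on every node
    const std::vector<float>& w1, // (hiddenDim / nSlices) x dim slice
    const std::vector<float>& w2, // dim x (hiddenDim / nSlices) slice
    int dim, int hiddenSlice) {
    std::vector<float> h(hiddenSlice, 0.0f);
    for (int i = 0; i < hiddenSlice; i++) {        // first matmul: local slice
        for (int j = 0; j < dim; j++)
            h[i] += w1[i * dim + j] * x[j];
        h[i] = h[i] / (1.0f + std::exp(-h[i]));    // SiLU applied locally, no sync needed
    }
    std::vector<float> partial(dim, 0.0f);
    for (int i = 0; i < dim; i++)                  // second matmul: local slice
        for (int j = 0; j < hiddenSlice; j++)
            partial[i] += w2[i * hiddenSlice + j] * h[j];
    return partial; // summing all nodes' partial vectors (one all-reduce) gives the FFN output
}
```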

If you are willing to make further improvements to the repo, the following tutorial shows a mature solution for every component of Llama 2 using tensor parallelism and sequence parallelism:
https://pytorch.org/tutorials/intermediate/TP_tutorial.html
However, it's implemented in Python, so you would be the first to implement the solution in C++.

Thanks for your contribution!!!
Best Regards

WebAssembly version

Is it in the scope of the project to eventually provide a WebAssembly version?

Compilation error related to a missing <ctime> include

I got this error while compiling:

src/socket.cpp:61:34: error: ‘time’ was not declared in this scope
   61 |                     time_t now = time(NULL);
      |                                  ^~~~
src/socket.cpp:12:1: note: ‘time’ is defined in header ‘<ctime>’; did you forget to ‘#include <ctime>’?
   11 | #include <stdexcept>
  +++ |+#include <ctime>
   12 | 

Adding #include <ctime> to src/socket.cpp fixed the compilation error for me. Please note that I do not code in C++, so please forgive any ignorance I have on this issue.
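
For reference, the fix is simply the include the compiler suggests, placed next to the existing includes near the top of src/socket.cpp:

```cpp
#include <ctime>     // declares time(); the missing include that caused the error
#include <stdexcept> // existing include (line 11 in the compiler output above)
```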

Master process crashes, running out of memory on an 8 GB RPi 5

I set up a single master-worker pair to experiment with Distributed Llama. The master is an RPi 5 with 8 GB RAM, and the only worker is an RPi 4 with the same memory size.
When I run the inference, the master crashes after a while with a segfault. The worker also quits due to a closed socket connection.
Any idea why? I tried with the smallest model, llama-2-7b.

Terminal capture from master

segabor@bigfive:~/src/distributed-llama $ ./run.sh 
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 2
./run.sh: line 9: 268004 Segmentation fault      sudo nice -n -20 ./main inference --model /mnt/data/llama-2-7b/dllama_llama-2-7b_q40.bin --tokenizer ./tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 30.0.0.12:9998

Worker capture:

^Csegabor@lohere:~/src/distributed-llama$ ./run.sh 
Listening on 0.0.0.0:9998...
Client connected
💡 sliceIndex: 1
💡 nSlices: 2
⏩ Received 56918016 bytes for block 0 (83092 kB/s)
⏩ Received 56918016 bytes for block 1 (112709 kB/s)
⏩ Received 56918016 bytes for block 2 (112486 kB/s)
⏩ Received 56918016 bytes for block 3 (112709 kB/s)
⏩ Received 56918016 bytes for block 4 (91069 kB/s)
⏩ Received 56918016 bytes for block 5 (114986 kB/s)
⏩ Received 56918016 bytes for block 6 (103865 kB/s)
⏩ Received 56918016 bytes for block 7 (106190 kB/s)
⏩ Received 56918016 bytes for block 8 (112709 kB/s)
⏩ Received 56918016 bytes for block 9 (63172 kB/s)
⏩ Received 56918016 bytes for block 10 (63172 kB/s)
⏩ Received 56918016 bytes for block 11 (63313 kB/s)
⏩ Received 56918016 bytes for block 12 (63313 kB/s)
⏩ Received 56918016 bytes for block 13 (63172 kB/s)
⏩ Received 56918016 bytes for block 14 (60810 kB/s)
⏩ Received 56918016 bytes for block 15 (64097 kB/s)
⏩ Received 56918016 bytes for block 16 (60551 kB/s)
⏩ Received 56918016 bytes for block 17 (60358 kB/s)
⏩ Received 56918016 bytes for block 18 (60423 kB/s)
⏩ Received 56918016 bytes for block 19 (61600 kB/s)
⏩ Received 56918016 bytes for block 20 (62205 kB/s)
⏩ Received 56918016 bytes for block 21 (61136 kB/s)
⏩ Received 56918016 bytes for block 22 (62138 kB/s)
⏩ Received 56918016 bytes for block 23 (64753 kB/s)
⏩ Received 56918016 bytes for block 24 (100208 kB/s)
⏩ Received 56918016 bytes for block 25 (112486 kB/s)
⏩ Received 56918016 bytes for block 26 (112486 kB/s)
⏩ Received 56918016 bytes for block 27 (114064 kB/s)
⏩ Received 56918016 bytes for block 28 (111823 kB/s)
⏩ Received 56918016 bytes for block 29 (111168 kB/s)
Error receiving data: socket closed

Turing RK1 compute module results

I promised to share results for the Turing RK1 module. It arrived yesterday, so I took the chance to run Distributed Llama on it.
Capability: 8 cores, 32 GB RAM. Storage: 1 TB NVMe SSD
OS: custom Ubuntu Server
Model: llama-2-7b

Command

sudo nice -n -20 ./main inference \
  --model /mnt/bigdata/llama-2-7b/dllama_llama-2-7b_q40.bin \
  --tokenizer ./tokenizer.bin \
  --weights-float-type q40 \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4

Result

💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G  372 ms I  372 ms T    0 ms S      0 kB R      0 kB Hello
🔶 G  378 ms I  378 ms T    0 ms S      0 kB R      0 kB  world
🔶 G  369 ms I  367 ms T    1 ms S      0 kB R      0 kB ,
🔶 G  379 ms I  379 ms T    0 ms S      0 kB R      0 kB  I
🔶 G  424 ms I  397 ms T   27 ms S      0 kB R      0 kB '
🔶 G  376 ms I  376 ms T    0 ms S      0 kB R      0 kB m
🔶 G  378 ms I  377 ms T    0 ms S      0 kB R      0 kB  E
🔶 G  407 ms I  407 ms T    0 ms S      0 kB R      0 kB .
🔶 G  383 ms I  380 ms T    0 ms S      0 kB R      0 kB  січня
🔶 G  372 ms I  371 ms T    1 ms S      0 kB R      0 kB  
🔶 G  379 ms I  378 ms T    0 ms S      0 kB R      0 kB 2
🔶 G  374 ms I  373 ms T    0 ms S      0 kB R      0 kB 0
🔶 G  382 ms I  381 ms T    0 ms S      0 kB R      0 kB 1
🔶 G  375 ms I  373 ms T    2 ms S      0 kB R      0 kB 8
🔶 G  378 ms I  377 ms T    1 ms S      0 kB R      0 kB  at
🔶 G  382 ms I  382 ms T    0 ms S      0 kB R      0 kB  
Generated tokens:    16
Avg generation time: 381.75 ms
Avg inference time:  379.25 ms
Avg transfer time:   2.00 ms

Assertion `d % nSlices == 0' failed.

I'm running inference on a q40 weight of llama-3-70b-instruct across 3 x86_64 machines, and I'm getting this on my root node:
main: src/transformer.cpp:17: MatmulSlice::MatmulSlice(FloatType, int, int, int): Assertion `d % nSlices == 0' failed.

Any suggestions?

Need help setting up all the devices

Hi Mr Bart,

Your distributed-llama is great! However, are there any clear instructions to set up the whole environment from scratch? I'm interested in your distributed-llama, but I lack knowledge of Raspberry Pi devices, so I don't even know how to connect the devices to my PC or install the dependencies on my PC or the Raspberry Pi devices. Could you please help me with it? Thank you so much!

Support for Hugging Face models

When downloading the model from the Meta website using the email URL, I often get network problems (403 Forbidden). Would it be possible to support Hugging Face models? Thanks!
