Comments (15)
- Could you run a single instance on your RPi 4?
- Could you run a single instance on your RPi 5?
Could you confirm the size of your weights file?
```
b4rtaz@b4rtazs-MacBook-Pro converter % ls -l
total 267075104
drwxr-xr-x@ 3 b4rtaz staff 96 Dec 9 00:40 __pycache__
-rw-r--r--@ 1 b4rtaz staff 6310 Jan 7 22:12 converter.py
-rw-r--r--@ 1 b4rtaz staff 7887097884 Jan 8 13:09 dllama_llama-2-13b_q40.bin
-rw-r--r--@ 1 b4rtaz staff 39706066972 Jan 8 01:05 dllama_llama-2-70b_q40.bin
-rw-r--r--@ 1 b4rtaz staff 4242882588 Jan 7 22:23 dllama_llama-2-7b_q40.bin
...
```
In your root node's logs I don't see this part:
```
...
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
```
So it looks like your weights file doesn't contain all the bytes.
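(Not from the original thread: a minimal sanity check one could run, assuming the expected q40 sizes from the listing above; the path is a placeholder.)

```python
import os

# Expected q40 file sizes in bytes, copied from the ls -l listing above.
EXPECTED = {
    "dllama_llama-2-7b_q40.bin": 4242882588,
    "dllama_llama-2-13b_q40.bin": 7887097884,
    "dllama_llama-2-70b_q40.bin": 39706066972,
}

path = "dllama_llama-2-7b_q40.bin"  # placeholder: point this at your file
actual = os.path.getsize(path)
expected = EXPECTED[os.path.basename(path)]
print(f"{actual} bytes on disk, expected {expected} (diff {actual - expected})")
```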
Here's the listing of the Llama 2 7B model and the converted weights file:
```
segabor@bigfive:~/src/distributed-llama $ ls -l /mnt/data/llama-2-7b/
total 17024788
-rw-r--r-- 1 segabor segabor 100 Jan 25 07:09 checklist.chk
-rw-r--r-- 1 segabor segabor 13476925163 Jan 25 07:09 consolidated.00.pth
-rw-r--r-- 1 segabor segabor 3956441088 Jan 25 13:26 dllama_llama-2-7b_q40.bin
-rw-r--r-- 1 segabor segabor 105 Jan 25 09:46 params.json
```
Thanks! Apparently the size of the weights doesn't match the right file on your list. It's the 70B one!
I'm going to close this ticket, no error!
Cool! The performance seems slightly poor though; a single RasPi 4B reaches 1312.50 ms per token. Have you started the inference with `--nthreads 4`?
I think your results are correct. The problem here is that you're using devices with different processor speeds.
Basically, your results are limited by the slowest device. The CM4 is basically a RasPi 4 (1.5 GHz), and in my tests I got 793.69 ms for 2x RasPi 4B, so you have almost the same result. Distributed Llama doesn't split the load depending on processor speed.
You should observe much better results if you used 2x RasPi 5.
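(An aside, not from the thread: a toy illustration of why the slowest device sets the pace when the load is split evenly; the speed numbers are placeholders.)

```python
# Toy model: with an even split, every node gets the same share of work,
# so each token waits for the slowest node to finish its share.
ops_per_token = 1000                     # arbitrary unit of work
speeds = {"RPi 5": 2.4, "CM4": 1.5}      # placeholder clock speeds in GHz

share = ops_per_token / len(speeds)      # even split; speed is ignored
step_time = max(share / ghz for ghz in speeds.values())
print(f"Per-token time is set by the slowest node: {step_time:.0f} units")
```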
Both crashed with SIGSEGV, indicating they ran out of memory.
The command I used on both devices:
```
sudo nice -n -20 ./main inference \
  --model ./dllama_llama-2-7b_q40.bin \
  --tokenizer ./tokenizer.bin \
  --weights-float-type q40 \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4
```
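(A rough estimate of mine, not from the thread: using the model dimensions from the logs in this thread and assuming a float32 KV cache, the memory needed exceeds what a 4 GB board offers.)

```python
# Back-of-the-envelope memory estimate for Llama 2 7B q40 on one node.
# Assumption (mine): the KV cache is kept in float32 for the full context.
weights_bytes = 4242882588                 # size of dllama_llama-2-7b_q40.bin
n_layers, seq_len, dim = 32, 2048, 4096    # from the config log lines below
kv_cache_bytes = 2 * n_layers * seq_len * dim * 4   # K and V, 4 bytes each
total_gib = (weights_bytes + kv_cache_bytes) / 2**30
print(f"~{total_gib:.1f} GiB before activations")    # ~6.0 GiB
```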
I've run the conversion again and it fixed the single-node run.
The latest run on my RPi 5 looks like this:
```
segabor@bigfive:~/src/distributed-llama $ ./run_single.sh
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 2460 ms I 2458 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 2409 ms I 2409 ms T 0 ms S 0 kB R 0 kB world
🔶 G 2398 ms I 2397 ms T 0 ms S 0 kB R 0 kB ,
🔶 G 2400 ms I 2399 ms T 0 ms S 0 kB R 0 kB I
🔶 G 2433 ms I 2428 ms T 4 ms S 0 kB R 0 kB '
🔶 G 2406 ms I 2405 ms T 0 ms S 0 kB R 0 kB m
🔶 G 2438 ms I 2432 ms T 4 ms S 0 kB R 0 kB new
🔶 G 2403 ms I 2402 ms T 0 ms S 0 kB R 0 kB to
🔶 G 2405 ms I 2404 ms T 0 ms S 0 kB R 0 kB this
🔶 G 2407 ms I 2406 ms T 0 ms S 0 kB R 0 kB and
🔶 G 2453 ms I 2452 ms T 0 ms S 0 kB R 0 kB have
🔶 G 2408 ms I 2407 ms T 0 ms S 0 kB R 0 kB a
🔶 G 2411 ms I 2410 ms T 0 ms S 0 kB R 0 kB question
🔶 G 2416 ms I 2415 ms T 0 ms S 0 kB R 0 kB for
🔶 G 2416 ms I 2415 ms T 0 ms S 0 kB R 0 kB you
🔶 G 2448 ms I 2447 ms T 0 ms S 0 kB R 0 kB .
Generated tokens: 16
Avg generation time: 2419.44 ms
Avg inference time: 2417.88 ms
Avg transfer time: 0.50 ms
```
@b4rtaz yes, threads are set to 4. But I realized the `main` binary was unoptimized. After recompiling with `-O3`, the single-node run performed as below:
```
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 420 ms I 418 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 495 ms I 491 ms T 4 ms S 0 kB R 0 kB world
🔶 G 410 ms I 407 ms T 2 ms S 0 kB R 0 kB !
🔶 G 414 ms I 413 ms T 0 ms S 0 kB R 0 kB The
🔶 G 410 ms I 409 ms T 0 ms S 0 kB R 0 kB new
🔶 G 453 ms I 444 ms T 8 ms S 0 kB R 0 kB year
🔶 G 414 ms I 412 ms T 1 ms S 0 kB R 0 kB is
🔶 G 447 ms I 442 ms T 4 ms S 0 kB R 0 kB upon
🔶 G 446 ms I 442 ms T 4 ms S 0 kB R 0 kB us
🔶 G 412 ms I 411 ms T 0 ms S 0 kB R 0 kB ,
🔶 G 448 ms I 444 ms T 4 ms S 0 kB R 0 kB and
🔶 G 413 ms I 412 ms T 0 ms S 0 kB R 0 kB as
🔶 G 449 ms I 448 ms T 0 ms S 0 kB R 0 kB always
🔶 G 452 ms I 448 ms T 4 ms S 0 kB R 0 kB ,
🔶 G 451 ms I 446 ms T 4 ms S 0 kB R 0 kB we
🔶 G 446 ms I 446 ms T 0 ms S 0 kB R 0 kB have
Generated tokens: 16
Avg generation time: 436.25 ms
Avg inference time: 433.31 ms
Avg transfer time: 2.19 ms
```
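(My own arithmetic, for reference: comparing the average generation times of the two runs above gives the speedup from the `-O3` rebuild.)

```python
# Avg generation times (ms) from the two single-node RPi 5 runs above.
unoptimized_ms = 2419.44
optimized_ms = 436.25
print(f"-O3 rebuild speedup: {unoptimized_ms / optimized_ms:.1f}x")  # ~5.5x
```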
Wow! Nice acceleration compared to the 4B.
Yeah, I'd expect some improvements on a successor board. I also tested the code on CM4s fitted with 8 GB RAM.
Initial results:
| Llama 2 7B | Single Node | 2 Nodes | 4 Nodes |
|---|---|---|---|
| Avg generation time (ms) | 448.00 | 748.94 | 491.38 |
| Avg inference time (ms) | 442.06 | 259.94 | 166.44 |
| Avg transfer time (ms) | 5.25 | 488.62 | 324.50 |
The master was my RPi 5 and the remaining workers were the CM4s. Unfortunately I don't own more RPi 4 or CM4 modules, so I wasn't able to test an 8-node system.
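(An observation of mine on these numbers: average generation time is roughly inference time plus transfer time, so the multi-node runs above are dominated by network transfer rather than compute.)

```python
# (inference ms, transfer ms, reported generation ms) from the table above.
runs = {
    "single node": (442.06, 5.25, 448.00),
    "2 nodes": (259.94, 488.62, 748.94),
    "4 nodes": (166.44, 324.50, 491.38),
}
for name, (inf, xfer, gen) in runs.items():
    print(f"{name}: {inf} + {xfer} = {inf + xfer:.2f} ms (reported {gen} ms)")
```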
A more robust technique would be distributing workloads using k8s or similar orchestration, but that's another story. By the way, I'm expecting a brand new Rockchip-based SoC with 32 GB RAM to arrive early next month. Once I get it, I'll post a single-node benchmark. https://turingpi.com/exciting-updates-on-turing-rk1-compute-module/
> A more robust technique would be distributing workloads using k8s or similar orchestration.

In the last few days I tested a few configurations of VMs in Google Cloud. This is the best test so far.

> Once I get it, I'll post a single-node benchmark.

Cool!
Hi all, just want to report some findings with multiple Pi 5 8GB nodes:
| Llama 2 7B | 1 Node | 2 Nodes | 4 Nodes |
|---|---|---|---|
| Avg generation time | 419.56 ms | 297.10 ms | 241.50 ms |
| Avg inference time | 412.76 ms | 254.48 ms | 163.24 ms |
| Avg transfer time | 6.40 ms | 42.08 ms | 77.90 ms |
Surprised to see that going from 2 to 4 nodes yields such a small improvement... Any thoughts?
@Vrownie I think your results are correct; after adding 2 more devices you should expect close to a 2x improvement over the 2-node setup (not 4x).
You can understand it this way: if you assume that 1 device needs to perform 1000 operations, then in the best scenario 2 devices need to perform 500 operations each (2x faster than 1 device), and 4 devices need to perform 250 operations each (4x faster than 1 device, but only 2x faster than 2 devices).
412.76 ms (1 node) / 254.48 ms (2 nodes) => 1.6 (close to 2)
412.76 ms (1 node) / 163.24 ms (4 nodes) => 2.5 (close to 4)
254.48 ms (2 nodes) / 163.24 ms (4 nodes) => 1.5 (close to 2)
Another factor is that the root node always has a bit more computation to perform than the workers, so the execution time doesn't decrease linearly.
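(A toy model of mine, not project code: treating the per-token time as an even share of the work plus a fixed overhead that doesn't shrink with more nodes reproduces the measured trend; the overhead term is an assumption, fitted from the 1- and 2-node times.)

```python
# Measured avg inference times (ms) from @Vrownie's table above.
t = {1: 412.76, 2: 254.48, 4: 163.24}

# Assumed model: t(n) = work / n + overhead, fitted on the 1- and 2-node runs.
work = 2 * (t[1] - t[2])   # from work/1 + ovh = t[1] and work/2 + ovh = t[2]
overhead = t[1] - work
for n in (1, 2, 4):
    print(f"{n} node(s): predicted {work / n + overhead:.1f} ms, measured {t[n]} ms")
```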