Comments (15)

b4rtaz commented on June 4, 2024
  1. Could you run single instance on your RPi 4?
  2. Could you run single instance on your RPi 5?

from distributed-llama.

b4rtaz commented on June 4, 2024

Could you confirm the size of your weights file?

b4rtaz@b4rtazs-MacBook-Pro converter % ls -l
total 267075104
drwxr-xr-x@ 3 b4rtaz  staff           96 Dec  9 00:40 __pycache__
-rw-r--r--@ 1 b4rtaz  staff         6310 Jan  7 22:12 converter.py
-rw-r--r--@ 1 b4rtaz  staff   7887097884 Jan  8 13:09 dllama_llama-2-13b_q40.bin
-rw-r--r--@ 1 b4rtaz  staff  39706066972 Jan  8 01:05 dllama_llama-2-70b_q40.bin
-rw-r--r--@ 1 b4rtaz  staff   4242882588 Jan  7 22:23 dllama_llama-2-7b_q40.bin
...

In your logs of the root node I don't see this part:

...
💡 nSlices: 1
⏩ Loaded 4242882560 bytes

So it looks like the weights file doesn't have all bytes.
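A quick way to confirm this is to compare the file's size on disk against a known-good conversion. A minimal sketch (the expected size is taken from the `ls -l` listing above; the path and helper name are hypothetical):

```python
import os

# Known-good size of dllama_llama-2-7b_q40.bin, taken from the
# `ls -l` listing above (adjust for your own model and conversion).
EXPECTED_BYTES = 4242882588

def weights_complete(path: str, expected: int = EXPECTED_BYTES) -> bool:
    """Return True if the weights file exists and has the expected size."""
    return os.path.exists(path) and os.path.getsize(path) == expected
```

If this returns False, re-running the converter (or re-copying the file) is the likely fix.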

segabor commented on June 4, 2024

Here's the listing of the llama 7b model directory and the converted weights file:

segabor@bigfive:~/src/distributed-llama $ ls -l /mnt/data/llama-2-7b/
total 17024788
-rw-r--r-- 1 segabor segabor         100 Jan 25 07:09 checklist.chk
-rw-r--r-- 1 segabor segabor 13476925163 Jan 25 07:09 consolidated.00.pth
-rw-r--r-- 1 segabor segabor  3956441088 Jan 25 13:26 dllama_llama-2-7b_q40.bin
-rw-r--r-- 1 segabor segabor         105 Jan 25 09:46 params.json

Thanks! Apparently the size of my weights file doesn't match the corresponding entry on your list. It's the 70b!
I'm going to close this ticket, no error!

b4rtaz commented on June 4, 2024

Cool! The performance is a bit low, though; a single RasPi 4B reaches 1312.50 ms per token. Did you start the inference with --nthreads 4?

b4rtaz commented on June 4, 2024

I think your results are correct. The problem here is that you are using devices with different processor speeds.

Basically, your results are limited by the slowest device. The CM4 is essentially a RasPi 4 (1.5 GHz), and in my tests I got 793.69 ms for 2x RasPi 4B; you have almost the same result. Distributed Llama doesn't split the load depending on processor speed.

You should observe much better results with 2x RasPi 5.
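The slowest-device bound can be sketched with a toy model (illustrative numbers only; it assumes an even split of work and ignores transfer time, which is not how a real cluster behaves in detail):

```python
def step_time(total_ops: float, node_speeds: list[float]) -> float:
    """Per-token time for an even split: every node gets total_ops / n
    operations, and the step finishes when the slowest node is done."""
    per_node = total_ops / len(node_speeds)
    return max(per_node / speed for speed in node_speeds)

# Relative speeds (made-up units): pairing a faster RasPi 5 with a CM4
# is no quicker than two CM4s, because the CM4 sets the pace.
print(step_time(1000, [1.5, 1.5]))  # 2x CM4
print(step_time(1000, [2.4, 1.5]))  # RasPi 5 + CM4: same bound
```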

segabor commented on June 4, 2024

rpi5_dmesg.log

segabor commented on June 4, 2024

Both crashed with SIGSEGV, indicating they ran out of memory.

The command I used on both devices

sudo nice -n -20 ./main inference \
  --model ./dllama_llama-2-7b_q40.bin \
  --tokenizer ./tokenizer.bin \
  --weights-float-type q40 \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4
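If the crash really is memory exhaustion, a rough pre-flight check helps: the 7B q40 weights alone are about 4 GB, which leaves little headroom on a 4 GB board for the KV cache and buffers. A sketch (the 20% headroom factor is an assumption, not a measured figure):

```python
def mem_available_kb(meminfo_text: str) -> int:
    """Parse MemAvailable (in kB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")

def enough_memory(meminfo_text: str, weights_bytes: int,
                  headroom: float = 1.2) -> bool:
    """Weights plus ~20% headroom must fit into available memory."""
    return mem_available_kb(meminfo_text) * 1024 >= weights_bytes * headroom

# On a Pi: enough_memory(open("/proc/meminfo").read(), 4242882588)
```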

segabor commented on June 4, 2024

I've run the conversion again and it fixed the single-node run.

The latest run on my RPi 5 looks like this:

segabor@bigfive:~/src/distributed-llama $ ./run_single.sh 
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 2460 ms I 2458 ms T    0 ms S      0 kB R      0 kB Hello
🔶 G 2409 ms I 2409 ms T    0 ms S      0 kB R      0 kB  world
🔶 G 2398 ms I 2397 ms T    0 ms S      0 kB R      0 kB ,
🔶 G 2400 ms I 2399 ms T    0 ms S      0 kB R      0 kB  I
🔶 G 2433 ms I 2428 ms T    4 ms S      0 kB R      0 kB '
🔶 G 2406 ms I 2405 ms T    0 ms S      0 kB R      0 kB m
🔶 G 2438 ms I 2432 ms T    4 ms S      0 kB R      0 kB  new
🔶 G 2403 ms I 2402 ms T    0 ms S      0 kB R      0 kB  to
🔶 G 2405 ms I 2404 ms T    0 ms S      0 kB R      0 kB  this
🔶 G 2407 ms I 2406 ms T    0 ms S      0 kB R      0 kB  and
🔶 G 2453 ms I 2452 ms T    0 ms S      0 kB R      0 kB  have
🔶 G 2408 ms I 2407 ms T    0 ms S      0 kB R      0 kB  a
🔶 G 2411 ms I 2410 ms T    0 ms S      0 kB R      0 kB  question
🔶 G 2416 ms I 2415 ms T    0 ms S      0 kB R      0 kB  for
🔶 G 2416 ms I 2415 ms T    0 ms S      0 kB R      0 kB  you
🔶 G 2448 ms I 2447 ms T    0 ms S      0 kB R      0 kB .
Generated tokens:    16
Avg generation time: 2419.44 ms
Avg inference time:  2417.88 ms
Avg transfer time:   0.50 ms

segabor commented on June 4, 2024

@b4rtaz yes, threads are set to 4. But I realized the binary was built without optimizations. After recompiling with -O3, the single-node run performed as below:

💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G  420 ms I  418 ms T    0 ms S      0 kB R      0 kB Hello
🔶 G  495 ms I  491 ms T    4 ms S      0 kB R      0 kB  world
🔶 G  410 ms I  407 ms T    2 ms S      0 kB R      0 kB !
🔶 G  414 ms I  413 ms T    0 ms S      0 kB R      0 kB  The
🔶 G  410 ms I  409 ms T    0 ms S      0 kB R      0 kB  new
🔶 G  453 ms I  444 ms T    8 ms S      0 kB R      0 kB  year
🔶 G  414 ms I  412 ms T    1 ms S      0 kB R      0 kB  is
🔶 G  447 ms I  442 ms T    4 ms S      0 kB R      0 kB  upon
🔶 G  446 ms I  442 ms T    4 ms S      0 kB R      0 kB  us
🔶 G  412 ms I  411 ms T    0 ms S      0 kB R      0 kB ,
🔶 G  448 ms I  444 ms T    4 ms S      0 kB R      0 kB  and
🔶 G  413 ms I  412 ms T    0 ms S      0 kB R      0 kB  as
🔶 G  449 ms I  448 ms T    0 ms S      0 kB R      0 kB  always
🔶 G  452 ms I  448 ms T    4 ms S      0 kB R      0 kB ,
🔶 G  451 ms I  446 ms T    4 ms S      0 kB R      0 kB  we
🔶 G  446 ms I  446 ms T    0 ms S      0 kB R      0 kB  have
Generated tokens:    16
Avg generation time: 436.25 ms
Avg inference time:  433.31 ms
Avg transfer time:   2.19 ms

b4rtaz commented on June 4, 2024

Wow! Nice acceleration compared to the RasPi 4B.

segabor commented on June 4, 2024

Yeah, I'd expect some improvements on a successor board. I also tested the code with CM4s fitted with 8 GB RAM.
Initial results (all times in ms):

Model = Llama 2 7b     Single Node   2 Nodes   4 Nodes
Avg generation time         448.00    748.94    491.38
Avg inference time          442.06    259.94    166.44
Avg transfer time             5.25    488.62    324.50

The master was my RPi 5 and the remaining workers were the CM4s. Unfortunately I don't own any more RPi 4 or CM4 modules, so I wasn't able to test an 8-node system.
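Note how transfer time dominates the multi-node rows: per-token generation time is approximately inference time plus transfer time. A tiny sanity check on that decomposition (a simplification; small extra overheads make the measured totals slightly higher):

```python
def approx_generation_ms(inference_ms: float, transfer_ms: float) -> float:
    """Per-token generation time, approximated as inference + transfer."""
    return inference_ms + transfer_ms

# Rows from the CM4 table above (ms): the sums land within ~0.5 ms
# of the measured generation times (748.94 and 491.38).
print(approx_generation_ms(259.94, 488.62))
print(approx_generation_ms(166.44, 324.50))
```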

segabor commented on June 4, 2024

A more robust technique would be to distribute workloads using k8s or similar orchestration, but that's another story. By the way, I'm expecting a brand new Rockchip-based SoC with 32 GB RAM to arrive early next month. Once I get it I will post a single-node benchmark. https://turingpi.com/exciting-updates-on-turing-rk1-compute-module/

b4rtaz commented on June 4, 2024

> A more robust technique would be to distribute workloads using k8s or similar orchestration.

In the last few days I tested a few configurations of VMs in Google Cloud. This is the best test so far.

> Once I get it I will post a single-node benchmark

Cool!

Vrownie commented on June 4, 2024

Hi all, just want to report some findings with multiple Pi 5 8GB nodes:

llama-2 7B             1 Node      2 Nodes     4 Nodes
Avg generation time    419.56 ms   297.10 ms   241.50 ms
Avg inference time     412.76 ms   254.48 ms   163.24 ms
Avg transfer time        6.40 ms    42.08 ms    77.90 ms

Surprised to see that going from 2 to 4 nodes yields such a small improvement... Any thoughts?

b4rtaz commented on June 4, 2024

@Vrownie I think your results are correct. After you added 2 more devices you should expect close to a 2x improvement over the 2-node setup (not 4x).

You can think of it this way: if you assume that 1 device needs to perform 1000 operations, then in the best scenario 2 devices need to perform 500 operations each (2x faster), and 4 devices need to perform 250 operations each (4x faster).

412.76 ms (1 node)  / 254.48 ms (2 nodes) => 1.6 (close to 2)
412.76 ms (1 node)  / 163.24 ms (4 nodes) => 2.5 (close to 4)
254.48 ms (2 nodes) / 163.24 ms (4 nodes) => 1.6 (close to 2)

Another factor is that the root node always has a bit more computation to perform than the workers, so the execution time doesn't decrease linearly.
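The ratios above can be reproduced directly from the table (the numbers are the average inference times reported earlier in the thread):

```python
def speedup(baseline_ms: float, measured_ms: float) -> float:
    """Observed speedup of a configuration relative to a baseline."""
    return baseline_ms / measured_ms

one, two, four = 412.76, 254.48, 163.24  # avg inference time, ms
print(f"1 -> 2 nodes: {speedup(one, two):.1f}x")   # ideal: 2x
print(f"1 -> 4 nodes: {speedup(one, four):.1f}x")  # ideal: 4x
print(f"2 -> 4 nodes: {speedup(two, four):.1f}x")  # ideal: 2x
```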
