Comments (15)

b4rtaz commented on June 4, 2024
  1. Could you run single instance on your RPi 4?
  2. Could you run single instance on your RPi 5?

from distributed-llama.

b4rtaz commented on June 4, 2024

Could you confirm the size of your weights file?

b4rtaz@b4rtazs-MacBook-Pro converter % ls -l
total 267075104
drwxr-xr-x@ 3 b4rtaz  staff           96 Dec  9 00:40 __pycache__
-rw-r--r--@ 1 b4rtaz  staff         6310 Jan  7 22:12 converter.py
-rw-r--r--@ 1 b4rtaz  staff   7887097884 Jan  8 13:09 dllama_llama-2-13b_q40.bin
-rw-r--r--@ 1 b4rtaz  staff  39706066972 Jan  8 01:05 dllama_llama-2-70b_q40.bin
-rw-r--r--@ 1 b4rtaz  staff   4242882588 Jan  7 22:23 dllama_llama-2-7b_q40.bin
...

In your logs of the root node I don't see this part:

...
💡 nSlices: 1
⏩ Loaded 4242882560 bytes

So it looks like the weights file doesn't have all bytes.
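A quick way to confirm this is to compare the file's size on disk against a known-good conversion. A minimal sketch (the expected size is taken from the `ls -l` listing above; the path and helper name are hypothetical):

```python
import os

# Known-good size of dllama_llama-2-7b_q40.bin, taken from the
# `ls -l` listing above (adjust for your own model and conversion).
EXPECTED_BYTES = 4242882588

def weights_complete(path: str, expected: int = EXPECTED_BYTES) -> bool:
    """Return True if the weights file exists and has the expected size."""
    return os.path.exists(path) and os.path.getsize(path) == expected
```

If this returns False, re-running the converter (or re-copying the file) is the likely fix.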

segabor commented on June 4, 2024

Here's the listing of the llama 7b model directory and the converted weights file:

segabor@bigfive:~/src/distributed-llama $ ls -l /mnt/data/llama-2-7b/
total 17024788
-rw-r--r-- 1 segabor segabor         100 Jan 25 07:09 checklist.chk
-rw-r--r-- 1 segabor segabor 13476925163 Jan 25 07:09 consolidated.00.pth
-rw-r--r-- 1 segabor segabor  3956441088 Jan 25 13:26 dllama_llama-2-7b_q40.bin
-rw-r--r-- 1 segabor segabor         105 Jan 25 09:46 params.json

Thanks! Apparently the size of my weights file doesn't match the corresponding entry on your list. It's the 70b!
I'm going to close this ticket, no error!

b4rtaz commented on June 4, 2024

Cool! The performance is a bit low, though; a single RasPi 4B reaches 1312.50 ms per token. Did you start the inference with --nthreads 4?

b4rtaz commented on June 4, 2024

I think your results are correct. The problem here is that you are using devices with different processor speeds.

Basically, your results are limited by the slowest device. The CM4 is essentially a RasPi 4 (1.5 GHz), and in my tests I got 793.69 ms for 2x RasPi 4B; you have almost the same result. Distributed Llama doesn't split the load depending on processor speed.

You should observe much better results with 2x RasPi 5.
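The slowest-device bound can be sketched with a toy model (illustrative numbers only; it assumes an even split of work and ignores transfer time, which is not how a real cluster behaves in detail):

```python
def step_time(total_ops: float, node_speeds: list[float]) -> float:
    """Per-token time for an even split: every node gets total_ops / n
    operations, and the step finishes when the slowest node is done."""
    per_node = total_ops / len(node_speeds)
    return max(per_node / speed for speed in node_speeds)

# Relative speeds (made-up units): pairing a faster RasPi 5 with a CM4
# is no quicker than two CM4s, because the CM4 sets the pace.
print(step_time(1000, [1.5, 1.5]))  # 2x CM4
print(step_time(1000, [2.4, 1.5]))  # RasPi 5 + CM4: same bound
```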

segabor commented on June 4, 2024

rpi5_dmesg.log

segabor commented on June 4, 2024

Both crashed with SIGSEGV, indicating they ran out of memory.

The command I used on both devices

sudo nice -n -20 ./main inference \
  --model ./dllama_llama-2-7b_q40.bin \
  --tokenizer ./tokenizer.bin \
  --weights-float-type q40 \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4
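If the crash really is memory exhaustion, a rough pre-flight check helps: the 7B q40 weights alone are about 4 GB, which leaves little headroom on a 4 GB board for the KV cache and buffers. A sketch (the 20% headroom factor is an assumption, not a measured figure):

```python
def mem_available_kb(meminfo_text: str) -> int:
    """Parse MemAvailable (in kB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")

def enough_memory(meminfo_text: str, weights_bytes: int,
                  headroom: float = 1.2) -> bool:
    """Weights plus ~20% headroom must fit into available memory."""
    return mem_available_kb(meminfo_text) * 1024 >= weights_bytes * headroom

# On a Pi: enough_memory(open("/proc/meminfo").read(), 4242882588)
```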

segabor commented on June 4, 2024

I've run the conversion again and it fixed the single-node run.

The latest run on my RPi 5 looks like this:

segabor@bigfive:~/src/distributed-llama $ ./run_single.sh 
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 2460 ms I 2458 ms T    0 ms S      0 kB R      0 kB Hello
🔶 G 2409 ms I 2409 ms T    0 ms S      0 kB R      0 kB  world
🔶 G 2398 ms I 2397 ms T    0 ms S      0 kB R      0 kB ,
🔶 G 2400 ms I 2399 ms T    0 ms S      0 kB R      0 kB  I
🔶 G 2433 ms I 2428 ms T    4 ms S      0 kB R      0 kB '
🔶 G 2406 ms I 2405 ms T    0 ms S      0 kB R      0 kB m
🔶 G 2438 ms I 2432 ms T    4 ms S      0 kB R      0 kB  new
🔶 G 2403 ms I 2402 ms T    0 ms S      0 kB R      0 kB  to
🔶 G 2405 ms I 2404 ms T    0 ms S      0 kB R      0 kB  this
🔶 G 2407 ms I 2406 ms T    0 ms S      0 kB R      0 kB  and
🔶 G 2453 ms I 2452 ms T    0 ms S      0 kB R      0 kB  have
🔶 G 2408 ms I 2407 ms T    0 ms S      0 kB R      0 kB  a
🔶 G 2411 ms I 2410 ms T    0 ms S      0 kB R      0 kB  question
🔶 G 2416 ms I 2415 ms T    0 ms S      0 kB R      0 kB  for
🔶 G 2416 ms I 2415 ms T    0 ms S      0 kB R      0 kB  you
🔶 G 2448 ms I 2447 ms T    0 ms S      0 kB R      0 kB .
Generated tokens:    16
Avg generation time: 2419.44 ms
Avg inference time:  2417.88 ms
Avg transfer time:   0.50 ms

segabor commented on June 4, 2024

@b4rtaz yes, threads are set to 4. But I realized the binary was built without optimizations. After recompiling with -O3, the single-node run performed as below:

💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G  420 ms I  418 ms T    0 ms S      0 kB R      0 kB Hello
🔶 G  495 ms I  491 ms T    4 ms S      0 kB R      0 kB  world
🔶 G  410 ms I  407 ms T    2 ms S      0 kB R      0 kB !
🔶 G  414 ms I  413 ms T    0 ms S      0 kB R      0 kB  The
🔶 G  410 ms I  409 ms T    0 ms S      0 kB R      0 kB  new
🔶 G  453 ms I  444 ms T    8 ms S      0 kB R      0 kB  year
🔶 G  414 ms I  412 ms T    1 ms S      0 kB R      0 kB  is
🔶 G  447 ms I  442 ms T    4 ms S      0 kB R      0 kB  upon
🔶 G  446 ms I  442 ms T    4 ms S      0 kB R      0 kB  us
🔶 G  412 ms I  411 ms T    0 ms S      0 kB R      0 kB ,
🔶 G  448 ms I  444 ms T    4 ms S      0 kB R      0 kB  and
🔶 G  413 ms I  412 ms T    0 ms S      0 kB R      0 kB  as
🔶 G  449 ms I  448 ms T    0 ms S      0 kB R      0 kB  always
🔶 G  452 ms I  448 ms T    4 ms S      0 kB R      0 kB ,
🔶 G  451 ms I  446 ms T    4 ms S      0 kB R      0 kB  we
🔶 G  446 ms I  446 ms T    0 ms S      0 kB R      0 kB  have
Generated tokens:    16
Avg generation time: 436.25 ms
Avg inference time:  433.31 ms
Avg transfer time:   2.19 ms

b4rtaz commented on June 4, 2024

Wow! Nice acceleration compared to the RasPi 4B.

segabor commented on June 4, 2024

Yeah, I'd expect some improvements on a successor board. I also tested the code with CM4s fitted with 8 GB RAM.
Initial results (all times in ms):

Model = Llama 2 7b     Single Node   2 Nodes   4 Nodes
Avg generation time         448.00    748.94    491.38
Avg inference time          442.06    259.94    166.44
Avg transfer time             5.25    488.62    324.50

The master was my RPi 5 and the remaining workers were the CM4s. Unfortunately I don't own any more RPi 4 or CM4 modules, so I wasn't able to test an 8-node system.
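Note how transfer time dominates the multi-node rows: per-token generation time is approximately inference time plus transfer time. A tiny sanity check on that decomposition (a simplification; small extra overheads make the measured totals slightly higher):

```python
def approx_generation_ms(inference_ms: float, transfer_ms: float) -> float:
    """Per-token generation time, approximated as inference + transfer."""
    return inference_ms + transfer_ms

# Rows from the CM4 table above (ms): the sums land within ~0.5 ms
# of the measured generation times (748.94 and 491.38).
print(approx_generation_ms(259.94, 488.62))
print(approx_generation_ms(166.44, 324.50))
```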

segabor commented on June 4, 2024

A more robust technique would be to distribute workloads using k8s or similar orchestration, but that's another story. By the way, I'm expecting a brand new Rockchip-based SoC with 32 GB RAM to arrive early next month. Once I get it I will post a single-node benchmark. https://turingpi.com/exciting-updates-on-turing-rk1-compute-module/

b4rtaz commented on June 4, 2024

> A more robust technique would be to distribute workloads using k8s or similar orchestration.

In the last few days I tested a few configurations of VMs in Google Cloud. This is the best test so far.

> Once I get it I will post a single-node benchmark

Cool!

Vrownie commented on June 4, 2024

Hi all, just want to report some findings with multiple Pi 5 8GB nodes:

llama-2 7B             1 Node      2 Nodes     4 Nodes
Avg generation time    419.56 ms   297.10 ms   241.50 ms
Avg inference time     412.76 ms   254.48 ms   163.24 ms
Avg transfer time        6.40 ms    42.08 ms    77.90 ms

Surprised to see that going from 2 to 4 nodes yields such a small improvement... Any thoughts?

b4rtaz commented on June 4, 2024

@Vrownie I think your results are correct. After you added 2 more devices you should expect close to a 2x improvement over the 2-node setup (not 4x).

You can think of it this way: if you assume that 1 device needs to perform 1000 operations, then in the best scenario 2 devices need to perform 500 operations each (2x faster), and 4 devices need to perform 250 operations each (4x faster).

412.76 ms (1 node)  / 254.48 ms (2 nodes) => 1.6 (close to 2)
412.76 ms (1 node)  / 163.24 ms (4 nodes) => 2.5 (close to 4)
254.48 ms (2 nodes) / 163.24 ms (4 nodes) => 1.6 (close to 2)

Another factor is that the root node always has a bit more computation to perform than the workers, so the execution time doesn't decrease linearly.
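The ratios above can be reproduced directly from the table (the numbers are the average inference times reported earlier in the thread):

```python
def speedup(baseline_ms: float, measured_ms: float) -> float:
    """Observed speedup of a configuration relative to a baseline."""
    return baseline_ms / measured_ms

one, two, four = 412.76, 254.48, 163.24  # avg inference time, ms
print(f"1 -> 2 nodes: {speedup(one, two):.1f}x")   # ideal: 2x
print(f"1 -> 4 nodes: {speedup(one, four):.1f}x")  # ideal: 4x
print(f"2 -> 4 nodes: {speedup(two, four):.1f}x")  # ideal: 2x
```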
