You just separate them with spaces like so:
./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998
You can also run several from the same IP, like so:
./dllama inference ... --workers 10.0.0.1:9996 10.0.0.1:9997 10.0.0.1:9998
As for 1., performance would be faster on workers that have unified memory, due to their higher memory bandwidth.
The root node consumes a bit more memory than the workers, so I'd use the 36 GB MacBook as the root node. Typically the memory required to load the model is divided by the number of workers, though the number of workers needs to be a power of 2 (2, 4, 8, etc.).
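To make that concrete, here's a rough sketch of the memory math. The 40 GB model size is a made-up example figure, and the even split is a simplification (as noted, the root node actually holds a bit more than an even share):

```python
def is_power_of_two(n: int) -> bool:
    # Per the comment above, the worker count must be a power of 2.
    return n > 0 and (n & (n - 1)) == 0

def per_worker_memory_gb(model_gb: float, n_workers: int) -> float:
    # Simplified: assumes a perfectly even split; in practice the
    # root node consumes a bit more memory than the workers.
    if not is_power_of_two(n_workers):
        raise ValueError(f"worker count must be a power of 2, got {n_workers}")
    return model_gb / n_workers

print(per_worker_memory_gb(40.0, 4))  # a 40 GB model over 4 workers -> 10.0 GB each
```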
It's also worth experimenting with the number of threads you specify: in my case I have 6 cores and 12 threads, but I get the best performance with 8 threads.
Larger models require more data to be transferred during each inference pass; something like Q80 Llama 70B might already hit the limits of gigabit Ethernet, and at that point the switching capacity of your Ethernet switch also becomes a factor.
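A rough back-of-envelope sketch of why gigabit Ethernet becomes the bottleneck. The per-layer transfer model below is a simplifying assumption, not distributed-llama's actual wire protocol (real transfers can be quantized and smaller):

```python
# Hypothetical, simplified model: each transformer layer exchanges one
# activation vector (DIM float32 values) with a worker twice per token
# (once after attention, once after the feed-forward block).
DIM = 8192        # Llama 70B hidden dimension
LAYERS = 80       # Llama 70B layer count
BYTES_PER_FLOAT = 4

bytes_per_token = LAYERS * 2 * DIM * BYTES_PER_FLOAT   # bytes moved per token
gigabit_bytes_per_sec = 125_000_000                    # 1 Gbit/s link

ceiling_tokens_per_sec = gigabit_bytes_per_sec / bytes_per_token
print(f"{bytes_per_token / 1e6:.1f} MB per token, "
      f"ceiling ~{ceiling_tokens_per_sec:.1f} tokens/s per link")
# prints "5.2 MB per token, ceiling ~23.8 tokens/s per link"
```

Even under these generous assumptions the network caps throughput in the low tens of tokens per second per link, and several workers sharing one switch lower that ceiling further.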
from distributed-llama.