It is mentioned on README that candle supports multi GPU inference, using NCCL under t

How to run inference of a (very) large model across multiple GPUs? (candle, open)

jorgeantonio21 commented on July 21, 2024

How to run inference of a (very) large model across multiple GPUs?

Comments (2)

EricLBuehler commented on July 21, 2024

Please see the llama multiprocess example. There, multi-GPU inference is implemented with parallelized (tensor-parallel) linear layers:

candle/candle-examples/examples/llama_multiprocess/model.rs

Lines 293 to 308 in f48c07e

    fn load(vb: VarBuilder, cache: &Cache, cfg: &Config, comm: Rc<Comm>) -> Result<Self> {
        let qkv_proj = TensorParallelColumnLinear::load_multi(
            vb.clone(),
            &["q_proj", "k_proj", "v_proj"],
            comm.clone(),
        )?;
        let o_proj = TensorParallelRowLinear::load(vb.pp("o_proj"), comm.clone())?;
        Ok(Self {
            qkv_proj,
            o_proj,
            num_attention_heads: cfg.num_attention_heads / comm.world_size(),
            num_key_value_heads: cfg.num_key_value_heads / comm.world_size(),
            head_dim: cfg.hidden_size / cfg.num_attention_heads,
            cache: cache.clone(),
        })
    }
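To make the column/row split above more concrete, here is a minimal plain-Rust sketch of the general technique. It is not the candle implementation: the ranks are simulated in a loop, the all-reduce is an ordinary sum, and every function name below is hypothetical. The assumption is the usual tensor-parallel pairing: a column-parallel layer gives each rank a slice of the output features, and the following row-parallel layer consumes that slice and sums the partial results across ranks.

```rust
// Plain-Rust sketch of column- and row-parallel linear layers.
// "Ranks" are simulated in a loop; no GPU, NCCL, or candle types are used.
// All names here are hypothetical and exist only for illustration.

/// y[o] = sum_i w[o][i] * x[i], with `w` stored as out_features rows.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Column-parallel shard: this rank keeps a contiguous block of output rows.
fn column_shard(w: &[Vec<f32>], rank: usize, world_size: usize) -> Vec<Vec<f32>> {
    let per_rank = w.len() / world_size;
    w[rank * per_rank..(rank + 1) * per_rank].to_vec()
}

/// Row-parallel shard: this rank keeps a contiguous block of input columns.
fn row_shard(w: &[Vec<f32>], rank: usize, world_size: usize) -> Vec<Vec<f32>> {
    let per_rank = w[0].len() / world_size;
    w.iter()
        .map(|row| row[rank * per_rank..(rank + 1) * per_rank].to_vec())
        .collect()
}

/// Simulated all-reduce (sum): element-wise sum of every rank's partial output.
fn all_reduce_sum(partials: &[Vec<f32>]) -> Vec<f32> {
    let mut out = vec![0.0; partials[0].len()];
    for p in partials {
        for (o, v) in out.iter_mut().zip(p) {
            *o += v;
        }
    }
    out
}

/// Runs column-parallel `w1`, then row-parallel `w2`, over simulated ranks.
fn tensor_parallel_forward(
    w1: &[Vec<f32>],
    w2: &[Vec<f32>],
    x: &[f32],
    world_size: usize,
) -> Vec<f32> {
    let partials: Vec<Vec<f32>> = (0..world_size)
        .map(|rank| {
            // Each rank sees the full input and produces a slice of the hidden vector.
            let hidden_slice = matvec(&column_shard(w1, rank, world_size), x);
            // That slice lines up with this rank's input columns of the second layer.
            matvec(&row_shard(w2, rank, world_size), &hidden_slice)
        })
        .collect();
    // Only here do the ranks need to communicate.
    all_reduce_sum(&partials)
}
```

Calling tensor_parallel_forward(&w1, &w2, &x, 2) should give the same result as the unsharded matvec(&w2, &matvec(&w1, &x)), provided both split dimensions are divisible by the world size. The same divisibility requirement is why the snippet above divides the head counts by comm.world_size().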

b0xtch commented on July 21, 2024

That example is for a single node. What about multiple nodes? Can we just run the example with the following command?

mpirun -n 2 --hostfile ../../hostfile target/release/llama_multiprocess 2 2000

Update:

I guess I would have to modify the code to use the MPI world rank. Sticking with NCCL as the backend might be better, but does cudarc support cross-node communication?

I found this library: https://github.com/oddity-ai/async-cuda
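On the world-rank point: once a job spans several nodes, a process's global rank can no longer double as its CUDA device ordinal, so some mapping from world rank to (node, local GPU) is needed. The sketch below is plain Rust and only illustrates that arithmetic. The type and function names are hypothetical, nothing here comes from candle, cudarc, or MPI, and it assumes ranks are assigned node by node with the same number of GPUs on every node.

```rust
/// Where a process sits in a multi-node job.
/// Hypothetical type for illustration; not part of candle or cudarc.
#[derive(Debug, PartialEq)]
struct RankPlacement {
    /// Index of the node that hosts this rank.
    node: usize,
    /// GPU ordinal to open on that node.
    local_device: usize,
}

/// Maps a global (world) rank to a node index and a local GPU ordinal.
/// Assumes ranks are laid out node by node and that every node exposes
/// exactly `gpus_per_node` GPUs. Returns `None` for inconsistent input.
fn place_rank(
    world_rank: usize,
    world_size: usize,
    gpus_per_node: usize,
) -> Option<RankPlacement> {
    if gpus_per_node == 0
        || world_rank >= world_size
        || world_size % gpus_per_node != 0
    {
        return None;
    }
    Some(RankPlacement {
        node: world_rank / gpus_per_node,
        local_device: world_rank % gpus_per_node,
    })
}
```

With 8 ranks and 4 GPUs per node, place_rank(5, 8, 4) yields node 1 and local device 1. The global rank and world size would still be what the communicator is created with; only the device selection uses the local ordinal. Whether the example can be launched this way across nodes is not confirmed in this thread.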
