It's certainly early days for this project, so the documentation is scarce and the APIs are likely to change. There is some ongoing work on the candle book, but it's also a work in progress.
As for your issue, the model implementation looks reasonable. Maybe the issue is with how you set up the optimizer: you have to pass it the variables used in your model. If you have a fully reproducible example, that would make diagnosing this easier.
I would also suggest looking at the mnist training example for inspiration.
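For illustration, here is roughly what that wiring looks like (MyModel is just a placeholder for your own model type; the point is that the optimizer gets its variables from the same VarMap that backs the VarBuilder):
let dev = Device::Cpu;
let varmap = VarMap::new();
let vb = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
// Hypothetical model: its layers register their variables in `varmap` through `vb`.
let model = MyModel::new(vb)?;
// The optimizer must be handed those same variables, otherwise nothing gets updated.
let optimizer = candle_nn::SGD::new(varmap.all_vars(), 0.003);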
Thanks @LaurentMazare. I am trying to build a few simple tutorials that could onboard newcomers to the candle framework.
This is my optimizer code. Could you review it by any chance?
let varmap = VarMap::new();
let optimizer = candle_nn::SGD::new(varmap.all_vars(), 0.003);
for _ in 0..100 {
let (x, y) = get_batch(&train_data);
let xb: Tensor = Tensor::new(&x, &Device::Cpu).unwrap();
let yb: Tensor = Tensor::new(&y, &Device::Cpu).unwrap();
let (_, loss) = m.forward(&xb, &yb);
optimizer.backward_step(&loss).unwrap();
println!("loss: {:?}", loss);
}
I am picking up xb and yb randomly.
fn get_batch(data: &[u32]) -> ([[u32; BLOCK_SIZE]; BATCH_SIZE], [[u32; BLOCK_SIZE]; BATCH_SIZE]) {
let mut rng = ChaCha8Rng::seed_from_u64(SEED);
let mut xx = [[0u32; BLOCK_SIZE]; BATCH_SIZE];
let mut yy = [[0u32; BLOCK_SIZE]; BATCH_SIZE];
for batch_index in 0..BATCH_SIZE {
let start = rng.gen_range(0..data.len() - BLOCK_SIZE);
for block_index in 0..BLOCK_SIZE {
xx[batch_index][block_index] = data[start + block_index];
yy[batch_index][block_index] = data[start + block_index + 1];
}
}
(xx, yy)
}
Here is a script that should work based on your code. The tricky bit was that the optimizer should take as input the variables held by the VarMap, so you want to use the same VarMap that was also used to create the embedding layer.
Note that I've also removed the unwraps in favor of propagating the errors. I would suggest as possible improvements (a short sketch of both follows after the script):
- Using AdamW rather than SGD might make tweaking the learning rate easier.
- Maybe get_batch could be a Rust iterator?
use anyhow::Result;
use candle::{DType, Device, Tensor};
use candle_nn::{embedding, Embedding, VarBuilder, VarMap};
use rand::{Rng, SeedableRng};
pub const BATCH_SIZE: usize = 64;
pub const BLOCK_SIZE: usize = 128;
pub const VOCAB_SIZE: usize = 100;
pub const SEED: u64 = 299792458;
// Sample BATCH_SIZE random windows of BLOCK_SIZE tokens; the targets are the inputs
// shifted by one position. Note that the rng is re-seeded on every call, so each call
// returns the same batch; hoist the rng out of the function for varied batches.
fn get_batch(
data: &[u32],
) -> (
[[u32; BLOCK_SIZE]; BATCH_SIZE],
[[u32; BLOCK_SIZE]; BATCH_SIZE],
) {
let mut rng = rand::rngs::StdRng::seed_from_u64(SEED);
let mut xx = [[0u32; BLOCK_SIZE]; BATCH_SIZE];
let mut yy = [[0u32; BLOCK_SIZE]; BATCH_SIZE];
for batch_index in 0..BATCH_SIZE {
let start = rng.gen_range(0..data.len() - BLOCK_SIZE);
for block_index in 0..BLOCK_SIZE {
xx[batch_index][block_index] = data[start + block_index];
yy[batch_index][block_index] = data[start + block_index + 1];
}
}
(xx, yy)
}
#[derive(Debug)]
pub struct BigramLanguageModel {
token_embedding_table: Embedding,
}
impl BigramLanguageModel {
// Constructor
pub fn new(vocab_size: usize, vb: VarBuilder) -> Result<Self> {
let token_embedding_table = embedding(vocab_size, vocab_size, vb)?;
Ok(BigramLanguageModel {
token_embedding_table,
})
}
// Forward pass
pub fn forward(&self, idx: &Tensor, targets: &Tensor) -> Result<(Tensor, Tensor)> {
let logits = self.token_embedding_table.forward(idx)?;
// Flatten the (batch, block, vocab) logits into (batch * block, vocab) for cross-entropy.
let shape = logits.shape().dims();
let logits = logits.reshape(&[shape[0] * shape[1], shape[2]])?;
println!("shape: {:?}", logits.shape());
println!("targets shape: {:?}", targets.shape().dims()[0]);
// Only compute the cross-entropy loss when real targets are provided; a single-element
// target is treated as a placeholder and yields a dummy zero loss instead.
if targets.shape().dims()[0] != 1 {
let targets = targets.reshape(&[shape[0] * shape[1]])?;
let loss = candle_nn::loss::cross_entropy(&logits, &targets)?;
Ok((logits, loss))
} else {
let loss = Tensor::zeros((1, 1), DType::F32, &Device::Cpu)?;
Ok((logits, loss))
}
}
}
fn main() -> Result<()> {
let dev = Device::Cpu;
let varmap = VarMap::new();
let vb = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
let m = BigramLanguageModel::new(VOCAB_SIZE, vb)?;
// The optimizer takes its variables from the same VarMap that backs the VarBuilder,
// so it updates exactly the variables created by the model above.
let optimizer = candle_nn::SGD::new(varmap.all_vars(), 0.3);
// Dummy training data: a repeating sequence of token ids.
let train_data = (0..1000).map(|i| i % VOCAB_SIZE as u32).collect::<Vec<_>>();
for _ in 0..100 {
let (x, y) = get_batch(&train_data);
let xb: Tensor = Tensor::new(&x, &dev)?;
let yb: Tensor = Tensor::new(&y, &dev)?;
let (_, loss) = m.forward(&xb, &yb)?;
optimizer.backward_step(&loss)?;
println!("loss: {:?}", loss);
}
Ok(())
}
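As a rough sketch of the two suggestions above, building on the script (not tested; the AdamW hyper-parameters are placeholders, and this assumes a candle version that exposes AdamW and the Optimizer trait):
// AdamW instead of SGD; the learning rate is set through ParamsAdamW.
use candle_nn::Optimizer; // trait providing backward_step for AdamW
let params = candle_nn::ParamsAdamW {
    lr: 1e-3,
    ..Default::default()
};
let mut optimizer = candle_nn::AdamW::new(varmap.all_vars(), params)?;

// get_batch wrapped in an iterator instead of being called by hand in the loop.
for (x, y) in std::iter::repeat_with(|| get_batch(&train_data)).take(100) {
    let xb = Tensor::new(&x, &dev)?;
    let yb = Tensor::new(&y, &dev)?;
    let (_, loss) = m.forward(&xb, &yb)?;
    optimizer.backward_step(&loss)?;
}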
Thanks @LaurentMazare, this is very useful. AdamW is my next step. I just wanted to build everything from scratch, so that different functions can be introduced at different steps. So a Rust iterator will be my next step.
I am curious about the inspiration behind VarBuilder. Does it act as an efficient store of neural-network weights even when the scope of the training function changes?
Also, I am not seeing any multi-core usage. How do I enable it? This is my htop output during training.
The VarBuilder is used to provide a model with variables, so you typically pass it to the functions that create model components. These functions can then retrieve a variable if it's loaded from disk, or generate a random variable if the model is to be initialized. Variables are kept together with their path, e.g. encoder.layer1.mlp.weight, which is used both when reading weights from disk and when saving trained weights to disk.
A typical example, derived from the mnist example:
use candle::{Result, Tensor};
use candle_nn::{Linear, VarBuilder};
// Placeholder dimensions taken from the mnist example (28x28 images, 10 classes).
// Depending on the candle version you may also need `use candle_nn::Module;` for `.forward()`.
const IMAGE_DIM: usize = 784;
const LABELS: usize = 10;
// This is actually already in candle_nn::linear.
fn linear(in_dim: usize, out_dim: usize, vs: VarBuilder) -> Result<Linear> {
let ws = vs.get_or_init((out_dim, in_dim), "weight", candle_nn::init::ZERO)?;
let bs = vs.get_or_init(out_dim, "bias", candle_nn::init::ZERO)?;
Ok(Linear::new(ws, Some(bs)))
}
struct Mlp {
ln1: Linear,
ln2: Linear,
}
impl Mlp {
fn new(vs: VarBuilder) -> Result<Self> {
let ln1 = linear(IMAGE_DIM, 100, vs.pp("ln1"))?;
let ln2 = linear(100, LABELS, vs.pp("ln2"))?;
Ok(Self { ln1, ln2 })
}
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let xs = self.ln1.forward(xs)?;
let xs = xs.relu()?;
self.ln2.forward(&xs)
}
}
This means that you can create a new Mlp by using a VarBuilder that is backed by a file for inference, e.g.:
let weights = unsafe { candle::safetensors::MmapedFile::new(weights_filename)? };
let weights = weights.deserialize()?;
let vb = VarBuilder::from_safetensors(vec![weights], DType::F32, &dev);
let model = Mlp::new(vb)?;
Or use a VarBuilder backed by a fresh VarMap if you want to train a model, via:
// For training
let varmap = VarMap::new();
let vb = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
let model = Mlp::new(vb)?;
// ...train...
varmap.save("mlp.safetensors")?;
The embedding layer is by default not multithreaded as it's usually more memory bound than CPU bound (we're likely to revisit this when polishing things if it adds some performance). Matrix multiplications, convolutions, and other intensive ops should be using multiple cores.
Thank you @LaurentMazare. That makes sense.
Could you help me out with a few more queries?
- How would you use a manual seed in randn initialization?
- Assuming that let x = Tensor::zeros((B, T, C), DType::F32, &Device::Cpu)?;, how would you change the values at x.i((b, t))?
- Any chance you could prioritize an MPS implementation? There is a larger audience that will be moving from Llama.cpp and Rustformers/llm to this repo, where they will be looking at training and inference on a local device. An Apple Silicon backend will definitely encourage those users to use this repo.
- Manual seeding is a work in progress but is not available at the moment; we're trying to figure out a good way to do it (the rng should probably be part of the device, but it's annoying to have devices that are not Copy and that you have to pass by reference, so we have to think a bit more about it).
- You cannot mutate tensors, and that is on purpose: only variables can be mutated. Maybe you want to use something like mask.where_cond(x, y)? See the sketch below.
- We actually added support for accelerate over the last couple of days. It's probably not as good as having full metal support, but it should already provide a good speed-up. You can use it with --features accelerate; please let us know if you see anything weird going on with it, as it's certainly less tested than the rest.
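For illustration, a minimal sketch of the where_cond pattern for building an "edited" tensor without mutation; the hand-written u8 mask here is just an example, and any comparison op producing a mask would work too:
use candle::{DType, Device, Tensor};

fn where_cond_example() -> candle::Result<Tensor> {
    let dev = Device::Cpu;
    let x = Tensor::zeros((2, 3), DType::F32, &dev)?;
    let y = Tensor::ones((2, 3), DType::F32, &dev)?;
    // A u8 mask selecting which positions come from x (1) and which from y (0).
    let mask = Tensor::new(&[[1u8, 0, 1], [0, 1, 0]], &dev)?;
    // where_cond builds a new tensor rather than mutating x in place.
    mask.where_cond(&x, &y)
}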
Thanks @LaurentMazare. I will check it out.
One more question: are you using accelerate for BLAS, LAPACK, or BNNS? What kind of speed-ups have you observed?
Apologies, I guess you are using accelerate-src to get this going. I will dig into the implementation as I get a better command of the framework. Should I close this issue?
Right, on my linux x86 box I see mkl being regularly 2x or 3x faster. With accelerate, on the matmul benchmark from cpu_benchmarks.rs I saw a 5x speed-up, but that may be a very specific case. Feel free to close this issue if the original question was answered (and obviously open new ones if you have further/different questions, or re-open if you have more comments on the same topic).
Thanks @LaurentMazare