It's certainly early days for this project, so the documentation is scarce and the APIs are likely to change. There is some ongoing work on the candle book, but it's also a work in progress.
As for your issue, the model implementation looks reasonable. Maybe the issue is with how you set up the optimizer: you have to pass it the variables used in your model. If you have a fully reproducible example, that would make diagnosing this easier.
I would also suggest looking at the mnist training example for inspiration.
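For illustration, here is roughly what that wiring looks like (MyModel is just a placeholder for your own model type; the point is that the optimizer gets its variables from the same VarMap that backs the VarBuilder):
let dev = Device::Cpu;
let varmap = VarMap::new();
let vb = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
// Hypothetical model: its layers register their variables in `varmap` through `vb`.
let model = MyModel::new(vb)?;
// The optimizer must be handed those same variables, otherwise nothing gets updated.
let optimizer = candle_nn::SGD::new(varmap.all_vars(), 0.003);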
Thanks @LaurentMazare. I am trying to build a few simple tutorials that could onboard newcomers to the candle framework.
This is my optimizer code. Could you review it by any chance?
let varmap = VarMap::new();
let optimizer = candle_nn::SGD::new(varmap.all_vars(), 0.003);
for _ in 0..100 {
let (x, y) = get_batch(&train_data);
let xb: Tensor = Tensor::new(&x, &Device::Cpu).unwrap();
let yb: Tensor = Tensor::new(&y, &Device::Cpu).unwrap();
let (_, loss) = m.forward(&xb, &yb);
optimizer.backward_step(&loss).unwrap();
println!("loss: {:?}", loss);
}
I am picking up xb and yb randomly.
fn get_batch(data: &[u32]) -> ([[u32; BLOCK_SIZE]; BATCH_SIZE], [[u32; BLOCK_SIZE]; BATCH_SIZE]) {
let mut rng = ChaCha8Rng::seed_from_u64(SEED);
let mut xx = [[0u32; BLOCK_SIZE]; BATCH_SIZE];
let mut yy = [[0u32; BLOCK_SIZE]; BATCH_SIZE];
for batch_index in 0..BATCH_SIZE {
let start = rng.gen_range(0..data.len() - BLOCK_SIZE);
for block_index in 0..BLOCK_SIZE {
xx[batch_index][block_index] = data[start + block_index];
yy[batch_index][block_index] = data[start + block_index + 1];
}
}
(xx, yy)
}
Here is a script that should work based on your code. The tricky bit was that the optimizer should take as input the variables held by the VarMap, so you want to use the same VarMap that was also used to create the embedding layer.
Note that I've also removed the unwraps in favor of propagating the errors. I would suggest as possible improvements (a short sketch of both follows after the script):
- Using AdamW rather than SGD might make tweaking the learning rate easier.
- Maybe get_batch could be a Rust iterator?
use anyhow::Result;
use candle::{DType, Device, Tensor};
use candle_nn::{embedding, Embedding, VarBuilder, VarMap};
use rand::{Rng, SeedableRng};
pub const BATCH_SIZE: usize = 64;
pub const BLOCK_SIZE: usize = 128;
pub const VOCAB_SIZE: usize = 100;
pub const SEED: u64 = 299792458;
// Sample BATCH_SIZE random windows of BLOCK_SIZE tokens; the targets are the inputs
// shifted by one position. Note that the rng is re-seeded on every call, so each call
// returns the same batch; hoist the rng out of the function for varied batches.
fn get_batch(
data: &[u32],
) -> (
[[u32; BLOCK_SIZE]; BATCH_SIZE],
[[u32; BLOCK_SIZE]; BATCH_SIZE],
) {
let mut rng = rand::rngs::StdRng::seed_from_u64(SEED);
let mut xx = [[0u32; BLOCK_SIZE]; BATCH_SIZE];
let mut yy = [[0u32; BLOCK_SIZE]; BATCH_SIZE];
for batch_index in 0..BATCH_SIZE {
let start = rng.gen_range(0..data.len() - BLOCK_SIZE);
for block_index in 0..BLOCK_SIZE {
xx[batch_index][block_index] = data[start + block_index];
yy[batch_index][block_index] = data[start + block_index + 1];
}
}
(xx, yy)
}
#[derive(Debug)]
pub struct BigramLanguageModel {
token_embedding_table: Embedding,
}
impl BigramLanguageModel {
// Constructor
pub fn new(vocab_size: usize, vb: VarBuilder) -> Result<Self> {
let token_embedding_table = embedding(vocab_size, vocab_size, vb)?;
Ok(BigramLanguageModel {
token_embedding_table,
})
}
// Forward pass
pub fn forward(&self, idx: &Tensor, targets: &Tensor) -> Result<(Tensor, Tensor)> {
let logits = self.token_embedding_table.forward(idx)?;
// Flatten the (batch, block, vocab) logits into (batch * block, vocab) for cross-entropy.
let shape = logits.shape().dims();
let logits = logits.reshape(&[shape[0] * shape[1], shape[2]])?;
println!("shape: {:?}", logits.shape());
println!("targets shape: {:?}", targets.shape().dims()[0]);
// Only compute the cross-entropy loss when real targets are provided; a single-element
// target is treated as a placeholder and yields a dummy zero loss instead.
if targets.shape().dims()[0] != 1 {
let targets = targets.reshape(&[shape[0] * shape[1]])?;
let loss = candle_nn::loss::cross_entropy(&logits, &targets)?;
Ok((logits, loss))
} else {
let loss = Tensor::zeros((1, 1), DType::F32, &Device::Cpu)?;
Ok((logits, loss))
}
}
}
fn main() -> Result<()> {
let dev = Device::Cpu;
let varmap = VarMap::new();
let vb = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
let m = BigramLanguageModel::new(VOCAB_SIZE, vb)?;
// The optimizer takes its variables from the same VarMap that backs the VarBuilder,
// so it updates exactly the variables created by the model above.
let optimizer = candle_nn::SGD::new(varmap.all_vars(), 0.3);
// Dummy training data: a repeating sequence of token ids.
let train_data = (0..1000).map(|i| i % VOCAB_SIZE as u32).collect::<Vec<_>>();
for _ in 0..100 {
let (x, y) = get_batch(&train_data);
let xb: Tensor = Tensor::new(&x, &dev)?;
let yb: Tensor = Tensor::new(&y, &dev)?;
let (_, loss) = m.forward(&xb, &yb)?;
optimizer.backward_step(&loss)?;
println!("loss: {:?}", loss);
}
Ok(())
}
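As a rough sketch of the two suggestions above, building on the script (not tested; the AdamW hyper-parameters are placeholders, and this assumes a candle version that exposes AdamW and the Optimizer trait):
// AdamW instead of SGD; the learning rate is set through ParamsAdamW.
use candle_nn::Optimizer; // trait providing backward_step for AdamW
let params = candle_nn::ParamsAdamW {
    lr: 1e-3,
    ..Default::default()
};
let mut optimizer = candle_nn::AdamW::new(varmap.all_vars(), params)?;

// get_batch wrapped in an iterator instead of being called by hand in the loop.
for (x, y) in std::iter::repeat_with(|| get_batch(&train_data)).take(100) {
    let xb = Tensor::new(&x, &dev)?;
    let yb = Tensor::new(&y, &dev)?;
    let (_, loss) = m.forward(&xb, &yb)?;
    optimizer.backward_step(&loss)?;
}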
Thanks @LaurentMazare, this is very useful. AdamW is my next step. I just wanted to build everything from scratch, so that different functions can be introduced at different steps. So a Rust iterator will be my next step.
I am curious about the inspiration behind VarBuilder. Does it act as an efficient store of neural-network weights even when the scope of the training function changes?
Also, I am not seeing any multi-core usage. How do I enable it? This is my htop output during training.
The VarBuilder is used to provide a model with variables, so you typically pass it to the functions that create model components. These functions can then retrieve a variable if it's loaded from disk, or generate a random variable if the model is to be initialized. Variables are kept together with their path, e.g. encoder.layer1.mlp.weight, which is used both when reading weights from disk and when saving trained weights to disk.
A typical example, derived from the mnist example:
use candle::{Result, Tensor};
use candle_nn::{Linear, VarBuilder};
// Placeholder dimensions taken from the mnist example (28x28 images, 10 classes).
// Depending on the candle version you may also need `use candle_nn::Module;` for `.forward()`.
const IMAGE_DIM: usize = 784;
const LABELS: usize = 10;
// This is actually already in candle_nn::linear.
fn linear(in_dim: usize, out_dim: usize, vs: VarBuilder) -> Result<Linear> {
let ws = vs.get_or_init((out_dim, in_dim), "weight", candle_nn::init::ZERO)?;
let bs = vs.get_or_init(out_dim, "bias", candle_nn::init::ZERO)?;
Ok(Linear::new(ws, Some(bs)))
}
struct Mlp {
ln1: Linear,
ln2: Linear,
}
impl Mlp {
fn new(vs: VarBuilder) -> Result<Self> {
let ln1 = linear(IMAGE_DIM, 100, vs.pp("ln1"))?;
let ln2 = linear(100, LABELS, vs.pp("ln2"))?;
Ok(Self { ln1, ln2 })
}
fn forward(&self, xs: &Tensor) -> Result<Tensor> {
let xs = self.ln1.forward(xs)?;
let xs = xs.relu()?;
self.ln2.forward(&xs)
}
}
This means that you can create a new Mlp by using a VarBuilder that is backed by a file for inference, e.g.:
let weights = unsafe { candle::safetensors::MmapedFile::new(weights_filename)? };
let weights = weights.deserialize()?;
let vb = VarBuilder::from_safetensors(vec![weights], DType::F32, &dev);
let model = Mlp::new(vb)?;
Or use a VarBuilder backed by a fresh VarMap if you want to train a model, via:
// For training
let varmap = VarMap::new();
let vb = VarBuilder::from_varmap(&varmap, DType::F32, &dev);
let model = Mlp::new(vb)?;
// ...train...
varmap.save("mlp.safetensors")?;
The embedding layer is by default not multithreaded as it's usually more memory bound than CPU bound (we're likely to revisit this when polishing things if it adds some performance). Matrix multiplications, convolutions, and other intensive ops should be using multiple cores.
Thank you @LaurentMazare. That makes sense.
Could you help me out with a few more queries?
- How would you use a manual seed in randn initialization?
- Assuming that let x = Tensor::zeros((B, T, C), DType::F32, &Device::Cpu)?;, how would you change the values at x.i((b, t))?
- Any chance you could prioritize an MPS implementation? There is a larger audience that will be moving from Llama.cpp and Rustformers/llm to this repo, where they will be looking at training and inference on a local device. An Apple Silicon backend will definitely encourage those users to use this repo.
- Manual seeding is a work in progress but is not available at the moment; we're trying to figure out a good way to do it (the rng should probably be part of the device, but it's annoying to have devices that are not Copy and that you have to pass by reference, so we have to think a bit more about it).
- You cannot mutate tensors, and that is on purpose: only variables can be mutated. Maybe you want to use something like mask.where_cond(x, y)? See the sketch below.
- We actually added support for accelerate over the last couple of days. It's probably not as good as having full metal support, but it should already provide a good speed-up. You can use it with --features accelerate; please let us know if you see anything weird going on with it, as it's certainly less tested than the rest.
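For illustration, a minimal sketch of the where_cond pattern for building an "edited" tensor without mutation; the hand-written u8 mask here is just an example, and any comparison op producing a mask would work too:
use candle::{DType, Device, Tensor};

fn where_cond_example() -> candle::Result<Tensor> {
    let dev = Device::Cpu;
    let x = Tensor::zeros((2, 3), DType::F32, &dev)?;
    let y = Tensor::ones((2, 3), DType::F32, &dev)?;
    // A u8 mask selecting which positions come from x (1) and which from y (0).
    let mask = Tensor::new(&[[1u8, 0, 1], [0, 1, 0]], &dev)?;
    // where_cond builds a new tensor rather than mutating x in place.
    mask.where_cond(&x, &y)
}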
Thanks @LaurentMazare. I will check it out.
One more question: are you using accelerate for BLAS, LAPACK, or BNNS? What kind of speed-ups have you observed?
Apologies, I guess you are using accelerate-src to get this going. I will dig into the implementation as I get a better command of the framework. Should I close this issue?
Right, on my linux x86 box I see mkl being regularly 2x or 3x faster. With accelerate, on the matmul benchmark from cpu_benchmarks.rs I saw a 5x speed-up, but that may be a very specific case. Feel free to close this issue if the original question was answered (and obviously open new ones if you have further/different questions, or re-open if you have more comments on the same topic).
Thanks @LaurentMazare