coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

License: Other

Rust 92.15% Cuda 7.58% GLSL 0.26% WGSL 0.02%
rust autograd autodiff machine-learning neural-network autodifferentiation rust-lang backpropagation tensor deep-learning deep-neural-networks cuda cuda-kernels cuda-support cuda-toolkit gpu gpu-acceleration gpu-computing cudnn

dfdx's Introduction

dfdx: shape checked deep learning in rust

crates.io docs.rs

Ergonomics & safety focused deep learning in Rust.

Still in pre-alpha state. The next few releases are planned to be breaking releases.

Features at a glance:

  1. 🔥 GPU accelerated tensor library with shapes up to 6d!
  2. Shapes with both compile and runtime sized dimensions. (e.g. Tensor<(usize, Const<10>)> and Tensor<Rank2<5, 10>>)
  3. A large library of tensor operations (including matmul, conv2d, and much more).
    1. All tensor operations shape and type checked at compile time!!
  4. Ergonomic neural network building blocks (like Linear, Conv2D, and Transformer).
  5. Standard deep learning optimizers such as Sgd, Adam, AdamW, RMSprop, and more.

dfdx is on crates.io! Use by adding this to your Cargo.toml:

dfdx = "0.13.0"

See the documentation at docs.rs/dfdx.


Design Goals

  1. Ergonomics the whole way down (both frontend interface & internals).
  2. Check as much at compile time as possible (i.e. don't compile if something is not correct).
  3. Maximize performance.
  4. Minimize unsafe code[1]
  5. Minimize Rc<RefCell> used in internal code[2]

[1] Currently the only unsafe calls are for matrix multiplication.

[2] The only things that use Arc are tensors to store their data. Arc is used instead of Box to reduce allocations when tensors are cloned.

GPU acceleration with CUDA

Enable the cuda feature to start using the Cuda device! This requires NVIDIA's CUDA toolkit to be installed. See the feature flags docs for more info.
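
For example, in Cargo.toml (same crate version as above; the full feature list is in the feature flags docs):

dfdx = { version = "0.13.0", features = ["cuda"] }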

API Preview

Check examples/ for more details.

  1. 👌 Simple Neural Networks API, completely shape checked at compile time.
type Mlp = (
    (Linear<10, 32>, ReLU),
    (Linear<32, 32>, ReLU),
    (Linear<32, 2>, Tanh),
);

fn main() {
    let dev: Cpu = Default::default(); // or `Cuda`
    let mlp = dev.build_module::<Mlp, f32>();
    let x: Tensor<Rank1<10>, f32, Cpu> = dev.zeros();
    let y: Tensor<Rank1<2>, f32, Cpu> = mlp.forward(x);
    mlp.save("checkpoint.npz").expect("failed to save checkpoint");
}
  2. 📈 Ergonomic Optimizer API
type Model = ...
let mut model = dev.build_module::<Model, f32>();
let mut grads = model.alloc_grads();
let mut sgd = Sgd::new(&model, SgdConfig {
    lr: 1e-2,
    momentum: Some(Momentum::Nesterov(0.9))
});

let loss = ...
grads = loss.backward();

sgd.update(&mut model, &grads);
  3. 💡 Const tensors can be converted to and from normal rust arrays
let t0: Tensor<Rank0, f32, _> = dev.tensor(0.0);
assert_eq!(t0.array(), 0.0);

let t1 /*: Tensor<Rank1<3>, f32, _>*/ = dev.tensor([1.0, 2.0, 3.0]);
assert_eq!(t1.array(), [1.0, 2.0, 3.0]);

let t2: Tensor<Rank2<2, 3>, f32, _> = dev.sample_normal();
assert_ne!(t2.array(), [[0.0; 3]; 2]);

Fun/notable implementation details

Module

pub trait Module<Input> {
    type Output;
    fn forward(&self, input: Input) -> Self::Output;
}

From this flexible trait we get:

  1. Single & batched inputs (just have multiple impls! See the sketch after this list.)
  2. Multiple inputs/outputs (multi-headed modules, or RNNs)
  3. Different behavior when a tape is present or not (not the .train()/.eval() toggles found in other libraries!).
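
Here is a self-contained toy sketch of point 1, using made-up Vector/Batch types rather than dfdx's real tensors, to show how one module can accept both single and batched inputs purely through extra impls:

// Repeating the trait above so this compiles standalone.
pub trait Module<Input> {
    type Output;
    fn forward(&self, input: Input) -> Self::Output;
}

// Toy types standing in for Tensor<Rank1<N>> and Tensor<Rank2<B, N>>.
struct Vector<const N: usize>([f32; N]);
struct Batch<const B: usize, const N: usize>([[f32; N]; B]);

/// A module that doubles every element.
struct Double;

// impl for a single (unbatched) input...
impl<const N: usize> Module<Vector<N>> for Double {
    type Output = Vector<N>;
    fn forward(&self, x: Vector<N>) -> Vector<N> {
        Vector(x.0.map(|v| v * 2.0))
    }
}

// ...and another impl for a batched input: same module, no special casing inside it.
impl<const B: usize, const N: usize> Module<Batch<B, N>> for Double {
    type Output = Batch<B, N>;
    fn forward(&self, x: Batch<B, N>) -> Batch<B, N> {
        Batch(x.0.map(|row| row.map(|v| v * 2.0)))
    }
}

fn main() {
    let m = Double;
    let _single = m.forward(Vector([1.0, 2.0, 3.0]));
    let _batched = m.forward(Batch([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]));
}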

Tuples represent feedforward (a.k.a sequential) modules

Since we can implement traits for tuples, which is not possible in other languages AFAIK, they provide a very nice frontend for sequentially executing modules.

// no idea why you would do this, but you could!
type Model = (ReLU, Sigmoid, Tanh);
let model = dev.build_module::<Model, f32>();

// or something more standard:
type Model = (Linear<10, 5>, Tanh);
let model = dev.build_module::<Model, f32>();

How implementing Module for a 2-tuple looks:

impl<Input, A, B> Module<Input> for (A, B)
where
    Input: Tensor,
    A: Module<Input>,        // A is a module that takes Input
    B: Module<A::Output>,    // B is a module that takes A's Output
{
    type Output = B::Output; // the output of this is B's Output
    fn forward(&self, x: Input) -> Self::Output {
        let x = self.0.forward(x);
        let x = self.1.forward(x);
        x
    }
}

Module is implemented for tuples of up to 6 elements, but you can arbitrarily nest them!

No Rc<RefCell<T>> used - the gradient tape is not kept behind a cell!

Other implementations may store a reference to the gradient tape directly on tensors, which requires mutating tensors or using Rc/RefCell all over the place.

We've figured out an elegant way to avoid this, reducing references and dynamic borrow checks to 0!

Since all operations result in exactly 1 child, we can always move the gradient tape to the child of the last operation. Additionally, no model parameters (all tensors) will ever own the gradient tape because they will never be the result of any operation. This means we know exactly which tensor owns the gradient tape, and the tensors that have it will always be intermediate results that don't need to be maintained across gradient computation.

All of this together gives users unprecedented control/precision over what tensors are recorded on the gradient tape!

One advanced use case requires that tensors be re-used multiple times in a computation graph. This can be handled by cloning the tensor, and manually moving the gradient tape around.
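
A self-contained toy sketch of the ownership idea (not dfdx's actual types or methods): every op consumes its input and hands the tape to its output, and backward consumes the tape's owner, so no Rc/RefCell is ever needed:

// A toy "tape" that just records which backward ops would run.
#[derive(Default)]
struct Tape(Vec<String>);

struct Tracked {
    value: f32,
    tape: Tape, // owned by exactly one value at a time
}

impl Tracked {
    fn square(mut self) -> Tracked {
        self.tape.0.push("d/dx x^2 = 2x".to_string());
        // the tape moves to the child of the operation
        Tracked { value: self.value * self.value, tape: self.tape }
    }
    fn backward(self) -> Vec<String> {
        // consuming self destructs the tape and yields the recorded ops
        self.tape.0
    }
}

fn main() {
    let x = Tracked { value: 3.0, tape: Tape::default() };
    let y = x.square();     // `x` is gone; `y` now owns the tape
    let ops = y.backward(); // only the current tape owner can call backward
    assert_eq!(ops.len(), 1);
}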

Type checked backward

tl;dr: If you forget to include a call to trace() or traced(), the program won't compile!

-let pred = module.forward(x);
+let pred = module.forward(x.traced(grads));
let loss = (y - pred).square().mean();
let gradients = loss.backward();

Since we know exactly what tensors own the gradient tape, we can require the tensor passed into .backward() to own the gradient tape! And further, we can require it be moved into .backward(), so it can destruct the tape and construct the gradients!

All of this can be checked at compile time 🎉

📄 Validated against pytorch

All functions & operations are tested against behavior shown by similar code in pytorch.

License

Dual-licensed to be compatible with the Rust project.

Licensed under the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 or the MIT license http://opensource.org/licenses/MIT, at your option. This file may not be copied, modified, or distributed except according to those terms.

dfdx's People

Contributors

cbournhonesque, ccaven, clstatham, coreylowman, daughterofmars, dimev, favilo, infalmo, inflectrix, jafioti, jcrist1, kstavro, leodog896, m1ngxu, narsil, nkoppel, opfromthestart, optman, quietlychris, rainiwu, swfsql, timerertim, timwedde, vasanthakumarv, vikigenius, viliamvadocz, xbagon, yannickfricke, yerke, zojeda


dfdx's Issues

Add Batch sampler utility class

Something that takes a usize length of the dataset, and you can:

  1. Sample batches of a const known size
  2. Iterate shuffled batches of const known size

Each of these would return a [usize; M], where M is a const generic parameter (const M: usize).
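
A rough sketch of what this utility could look like (all names here are made up for illustration, and it assumes the rand crate, 0.8):

use rand::seq::SliceRandom;
use rand::Rng;

struct BatchSampler {
    len: usize, // length of the dataset
}

impl BatchSampler {
    /// Sample one batch of `M` random indices (with replacement, for simplicity).
    fn sample<const M: usize, R: Rng>(&self, rng: &mut R) -> [usize; M] {
        core::array::from_fn(|_| rng.gen_range(0..self.len))
    }

    /// Iterate shuffled, non-overlapping batches of `M` indices, dropping the final partial batch.
    fn shuffled<const M: usize, R: Rng>(&self, rng: &mut R) -> impl Iterator<Item = [usize; M]> {
        let mut order: Vec<usize> = (0..self.len).collect();
        order.shuffle(rng);
        (0..self.len / M).map(move |b| core::array::from_fn(|i| order[b * M + i]))
    }
}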

Add `concatenate` function

Needed for #34. In multi head attention, you concatenate the output of all the single attention heads.

Note that this may require nightly, similar to #1, because we can't do expressions with const generics yet.
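
A sketch of why nightly comes up, using plain arrays instead of tensors: the output length is an expression over const generics (M + N), which needs the unstable generic_const_exprs feature:

#![feature(generic_const_exprs)]

fn concat<const M: usize, const N: usize>(a: [f32; M], b: [f32; N]) -> [f32; M + N]
where
    [(); M + N]:, // bound so the const expression is usable in the signature
{
    let mut out = [0.0; M + N];
    out[..M].copy_from_slice(&a);
    out[M..].copy_from_slice(&b);
    out
}

fn main() {
    let c = concat([1.0, 2.0], [3.0, 4.0, 5.0]);
    assert_eq!(c, [1.0, 2.0, 3.0, 4.0, 5.0]);
}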

Add something nn layer for multi head

This would be a variable sized head where the input to the module is duplicated and the same input is passed to all sub-modules.

Unclear how this would work since we are already using tuples. Perhaps something like:

impl Module<I> for MultiHead<(A, B)> {}
impl Module<I> for MultiHead<(A, B, C)> {}
impl Module<I> for MultiHead<(A, B, C, D)> {}
...

?

Add `gather_last_dim()`

This would accept an array of T::Reduced::ArrayType, where the Dtype is usize, and select the items from the last dimension that match up. It would return a Tensor::Reduced.

Example:

let t: Tensor2D<2, 3> = Tensor2D::new([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]);
let r: Tensor1D<2> = gather_last_dim(t, [0, 1]);
assert_eq!(r.data(), &[1.0, -2.0]);

Roadmap

0.9.0 - nightly conv nets & transformers:

  • Comparison against pytorch (patch version bump)
  • Misc other generic const exprs functions (patch version bump)

Done:

  • Released v0.5.1 - Mnist example with linear MLP
  • Released v0.5.2 - RL examples & save/load
  • Released v0.6.0 - transformers prep & other additions

add hard_cross_entropy

The current implementation only works for actual probability distributions. Hard cross entropy only has 1 non-zero entry in the inner dimension, so sum across that before taking the mean.
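
A plain-Rust sketch of the math (no dfdx tensors; names are made up): hard cross entropy is just -log softmax(logits)[class] averaged over the batch, since the target distribution has a single non-zero entry per row:

fn hard_cross_entropy(logits: &[Vec<f32>], targets: &[usize]) -> f32 {
    let mut total = 0.0;
    for (row, &class) in logits.iter().zip(targets) {
        // numerically stable log-softmax: x - max - ln(sum(exp(x - max)))
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let log_sum_exp = row.iter().map(|x| (x - max).exp()).sum::<f32>().ln();
        total += -(row[class] - max - log_sum_exp);
    }
    total / logits.len() as f32
}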

Clone for UniqueId should produce a different id

For safety & clarity reasons. If you clone a tensor for backprop, more often than not you want that to be a different tensor and for it to be treated separately during backprop.

For cases where you do want to keep the id the same, .duplicate() should be used.

The only place this really occurs is in kl_div_with_logits_loss, where target_probs is cloned since it's used twice.
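
A minimal sketch of the proposal (hypothetical code, not dfdx's actual UniqueId): Clone hands out a fresh id, while an explicit duplicate keeps the old one:

use std::sync::atomic::{AtomicUsize, Ordering};

static NEXT_ID: AtomicUsize = AtomicUsize::new(0);

#[derive(Debug, PartialEq)]
struct UniqueId(usize);

impl UniqueId {
    fn new() -> Self {
        UniqueId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
    }
    /// Keep the same id on purpose (the rare case, e.g. kl_div_with_logits_loss).
    fn duplicate(&self) -> Self {
        UniqueId(self.0)
    }
}

impl Clone for UniqueId {
    fn clone(&self) -> Self {
        // cloning yields a *different* id, so the clone is treated as a new tensor
        UniqueId::new()
    }
}

fn main() {
    let a = UniqueId::new();
    assert_ne!(a.clone(), a);     // clone gets a new id
    assert_eq!(a.duplicate(), a); // duplicate keeps it
}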

Select subset operation

something like

fn select<const S: usize>(self, inds: &[usize; S]) -> Self<S, ...>;

I imagine the gradients for this would just be 1 if i is in inds, otherwise 0
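
A plain-array sketch of the proposed op and its gradient (illustration only, not dfdx code): forward gathers the chosen indices, backward scatters incoming gradients back, so the local derivative is 1 for selected indices and 0 elsewhere:

fn select_forward<const S: usize>(data: &[f32], inds: &[usize; S]) -> [f32; S] {
    core::array::from_fn(|i| data[inds[i]])
}

fn select_backward<const S: usize>(len: usize, inds: &[usize; S], upstream: &[f32; S]) -> Vec<f32> {
    let mut grad = vec![0.0; len];
    for (i, &idx) in inds.iter().enumerate() {
        grad[idx] += upstream[i]; // 1 * upstream gradient where selected, 0 otherwise
    }
    grad
}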

Use OpenBLAS/BLAS/IntelMKL for matrix multiplication

Currently using the matrixmultiply crate, but I think performance could be much improved by using an actual BLAS library. It's unclear how compiling/including that works, since it has to be compiled per machine.

Transformers mega issue

Would like to add a small example of using a transformer architecture. This will likely involve new features such as batched matmul and maybe some others.

Randomize parameters based on parameter size

E.g. for xavier uniform initialization you need to know the in size & out size.

This will likely require a different trait than Randomize, and I'm still inclined to keep randomize. It'll also be slightly easier to use since the user won't have to pass in a distribution.

Options:

  • model.reset_params(&mut rng);
  • model.init_params(&mut rng);
  • model.randomize_params(&mut rng);

This should use Tensor::randomize() under the hood.
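
For reference, a sketch of why the init needs the parameter size (the function name and rand 0.8 usage are mine): Xavier/Glorot uniform draws from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)):

use rand::Rng;

fn xavier_uniform<R: Rng>(rng: &mut R, fan_in: usize, fan_out: usize, out: &mut [f32]) {
    let limit = (6.0 / (fan_in + fan_out) as f32).sqrt();
    for w in out.iter_mut() {
        *w = rng.gen_range(-limit..limit);
    }
}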

Save/load from numpy file

This will need:

  • Write single tensor to .npy file
  • Create a zip with multiple files from a struct
  • Ability to read a single np array from file into a tensor
  • Ability to read a collection of np arrays into an arbitrarily nested struct of tensors
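
A from-scratch sketch of the first bullet (writing one f32 tensor to a .npy file) using only std; in practice an existing crate could do this instead. The layout below follows the NPY v1.0 format: magic bytes, version, header length, a python-dict header, then raw little-endian data:

use std::fs::File;
use std::io::{self, Write};

fn write_npy_f32(path: &str, shape: &[usize], data: &[f32]) -> io::Result<()> {
    assert_eq!(shape.iter().product::<usize>(), data.len());
    let dims: Vec<String> = shape.iter().map(|d| d.to_string()).collect();
    // trailing comma keeps 1-d shapes valid python tuples, e.g. "(3,)"
    let shape_str = if shape.is_empty() { "()".to_string() } else { format!("({},)", dims.join(", ")) };
    let mut header = format!("{{'descr': '<f4', 'fortran_order': False, 'shape': {}, }}", shape_str);
    // pad with spaces so magic(6) + version(2) + len(2) + header is a multiple of 64,
    // then terminate the header with a newline as the format requires
    let unpadded = 6 + 2 + 2 + header.len() + 1;
    header.push_str(&" ".repeat((64 - unpadded % 64) % 64));
    header.push('\n');

    let mut f = File::create(path)?;
    f.write_all(b"\x93NUMPY\x01\x00")?;
    f.write_all(&(header.len() as u16).to_le_bytes())?;
    f.write_all(header.as_bytes())?;
    for x in data {
        f.write_all(&x.to_le_bytes())?; // little-endian f32, matching '<f4'
    }
    Ok(())
}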

Add `nn::DropoutOneIn<N>`

Ideally we'd have p be a const parameter; unfortunately, f32 const parameters aren't available on stable.

Many use cases set p to 1 / N, where N is just an integer.

DropoutOneIn<N> would set p to 1.0 / N as f32 for now.
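
A sketch of the idea (hypothetical code): since f32 const parameters aren't stable, encode the probability as an integer N and derive p = 1/N at runtime:

struct DropoutOneIn<const N: usize>;

impl<const N: usize> DropoutOneIn<N> {
    fn p(&self) -> f32 {
        1.0 / N as f32
    }
}

fn main() {
    assert_eq!(DropoutOneIn::<5>.p(), 0.2); // drops each element with probability 1/5
}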

Add multiple dtype support

This will be another generic parameter on all tensors. Most existing operations will likely require the dtype to be a float.

Related to #9 since it involves an additional generic parameter

GPU Mega Issue

There's a lot of work to be done here. Very rough list of todos:

  • Preparation

    • Move map functions to devices #199
    • Move conv to devices #198
    • Add where clauses for map functions to make partial progress on kernels possible (so we can start using cuda without all ops implemented)
  • Devices

    • Add Cuda device that wraps cudarc::CudaDevice and an rng
    • Add StdRng to Cpu
    • Add rng seed to device construction
    • Add two GATs to device trait: DeviceArc and DeviceRng
      • Add CpuRc which contains Arc<T> and Arc<Cpu>
  • Tensors

    • Add Device to all tensor structs
    • TensorCreator should accept &Device as parameter, and remove Rng since that will be accessed through device
    • Move Device to generic argument of Tensors
    • Enable moving tensors between devices
  • nn

    • Add trait ModuleCreator
      • Add ModuleCreator::zeros(Device)
      • Add ModuleCreator::default(Device) which calls zeros & reset params
    • Remove implementations for Default
    • Remove rng parameter from ResetParams, should use tensor's devices
  • Kernels

    • Add trait LaunchKernel<K, Args>
    • Move all Cpu traits to a combo of impl LaunchKernel<...> for Cpu and trait <Kernel>CpuImpl/impl <Kernel>CpuImpl for <Kernel>. See cudarc/examples/kernels.rs
    • (In a separate crate) proc macro that wraps around kernels and maps them to something usable for ptx compiling (e.g. kernel!(|a, b, c| { *a = b + c })) (#185)
    • Look into when/how to build the kernels (compile time hopefully??) (#184)
  • Testing

    • Add feature based device construction in all tests (something like #[cfg(feature = "test-cuda")]) that, when specified, uses cuda instead of cpu?
    • Add a build_test_device!() macro that uses the testing features to create the device

Done:

  • Is it even possible to compile a rust closure to a cuda kernel? Assuming a very small set of supported operations. Is this worth the maintainability?
    • If we go the fixed set of functions route, how many different generic closures does dfdx use currently?
    • ANSWER: Yes, it is possible (the rust-cuda project does it), but it will take some work. Automatic closure-to-kernel conversion is probably the direction I'll be trying to go, since hand-building all the cuda kernels next to the cpu closures seems like too much work.
  • What functionality does nvidia provide for deep learning already? Assuming matmul & conv forward/backward. How do we use these?
    • ANSWER: cudnn. All tensors are 4d and it supports a base set of operations. Probably not what we want to depend on tbh, since it doesn't support everything we would need on GPU (e.g. optimizer kernels).

Add `max_last_dim()`

This would reduce the last dim to the maximum value in that dimension. It can use T::Device::reduce_last_dim(..., &mut f32::max) (see logsumexp for an example using that).

Example:

let t: Tensor2D<2, 3> = Tensor2D::new([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]);
let r: Tensor1D<2> = max_last_dim(t);
assert_eq!(r.data(), &[3.0, -1.0]);
