coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

License: Other

Rust 92.15% Cuda 7.58% GLSL 0.26% WGSL 0.02%
rust autograd autodiff machine-learning neural-network autodifferentiation rust-lang backpropagation tensor deep-learning deep-neural-networks cuda cuda-kernels cuda-support cuda-toolkit gpu gpu-acceleration gpu-computing cudnn

dfdx's Introduction

dfdx: shape checked deep learning in rust

crates.io docs.rs

Ergonomics & safety focused deep learning in Rust.

Still in pre-alpha state. The next few releases are planned to be breaking releases.

Features at a glance:

  1. 🔥 GPU accelerated tensor library with shapes up to 6d!
  2. Shapes with both compile and runtime sized dimensions. (e.g. Tensor<(usize, Const<10>)> and Tensor<Rank2<5, 10>>)
  3. A large library of tensor operations (including matmul, conv2d, and much more).
    1. All tensor operations shape and type checked at compile time!!
  4. Ergonomic neural network building blocks (like Linear, Conv2D, and Transformer).
  5. Standard deep learning optimizers such as Sgd, Adam, AdamW, RMSprop, and more.

dfdx is on crates.io! Use by adding this to your Cargo.toml:

dfdx = "0.13.0"

See the documentation at docs.rs/dfdx.


Design Goals

  1. Ergonomics the whole way down (both frontend interface & internals).
  2. Check as much at compile time as possible (i.e. don't compile if something is not correct).
  3. Maximize performance.
  4. Minimize unsafe code[1]
  5. Minimize Rc<RefCell> used in internal code[2]

[1] Currently the only unsafe calls are for matrix multiplication.

[2] The only things that use Arc are tensors to store their data. Arc is used instead of Box to reduce allocations when tensors are cloned.

GPU acceleration with CUDA

Enable the cuda feature to start using the Cuda device! This requires NVIDIA's CUDA toolkit to be installed. See the feature flags docs for more info.
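
For example, in Cargo.toml (same crate version as above; the full feature list is in the feature flags docs):

dfdx = { version = "0.13.0", features = ["cuda"] }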

API Preview

Check examples/ for more details.

  1. 👌 Simple Neural Networks API, completely shape checked at compile time.
type Mlp = (
    (Linear<10, 32>, ReLU),
    (Linear<32, 32>, ReLU),
    (Linear<32, 2>, Tanh),
);

fn main() {
    let dev: Cpu = Default::default(); // or `Cuda`
    let mlp = dev.build_module::<Mlp, f32>();
    let x: Tensor<Rank1<10>, f32, Cpu> = dev.zeros();
    let y: Tensor<Rank1<2>, f32, Cpu> = mlp.forward(x);
    mlp.save("checkpoint.npz").expect("failed to save checkpoint");
}
  2. 📈 Ergonomic Optimizer API
type Model = ...
let mut model = dev.build_module::<Model, f32>();
let mut grads = model.alloc_grads();
let mut sgd = Sgd::new(&model, SgdConfig {
    lr: 1e-2,
    momentum: Some(Momentum::Nesterov(0.9))
});

let loss = ...
grads = loss.backward();

sgd.update(&mut model, &grads);
  3. 💡 Const tensors can be converted to and from normal rust arrays
let t0: Tensor<Rank0, f32, _> = dev.tensor(0.0);
assert_eq!(t0.array(), 0.0);

let t1 /*: Tensor<Rank1<3>, f32, _>*/ = dev.tensor([1.0, 2.0, 3.0]);
assert_eq!(t1.array(), [1.0, 2.0, 3.0]);

let t2: Tensor<Rank2<2, 3>, f32, _> = dev.sample_normal();
assert_ne!(t2.array(), [[0.0; 3]; 2]);

Fun/notable implementation details

Module

pub trait Module<Input> {
    type Output;
    fn forward(&self, input: Input) -> Self::Output;
}

From this flexible trait we get:

  1. Single & batched inputs (just have multiple impls! See the sketch after this list.)
  2. Multiple inputs/outputs (multi-headed modules, or RNNs)
  3. Different behavior when a tape is present or not (not the .train()/.eval() toggles found in other libraries!).
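
Here is a self-contained toy sketch of point 1, using made-up Vector/Batch types rather than dfdx's real tensors, to show how one module can accept both single and batched inputs purely through extra impls:

// Repeating the trait above so this compiles standalone.
pub trait Module<Input> {
    type Output;
    fn forward(&self, input: Input) -> Self::Output;
}

// Toy types standing in for Tensor<Rank1<N>> and Tensor<Rank2<B, N>>.
struct Vector<const N: usize>([f32; N]);
struct Batch<const B: usize, const N: usize>([[f32; N]; B]);

/// A module that doubles every element.
struct Double;

// impl for a single (unbatched) input...
impl<const N: usize> Module<Vector<N>> for Double {
    type Output = Vector<N>;
    fn forward(&self, x: Vector<N>) -> Vector<N> {
        Vector(x.0.map(|v| v * 2.0))
    }
}

// ...and another impl for a batched input: same module, no special casing inside it.
impl<const B: usize, const N: usize> Module<Batch<B, N>> for Double {
    type Output = Batch<B, N>;
    fn forward(&self, x: Batch<B, N>) -> Batch<B, N> {
        Batch(x.0.map(|row| row.map(|v| v * 2.0)))
    }
}

fn main() {
    let m = Double;
    let _single = m.forward(Vector([1.0, 2.0, 3.0]));
    let _batched = m.forward(Batch([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]));
}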

Tuples represent feedforward (a.k.a sequential) modules

Since we can implement traits for tuples, which is not possible in other languages AFAIK, they provide a very nice frontend for sequentially executing modules.

// no idea why you would do this, but you could!
type Model = (ReLU, Sigmoid, Tanh);
let model = dev.build_module::<Model, f32>();

// or something more standard:
type Model = (Linear<10, 5>, Tanh);
let model = dev.build_module::<Model, f32>();

How implementing Module for a 2-tuple looks:

impl<Input, A, B> Module<Input> for (A, B)
where
    Input: Tensor,
    A: Module<Input>,        // A is a module that takes Input
    B: Module<A::Output>,    // B is a module that takes A's Output
{
    type Output = B::Output; // the output of this is B's Output
    fn forward(&self, x: Input) -> Self::Output {
        let x = self.0.forward(x);
        let x = self.1.forward(x);
        x
    }
}

Module is implemented for tuples of up to 6 elements, but you can arbitrarily nest them!

No Rc<RefCell<T>> used - the gradient tape is not kept behind a cell!

Other implementations may store a reference to the gradient tape directly on tensors, which requires mutating tensors or using Rc/RefCell all over the place.

We've figured out an elegant way to avoid this, reducing references and dynamic borrow checks to 0!

Since all operations result in exactly 1 child, we can always move the gradient tape to the child of the last operation. Additionally, no model parameters (all tensors) will ever own the gradient tape because they will never be the result of any operation. This means we know exactly which tensor owns the gradient tape, and the tensors that have it will always be intermediate results that don't need to be maintained across gradient computation.

All of this together gives users unprecedented control/precision over what tensors are recorded on the gradient tape!

One advanced use case requires that tensors be re-used multiple times in a computation graph. This can be handled by cloning the tensor, and manually moving the gradient tape around.
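
A self-contained toy sketch of the ownership idea (not dfdx's actual types or methods): every op consumes its input and hands the tape to its output, and backward consumes the tape's owner, so no Rc/RefCell is ever needed:

// A toy "tape" that just records which backward ops would run.
#[derive(Default)]
struct Tape(Vec<String>);

struct Tracked {
    value: f32,
    tape: Tape, // owned by exactly one value at a time
}

impl Tracked {
    fn square(mut self) -> Tracked {
        self.tape.0.push("d/dx x^2 = 2x".to_string());
        // the tape moves to the child of the operation
        Tracked { value: self.value * self.value, tape: self.tape }
    }
    fn backward(self) -> Vec<String> {
        // consuming self destructs the tape and yields the recorded ops
        self.tape.0
    }
}

fn main() {
    let x = Tracked { value: 3.0, tape: Tape::default() };
    let y = x.square();     // `x` is gone; `y` now owns the tape
    let ops = y.backward(); // only the current tape owner can call backward
    assert_eq!(ops.len(), 1);
}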

Type checked backward

tl;dr: If you forget to include a call to trace() or traced(), the program won't compile!

-let pred = module.forward(x);
+let pred = module.forward(x.traced(grads));
let loss = (y - pred).square().mean();
let gradients = loss.backward();

Since we know exactly what tensors own the gradient tape, we can require the tensor passed into .backward() to own the gradient tape! And further, we can require it be moved into .backward(), so it can destruct the tape and construct the gradients!

All of this can be checked at compile time 🎉

📄 Validated against pytorch

All functions & operations are tested against behavior shown by similar code in pytorch.

License

Dual-licensed to be compatible with the Rust project.

Licensed under the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 or the MIT license http://opensource.org/licenses/MIT, at your option. This file may not be copied, modified, or distributed except according to those terms.

dfdx's People

Contributors

cbournhonesque, ccaven, clstatham, coreylowman, daughterofmars, dimev, favilo, infalmo, inflectrix, jafioti, jcrist1, kstavro, leodog896, m1ngxu, narsil, nkoppel, opfromthestart, optman, quietlychris, rainiwu, swfsql, timerertim, timwedde, vasanthakumarv, vikigenius, viliamvadocz, xbagon, yannickfricke, yerke, zojeda


dfdx's Issues

Add Batch sampler utility class

Something that takes a usize length of the dataset, and you can:

  1. Sample batches of a const known size
  2. Iterate shuffled batches of const known size

Each of these would return a [usize; M], where M is a const generic parameter (const M: usize).
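
A rough sketch of what this utility could look like (all names here are made up for illustration, and it assumes the rand crate, 0.8):

use rand::seq::SliceRandom;
use rand::Rng;

struct BatchSampler {
    len: usize, // length of the dataset
}

impl BatchSampler {
    /// Sample one batch of `M` random indices (with replacement, for simplicity).
    fn sample<const M: usize, R: Rng>(&self, rng: &mut R) -> [usize; M] {
        core::array::from_fn(|_| rng.gen_range(0..self.len))
    }

    /// Iterate shuffled, non-overlapping batches of `M` indices, dropping the final partial batch.
    fn shuffled<const M: usize, R: Rng>(&self, rng: &mut R) -> impl Iterator<Item = [usize; M]> {
        let mut order: Vec<usize> = (0..self.len).collect();
        order.shuffle(rng);
        (0..self.len / M).map(move |b| core::array::from_fn(|i| order[b * M + i]))
    }
}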

Add `concatenate` function

Needed for #34. In multi head attention, you concatenate the output of all the single attention heads.

Note that this may require nightly, similar to #1, because we can't do expressions with const generics yet.
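
A sketch of why nightly comes up, using plain arrays instead of tensors: the output length is an expression over const generics (M + N), which needs the unstable generic_const_exprs feature:

#![feature(generic_const_exprs)]

fn concat<const M: usize, const N: usize>(a: [f32; M], b: [f32; N]) -> [f32; M + N]
where
    [(); M + N]:, // bound so the const expression is usable in the signature
{
    let mut out = [0.0; M + N];
    out[..M].copy_from_slice(&a);
    out[M..].copy_from_slice(&b);
    out
}

fn main() {
    let c = concat([1.0, 2.0], [3.0, 4.0, 5.0]);
    assert_eq!(c, [1.0, 2.0, 3.0, 4.0, 5.0]);
}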

Add something nn layer for multi head

This would be a variable sized head where the input to the module is duplicated and the same input is passed to all sub-modules.

Unclear how this would work since we are already using tuples. Perhaps something like:

impl Module<I> for MultiHead<(A, B)> {}
impl Module<I> for MultiHead<(A, B, C)> {}
impl Module<I> for MultiHead<(A, B, C, D)> {}
...

?

Add `gather_last_dim()`

This would accept an array of T::Reduced::ArrayType, where the Dtype is usize, and select the items from the last dimension that match up. It would return a Tensor::Reduced.

Example:

let t: Tensor2D<2, 3> = Tensor2D::new([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]);
let r: Tensor1D<2> = gather_last_dim(t, [0, 1]);
assert_eq!(r.data(), &[1.0, -2.0]);

Roadmap

0.9.0 - nightly conv nets & transformers:

  • Comparison against pytorch (patch version bump)
  • Misc other generic const exprs functions (patch version bump)

Done:

  • Released v0.5.1 - Mnist example with linear MLP
  • Released v0.5.2 - RL examples & save/load
  • Released v0.6.0 - transformers prep & other additions

add hard_cross_entropy

The current implementation only works for actual probability distributions. Hard cross entropy only has 1 non-zero entry in the inner dimension, so sum across that before taking the mean.
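
A plain-Rust sketch of the math (no dfdx tensors; names are made up): hard cross entropy is just -log softmax(logits)[class] averaged over the batch, since the target distribution has a single non-zero entry per row:

fn hard_cross_entropy(logits: &[Vec<f32>], targets: &[usize]) -> f32 {
    let mut total = 0.0;
    for (row, &class) in logits.iter().zip(targets) {
        // numerically stable log-softmax: x - max - ln(sum(exp(x - max)))
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let log_sum_exp = row.iter().map(|x| (x - max).exp()).sum::<f32>().ln();
        total += -(row[class] - max - log_sum_exp);
    }
    total / logits.len() as f32
}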

Clone for UniqueId should produce a different id

For safety & clarity reasons. If you clone a tensor for backprop, more often than not you want that to be a different tensor and for it to be treated separately during backprop.

For cases where you do want to keep the id the same, .duplicate() should be used.

The only place this really occurs is in kl_div_with_logits_loss, where target_probs is cloned since it's used twice.
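
A minimal sketch of the proposal (hypothetical code, not dfdx's actual UniqueId): Clone hands out a fresh id, while an explicit duplicate keeps the old one:

use std::sync::atomic::{AtomicUsize, Ordering};

static NEXT_ID: AtomicUsize = AtomicUsize::new(0);

#[derive(Debug, PartialEq)]
struct UniqueId(usize);

impl UniqueId {
    fn new() -> Self {
        UniqueId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
    }
    /// Keep the same id on purpose (the rare case, e.g. kl_div_with_logits_loss).
    fn duplicate(&self) -> Self {
        UniqueId(self.0)
    }
}

impl Clone for UniqueId {
    fn clone(&self) -> Self {
        // cloning yields a *different* id, so the clone is treated as a new tensor
        UniqueId::new()
    }
}

fn main() {
    let a = UniqueId::new();
    assert_ne!(a.clone(), a);     // clone gets a new id
    assert_eq!(a.duplicate(), a); // duplicate keeps it
}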

Select subset operation

something like

fn select<const S: usize>(self, inds: &[usize; S]) -> Self<S, ...>;

I imagine the gradients for this would just be 1 if i is in inds, otherwise 0
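
A plain-array sketch of the proposed op and its gradient (illustration only, not dfdx code): forward gathers the chosen indices, backward scatters incoming gradients back, so the local derivative is 1 for selected indices and 0 elsewhere:

fn select_forward<const S: usize>(data: &[f32], inds: &[usize; S]) -> [f32; S] {
    core::array::from_fn(|i| data[inds[i]])
}

fn select_backward<const S: usize>(len: usize, inds: &[usize; S], upstream: &[f32; S]) -> Vec<f32> {
    let mut grad = vec![0.0; len];
    for (i, &idx) in inds.iter().enumerate() {
        grad[idx] += upstream[i]; // 1 * upstream gradient where selected, 0 otherwise
    }
    grad
}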

Use OpenBLAS/BLAS/IntelMKL for matrix multiplication

Currently using the matrixmultiply crate, but I think performance could be much improved by using an actual BLAS library. It's unclear how compiling/including that works, since it has to be compiled per machine.

Transformers mega issue

Would like to add a small example of using a transformer architecture. This will likely involve new features such as batched matmul and maybe some others.

Randomize parameters based on parameter size

E.g. for xavier uniform initialization you need to know the in size & out size.

This will likely require a different trait than Randomize, and I'm still inclined to keep randomize. It'll also be slightly easier to use since the user won't have to pass in a distribution.

Options:

  • model.reset_params(&mut rng);
  • model.init_params(&mut rng);
  • model.randomize_params(&mut rng);

This should use Tensor::randomize() under the hood.
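
For reference, a sketch of why the init needs the parameter size (the function name and rand 0.8 usage are mine): Xavier/Glorot uniform draws from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)):

use rand::Rng;

fn xavier_uniform<R: Rng>(rng: &mut R, fan_in: usize, fan_out: usize, out: &mut [f32]) {
    let limit = (6.0 / (fan_in + fan_out) as f32).sqrt();
    for w in out.iter_mut() {
        *w = rng.gen_range(-limit..limit);
    }
}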

Save/load from numpy file

This will need:

  • Write single tensor to .npy file
  • Create a zip with multiple files from a struct
  • Ability to read a single np array from file into a tensor
  • Ability to read a collection of np arrays into an arbitrarily nested struct of tensors
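
A from-scratch sketch of the first bullet (writing one f32 tensor to a .npy file) using only std; in practice an existing crate could do this instead. The layout below follows the NPY v1.0 format: magic bytes, version, header length, a python-dict header, then raw little-endian data:

use std::fs::File;
use std::io::{self, Write};

fn write_npy_f32(path: &str, shape: &[usize], data: &[f32]) -> io::Result<()> {
    assert_eq!(shape.iter().product::<usize>(), data.len());
    let dims: Vec<String> = shape.iter().map(|d| d.to_string()).collect();
    // trailing comma keeps 1-d shapes valid python tuples, e.g. "(3,)"
    let shape_str = if shape.is_empty() { "()".to_string() } else { format!("({},)", dims.join(", ")) };
    let mut header = format!("{{'descr': '<f4', 'fortran_order': False, 'shape': {}, }}", shape_str);
    // pad with spaces so magic(6) + version(2) + len(2) + header is a multiple of 64,
    // then terminate the header with a newline as the format requires
    let unpadded = 6 + 2 + 2 + header.len() + 1;
    header.push_str(&" ".repeat((64 - unpadded % 64) % 64));
    header.push('\n');

    let mut f = File::create(path)?;
    f.write_all(b"\x93NUMPY\x01\x00")?;
    f.write_all(&(header.len() as u16).to_le_bytes())?;
    f.write_all(header.as_bytes())?;
    for x in data {
        f.write_all(&x.to_le_bytes())?; // little-endian f32, matching '<f4'
    }
    Ok(())
}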

Add `nn::DropoutOneIn<N>`

Ideally we'd have p be a const parameter; unfortunately, f32 const parameters aren't available on stable.

Many use cases set p to 1 / N, where N is just an integer.

DropoutOneIn<N> would set p to 1.0 / N as f32 for now.
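
A sketch of the idea (hypothetical code): since f32 const parameters aren't stable, encode the probability as an integer N and derive p = 1/N at runtime:

struct DropoutOneIn<const N: usize>;

impl<const N: usize> DropoutOneIn<N> {
    fn p(&self) -> f32 {
        1.0 / N as f32
    }
}

fn main() {
    assert_eq!(DropoutOneIn::<5>.p(), 0.2); // drops each element with probability 1/5
}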

Add multiple dtype support

This will be another generic parameter on all tensors. Most existing operations will likely require the dtype to be a float.

Related to #9 since it involves an additional generic parameter

GPU Mega Issue

There's a lot of work to be done here. Very rough list of todos:

  • Preparation

    • Move map functions to devices #199
    • Move conv to devices #198
    • Add where clauses for map functions to make partial progress on kernels possible (so we can start using cuda without all ops implemented)
  • Devices

    • Add Cuda device that wraps cudarc::CudaDevice and an rng
    • Add StdRng to Cpu
    • Add rng seed to device construction
    • Add two GATs to device trait: DeviceArc and DeviceRng
      • Add CpuRc which contains Arc<T> and Arc<Cpu>
  • Tensors

    • Add Device to all tensor structs
    • TensorCreator should accept &Device as parameter, and remove Rng since that will be accessed through device
    • Move Device to generic argument of Tensors
    • Enable moving tensors between devices
  • nn

    • Add trait ModuleCreator
      • Add ModuleCreator::zeros(Device)
      • Add ModuleCreator::default(Device) which calls zeros & reset params
    • Remove implementations for Default
    • Remove rng parameter from ResetParams, should use tensor's devices
  • Kernels

    • Add trait LaunchKernel<K, Args>
    • Move all Cpu traits to a combo of impl LaunchKernel<...> for Cpu and trait <Kernel>CpuImpl/impl <Kernel>CpuImpl for <Kernel>. See cudarc/examples/kernels.rs
    • (In a separate crate) proc macro that wraps around kernels and maps them to something usable for ptx compiling (e.g. kernel!(|a, b, c| { *a = b + c })) (#185)
    • Look into when/how to build the kernels (compile time hopefully??) (#184)
  • Testing

    • Add feature based device construction in all tests (something like #[cfg(feature = "test-cuda")]) that, when specified, uses cuda instead of cpu?
    • Add a build_test_device!() macro that uses the testing features to create the device

Done:

  • Is it even possible to compile a rust closure to a cuda kernel? Assuming a very small set of supported operations. Is this worth the maintainability?
    • If we go the fixed set of functions route, how many different generic closures does dfdx use currently?
    • ANSWER: Yes, it is possible (the rust-cuda project does it), but it will take some work. Automatic closure-to-kernel conversion is probably the direction I'll be trying to go, since hand-building all the cuda kernels next to the cpu closures seems like too much work.
  • What functionality does nvidia provide for deep learning already? Assuming matmul & conv forward/backward. How do we use these?
    • ANSWER: cudnn. All tensors are 4d and it supports a base set of operations. Probably not what we want to depend on tbh, since it doesn't support everything we would need on GPU (e.g. optimizer kernels).

Add `max_last_dim()`

This would reduce the last dim to the maximum value in that dimension. It can use T::Device::reduce_last_dim(..., &mut f32::max) (see logsumexp for an example using that).

Example:

let t: Tensor2D<2, 3> = Tensor2D::new([[1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]]);
let r: Tensor1D<2> = max_last_dim(t);
assert_eq!(r.data(), &[3.0, -1.0]);
