Comments (25)
Tensor backend - 1. Mutable operations on slices
References:
Currently, slicing returns immutable tensors, and you need an extra var assignment to mutate the original tensor, which is quite burdensome.
I don't remember what the errors were; IIRC tests gave wrong results and it was painful to debug.
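To make the ergonomics concrete, here is a minimal sketch of the copy-out / write-back dance (using toTensor, apply_inline and slice assignment from Arraymancer's public API; exact slice semantics may differ between versions):

    import arraymancer

    var t = [[1, 2, 3],
             [4, 5, 6]].toTensor

    # Desired: mutate row 0 directly through the slice, e.g.
    #   t[0, _].apply_inline(x * 10)
    # Workaround today: copy the slice out, mutate the copy, write it back.
    var row = t[0, _]
    row.apply_inline(x * 10)
    t[0, _] = row

    echo t  # row 0 is now 10, 20, 30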
from arraymancer.
Tensor backend - 2. Nested parallelism
We currently cannot benefit from nested parallelism in Arraymancer: a parallel function calling another parallel function causes oversubscription due to OpenMP limitations, for example in the Frobenius inner product:
Arraymancer/src/arraymancer/tensor/lapack.nim
Lines 19 to 22 in f809e7f
Slowness of reduction operations like sigmoid or softmax the more cores you have
The solution is switching to Weave (or Constantine's threadpool) as the backend.
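As a sketch of why a work-stealing runtime composes where OpenMP does not, here is a recursive divide-and-conquer dot product using Weave's init/exit, spawn and sync API (illustrative only, not Arraymancer code; the cutoff and raw-pointer plumbing are just for the example):

    import weave

    proc dot(a, b: ptr UncheckedArray[float64], lo, hi: int): float64 =
      # Each half may run as a task; tasks can freely spawn more tasks,
      # the scheduler load-balances instead of oversubscribing threads.
      if hi - lo < 4096:
        for i in lo ..< hi:
          result += a[i] * b[i]
      else:
        let mid = (lo + hi) div 2
        let left = spawn dot(a, b, lo, mid)   # may be stolen by another worker
        let right = dot(a, b, mid, hi)        # keep working on this thread
        result = sync(left) + right

    init(Weave)
    var x = newSeq[float64](1_000_000)
    for v in x.mitems: v = 1.0
    let p = cast[ptr UncheckedArray[float64]](x[0].addr)
    echo dot(p, p, 0, x.len)                  # 1000000.0
    exit(Weave)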
Related:
from arraymancer.
Tensor backend - 3. Doc generation
Doc generation is a pain point, see wishlist #488 (comment) and haxscramper/haxdoc#1
though having those auto-generated in CI is awesome: #556
from arraymancer.
Tensor backend - 4. Versioning / releases
Reserved
from arraymancer.
Tensor backend - 5. Slow transcendental functions (exp, log, sin, cos, tan, tanh, ...)
Background: #265 (comment)
While working on Laser I've looked into many areas and found huge slowness in the exp and log functions from the C standard library <math.h>, which notably bottlenecks sigmoid and softmax. On my machine with all optimisations, <math.h> can achieve about 160 million exponentials per second per thread, while the fastest implementation can achieve 1.3 billion (about 10x faster).
PyTorch is not using the fastest implementation, but theirs can still achieve 900 million exponentials per second. https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/avx_mathfun.h
I.e. I can't trust anyone :/.
There is a tradeoff between accuracy, which is very much needed for some applications (e.g. weather forecasting, where errors compound exponentially), and speed for others, typically deep learning, where quantizing to 4-bit is actually fine.
See 8-way accuracy/speed comparison for AVX2 impl of exponentiation: https://github.com/mratsim/laser/blob/master/benchmarks/vector_math/bench_exp_avx2.nim#L282-L470
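For reference, the <math.h> throughput on a given machine can be reproduced with a trivial single-threaded micro-benchmark like the sketch below (std/math only; the Laser benchmark linked above is the proper cross-implementation comparison):

    import std/[math, monotimes, times]

    proc expPerSecond(n: int): float =
      ## Rough single-thread throughput of the C library exp (Nim's math.exp).
      var acc = 0.0
      let start = getMonoTime()
      for i in 0 ..< n:
        acc += exp(float(i mod 1000) * 0.001)
      let seconds = (getMonoTime() - start).inNanoseconds.float / 1e9
      doAssert acc > 0.0  # keep the loop from being optimized away
      result = n.float / seconds

    echo expPerSecond(100_000_000), " exp/s"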
Either this, nested parallelism, or both are a significant bottleneck for the Shakespeare char-rnn demo.
from arraymancer.
Tensor backend - 6, 7, 8 OS-specific woes
On Windows, BLAS and Lapack are a pain to deploy. Even in CI I did not figure out how to install Lapack:
Arraymancer/.github/workflows/ci.yml
Lines 26 to 30 in f809e7f
See similar woes here: https://forum.nim-lang.org/t/10812
On MacOS, Apple Clang doesn't support OpenMP and requires installing LLVM via Homebrew.
The solution to both would be to implement a pure Nim BLAS backed by a pure Nim multithreading runtime like Weave.
The hardest primitive, matrix multiplication, has already been implemented as a benchmark for Weave with decent performance: even after 3 years of progress on OpenBLAS and Intel MKL, it still achieves 80%~90% of their performance: https://github.com/mratsim/weave/tree/b6255af/benchmarks/matmul_gemm_blas
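For context, the naive triple loop below (plain Nim, illustrative only) is the starting point; the Weave benchmark linked above adds tiling, packing, vectorization and parallelism on top of this to reach those numbers:

    proc naiveGemm(a, b: seq[float64], m, k, n: int): seq[float64] =
      ## C[m, n] = A[m, k] * B[k, n], row-major, no tiling, no SIMD, no threads.
      result = newSeq[float64](m * n)
      for i in 0 ..< m:
        for l in 0 ..< k:
          let aik = a[i*k + l]
          for j in 0 ..< n:
            result[i*n + j] += aik * b[l*n + j]

Even the loop order matters: keeping j innermost makes the B and C accesses contiguous, the first of many memory-layout tricks a production BLAS applies.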
Lastly, on macOS, which has unified memory, it's costless to use tensor cores for matrix multiplication.
Currently, as we use Apple Accelerate, we're covered; but if we use our own BLAS like Weave's, we'll have to go through Apple AMX instructions: https://github.com/corsix/amx
from arraymancer.
The need for untyped tensors
When I started Arraymancer, I was at first really excited about the possibility of encoding all size information in the type system, to have a robust library where every mismatch is a compile-time error.
I quickly backpedaled due to ergonomic issues, see this May 2017 discussion here (time flies) andreaferretti/linear-algebra#5 (comment)
The main reason is that it prevents me from creating a sequence of closures: Matrix[2, 3] -> Matrix[3, 3] and Matrix[3, 3] -> Matrix[3,3]
And there are other problems that surfaced later which confirmed that not putting dimensions in the type system was better ergonomics-wise, for example squeeze, which deletes singleton dimensions from a tensor.
But even putting just the element type in the tensor type is also problematic ergonomically:
- serialization/deserialization needs the type as input
- libraries that build upon tensors for arbitrary inputs, like dataframes, will need to handle that themselves (https://github.com/SciNim/Datamancer/blob/f35119080428919ad51658c4bea958fbd2d54d39/src/datamancer/column.nim#L9-L35):
    type
      ColKind* = enum
        colNone, colFloat, colInt, colBool, colString, colObject, colConstant, colGeneric
      Column* = ref object
        len*: int
        case kind*: ColKind
        of colFloat: fCol*: Tensor[float]
        of colInt: iCol*: Tensor[int]
        of colBool: bCol*: Tensor[bool]
        of colString: sCol*: Tensor[string]
        of colObject: oCol*: Tensor[Value]
        of colConstant: cCol*: Value
        of colGeneric: discard # added when overwriting `Column` type
          # gKiloGram: Tensor[KiloGram]
          # gKiloGram²: Tensor[KiloGram²]
          #
          # or
          # case gKind: colGenericKind
          # of kKiloGram: gKiloGram: Tensor[KiloGram]
          # of kKiloGram²: gKiloGram²: Tensor[KiloGram²]
        of colNone: discard

      ColumnLike* = concept x
        x.len is int
        x.kind is ColKind

      BuiltInTypes* = float | char | int | bool | string | Value
      SupportedTypes* = SomeNumber | BuiltInTypes
The scientific computing world doesn't work with static types in serialization: when deserializing CSV, JSON, ONNX or TFRecords, you are expected to read the format from the file.
Similarly Arraymancer NN layers/models would need to be completely type-erased for proper loading.
See also https://forum.nim-lang.org/t/10223#67808
Now, it's probable that libraries depend on Arraymancer's support for static T, for example for dual numbers.
So a way forward to allow ergonomic serialization/deserialization (#163, #417) from common scientific formats would be starting a new library, with a restricted set of types in mind (all types used in Numpy / PyTorch). The internal Arraymancer typed primitives can be reused, however.
from arraymancer.
Reserved
from arraymancer.
Neural network backend - 1. Nvidia Cuda
Cuda support has been broken since Nim v1.2:
The culprit is: nim-lang/Nim#16936
There are 3 ways to fix this:
- Create a nim.cfg config for NVCC, either in Nim (https://github.com/nim-lang/Nim/blob/devel/config/nim.cfg) or in Arraymancer. The main issue is that it switches the compiler to nvcc for everything. Then --passC flags need special handling: they all need to be prefixed by -Xcompiler -myflag, so it becomes hard to compose with other libraries.
Note: we can directly link to the dll/.so without needing the header files, from here: Line 14 in f809e7f
- Using a JIT like LLVM, the scaffolding code is available in Constantine:
- codegen: https://github.com/mratsim/constantine/blob/077cfb2/constantine/platforms/code_generator/nvidia.nim#L158-L270
- Hello World: https://github.com/mratsim/constantine/blob/077cfb2/tests/gpu/hello_world_nvidia.nim#L147-L203
but then you need a compiler that transforms Nim to LLVM IR, like nlvm.
- Using NVRTC (the Nvidia Runtime Compilation library). NVRTC is something I actually missed when implementing Arraymancer; it was introduced in Cuda 7 in 2015 and I started Arraymancer with Cuda 7, so there were no tutorials yet. NVRTC allows on-the-fly compilation of Cuda kernels, similar to how OpenCL works today in Arraymancer. The code can be provided as in-memory .cu files, i.e. we keep all the current interpolated C++ but remove the emit and nvcc.
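A rough sketch of the NVRTC route (hand-rolled bindings; the signatures follow NVRTC's documented C API, but the library name and error handling are simplified and untested here):

    const nvrtcLib = "libnvrtc.so"   # adjust per platform

    type
      NvrtcProgram = distinct pointer
      NvrtcResult = cint             # 0 == NVRTC_SUCCESS

    proc nvrtcCreateProgram(prog: ptr NvrtcProgram, src, name: cstring,
                            numHeaders: cint,
                            headers, includeNames: cstringArray): NvrtcResult
      {.importc, cdecl, dynlib: nvrtcLib.}
    proc nvrtcCompileProgram(prog: NvrtcProgram, numOptions: cint,
                             options: cstringArray): NvrtcResult
      {.importc, cdecl, dynlib: nvrtcLib.}
    proc nvrtcGetPTXSize(prog: NvrtcProgram, size: ptr csize_t): NvrtcResult
      {.importc, cdecl, dynlib: nvrtcLib.}
    proc nvrtcGetPTX(prog: NvrtcProgram, ptx: cstring): NvrtcResult
      {.importc, cdecl, dynlib: nvrtcLib.}

    # The kernel stays an in-memory .cu string, exactly like the current
    # interpolated C++, but compiled at runtime instead of via emit + nvcc.
    const kernelSrc = """
    extern "C" __global__ void axpy(float a, float* x, float* y, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] += a * x[i];
    }
    """

    var prog: NvrtcProgram
    doAssert nvrtcCreateProgram(prog.addr, kernelSrc, "axpy.cu", 0, nil, nil) == 0
    doAssert nvrtcCompileProgram(prog, 0, nil) == 0
    var ptxSize: csize_t
    doAssert nvrtcGetPTXSize(prog, ptxSize.addr) == 0
    var ptx = newString(ptxSize.int)
    doAssert nvrtcGetPTX(prog, ptx.cstring) == 0
    # ptx can now be loaded with the CUDA driver API (cuModuleLoadData) and launched.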
from arraymancer.
Neural Network backend - 2. Implementation woes: CPU forward, CPU backward, GPU forward, GPU backward, all optimized
Implementing Neural Network layers is extremely time-consuming; you need to:
- Implement the forward pass
- Make it fast
- Implement the backward pass
- Make it fast
- Rinse and repeat on every accelerator/GPU backend
This requires several kinds of expertise in a team: linear algebra and high-performance computing, and both require a significant investment of time.
The only way to compete is by creating a compiler: only the forward pass would need to be implemented, and the rest is automatically derived.
The idea is very well captured in https://people.csail.mit.edu/tzumao/gradient_halide/ and has been discussed extensively in #347 .
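As a toy scalar analogy of "write only the forward pass, the derivative comes for free" (forward-mode dual numbers; the gradient-Halide / Lux approach is reverse-mode over whole tensor programs):

    type Dual = object
      val, der: float64

    proc dual(x: float64, d = 0.0): Dual = Dual(val: x, der: d)
    proc `+`(a, b: Dual): Dual = Dual(val: a.val + b.val, der: a.der + b.der)
    proc `*`(a, b: Dual): Dual =
      # product rule applied automatically wherever `*` appears in the forward pass
      Dual(val: a.val * b.val, der: a.der * b.val + a.val * b.der)

    # f(x) = x*x + x, differentiated "for free": f'(3) = 2*3 + 1 = 7
    let x = dual(3.0, 1.0)   # seed derivative dx/dx = 1
    echo (x*x + x).der       # 7.0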
This led to starting experiments on a deep learning compiler called Lux, in Laser:
- https://github.com/mratsim/laser/tree/master/laser/lux_compiler
- https://github.com/mratsim/laser/blob/master/laser/lux_compiler/core/challenges.md
- https://github.com/mratsim/laser/blob/master/laser/lux_compiler/core/low_level_design_considerations.md
Then I identified that we need a multithreading runtime to power this compiler, which led to https://github.com/mratsim/weave/
Then I got pulled by other projects.
Fortunately, there is another project with the same idea: https://github.com/can-lehmann/exprgrad that went as far as generating OpenCL code.
from arraymancer.
Neural network backend - 3. Ergonomic serialization and deserialization of models
This is a very important problem at the moment:
References:
The issue is that Model / Layers are statically typed, which means you need the end-user to properly type everything.
This is a significant ergonomic limitation when other frameworks can just load a .onnx file and be done.
It also prevents Arraymancer from being used as a library, as you would need recompilation.
The solution is to type-erase all layers and models, for example via inherited ref objects.
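A rough sketch of what that could look like (illustrative names only, not Arraymancer's actual layer API):

    import arraymancer

    type
      AnyLayer = ref object of RootObj        # type-erased base
      DenseLayer = ref object of AnyLayer
        weight, bias: Tensor[float32]
      ReluLayer = ref object of AnyLayer

    method forward(l: AnyLayer, x: Tensor[float32]): Tensor[float32] {.base.} =
      doAssert false, "forward not implemented"

    method forward(l: DenseLayer, x: Tensor[float32]): Tensor[float32] =
      x * l.weight +. l.bias                   # matmul + broadcasted bias

    method forward(l: ReluLayer, x: Tensor[float32]): Tensor[float32] =
      x.relu

    type Model = object
      layers: seq[AnyLayer]                    # heterogeneous, could be filled from .onnx

    proc forward(m: Model, x: Tensor[float32]): Tensor[float32] =
      result = x
      for l in m.layers:
        result = l.forward(result)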
from arraymancer.
Neural network backend - 4. Slowness of reduction operations like sigmoid or softmax the more cores you have
Another key cause of scaling problems. The issue is that reduction operations are ubiquitous in deep learning, since any loss function is a reduction.
See
Weave and Constantine's threadpool take this slowness into account, see the benches:
- https://github.com/mratsim/weave/tree/b6255af/benchmarks/logsumexp
- https://github.com/mratsim/constantine/tree/077cfb2/benchmarks-threadpool/logsumexp
In summary, we need to allow nested parallelism and use loop tiling / loop blocking as well.
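A minimal sketch of the loop-blocking half of that (plain Nim, no threading shown): each block produces an independent partial sum, which is what lets a runtime like Weave map blocks to tasks and combine them in a tree instead of serializing on a single accumulator:

    proc blockedSum(x: openArray[float64], blockSize = 1024): float64 =
      ## Tiled reduction: per-block partials first, combined at the end.
      var partials: seq[float64]
      var i = 0
      while i < x.len:
        var acc = 0.0
        for j in i ..< min(i + blockSize, x.len):
          acc += x[j]
        partials.add acc          # each block is independent -> parallelizable
        i += blockSize
      for p in partials:          # final combine (could itself be a tree)
        result += p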
from arraymancer.
Reserved
from arraymancer.
Summary
In summary, here is how I see making further progress on Arraymancer, tensor libraries, and deep learning in Nim:
- Create a compiler for tensor arithmetic which supports CPU, OpenCL, Nvidia backends.
- Optimization at the compiler level is likely out of reach and would require a polyhedral optimizer which is why Lux (or Halide) exposed low-level optimization details (tiling, parallelism, intermediate storage).
- Note, for the autogenerated backpropagation, we will need an optimizer
- Use Weave or Constantine's threadpool instead of OpenMP for the CPU parallel runtime.
- Implement a pure Nim BLAS & Lapack replacement. This may or may not use the compiler.
- Port Arraymancer primitives to that compiler, threadpool and BLAS/Lapack.
- Create a new library with type-erased tensors and (de)serializable models that will be focused on scientific computing interop (.csv, .onnx, .tfrecords) and deep-learning
- Deprecate the deep learning part of Arraymancer, keeping only Numpy/Scipy/Tensor, and redirect deep learning needs to that new library.
from arraymancer.
@mratsim Appreciate all the work you've done for the community man, not just for this library. Some time in Q2 I'll be able to return to a hobby project and will look into improving Arraymancer.
I wonder if we can trim the need for various backends by leveraging WebGPU. It has first-class support for compute shaders, it's nearly as fast as Vulkan, portable to EVERY device imaginable, and can be run natively (it isn't just for browsers). I think that reduces a lot of the workload. For the CPU backend, 100% we'll want to use something like Weave and a pure Nim BLAS implementation (how hard can that be? :) )
from arraymancer.
For the tensor part of the library I also think that there are a few important features that are still missing. For example, we don’t support most of the tensor manipulation features (e.g. add, insert, roll…) and also some important signal processing algorithms (FFT, filtering and convolution).
from arraymancer.
For the tensor part of the library I also think that there are a few important features that are still missing. For example, we don’t support most of the tensor manipulation features (e.g. add, insert, roll…) and also some important signal processing algorithms (FFT, filtering and convolution).
Is it possible to use FFTW as a backend for the fft implementation? I want to contribute to this project but I am not sure if the DFT algorithm needs to be written from scratch or if an existing library can be used.
from arraymancer.
Is it possible to use FFTW as a backend for the fft implementation? I want to contribute to this project but I am not sure if the DFT algorithm needs to be written from scratch or if an existing library can be used.
FFTW cannot be distributed with Arraymancer due to its GPL license, but it's available as a separate package: https://github.com/SciNim/nimfftw3
What can be distributed with Arraymancer is https://github.com/scinim/impulse
also some important signal processing algorithms (FFT, filtering and convolution).
@AngelEzquerra, see https://github.com/scinim/impulse for a low-level implementation.
from arraymancer.
I didn't know about impulse. I can have a look, but what about @arnetheduck's nim-fftr (https://github.com/arnetheduck/nim-fftr)? It's MIT licensed and according to his benchmarks it's pretty fast. I've played a bit with it (and even made a couple of PRs to it) and it is pretty easy to use...
from arraymancer.
The gap is large at the moment. And for large FFTs you really need multithreading.
fftr
1.423 ms 1.461 ms ±0.017 x1000 pow2 - 65536
244.258 ms 246.235 ms ±1.584 x21 prime - 746497
47.279 ms 47.748 ms ±0.308 x105 prime-power - 160801
1.915 ms 1.959 ms ±0.023 x1000 mult-of-power-of-2 - 20736
4.171 ms 4.251 ms ±0.037 x1000 small-comp-large-prime -
11.567 ms 11.741 ms ±0.067 x426 small-comp - 100000
fftw
0.285 ms 0.471 ms ±0.348 x1000 pow2 - 65536
55.563 ms 63.118 ms ±8.796 x79 prime - 746497
5.052 ms 5.345 ms ±0.471 x904 prime-power - 160801
0.114 ms 0.137 ms ±0.033 x1000 mult-of-power-of-2 - 20736
0.711 ms 0.754 ms ±0.025 x1000 small-comp-large-prime - 30270
0.529 ms 0.560 ms ±0.033 x1000 small-comp - 100000
from arraymancer.
For prime powers, fftr uses Bluestein's algorithm, which is quite inefficient, i.e. it's an algorithm implementation away from being on par with fftw (if anything, https://github.com/ejmahler/RustFFT is one of the fastest ones, beating fftw too) - re multithreading, that's something you'd set up outside of the core fft algorithm I suspect
from arraymancer.
re multithreading, that's something you'd set up outside of the core fft algorithm I suspect
For load-balancing, recursive divide-and-conquer algorithms are best because they ensure that there are plenty of tasks, so no CPU is starved. This is very easy with Cooley-Tukey as function calls: https://github.com/mratsim/constantine/blob/bc5faaaef8e270b6e4913a704e1132f96bfe7349/research/kzg/fft_fr.nim#L91-L150
i.e. you replace all fft_internal calls with spawn fft_internal:
    func simpleFT[F](
           output: var View[F],
           vals: View[F],
           rootsOfUnity: View[F]
         ) =
      # FFT is a recursive algorithm
      # This is the base-case using a O(n²) algorithm
      let L = output.len
      var last {.noInit.}, v {.noInit.}: F

      for i in 0 ..< L:
        last.prod(vals[0], rootsOfUnity[0])
        for j in 1 ..< L:
          v.prod(vals[j], rootsOfUnity[(i*j) mod L])
          last += v
        output[i] = last

    func fft_internal[F](
           output: var View[F],
           vals: View[F],
           rootsOfUnity: View[F]
         ) =
      if output.len <= 4:
        simpleFT(output, vals, rootsOfUnity)
        return

      # Recursive Divide-and-Conquer
      let (evenVals, oddVals) = vals.splitAlternate()
      var (outLeft, outRight) = output.splitMiddle()
      let halfROI = rootsOfUnity.skipHalf()

      fft_internal(outLeft, evenVals, halfROI)
      fft_internal(outRight, oddVals, halfROI)

      let half = outLeft.len
      var y_times_root{.noinit.}: F

      for i in 0 ..< half:
        # FFT Butterfly
        y_times_root   .prod(output[i+half], rootsOfUnity[i])
        output[i+half] .diff(output[i], y_times_root)
        output[i]      += y_times_root

    func fft*[F](
           desc: FFTDescriptor[F],
           output: var openarray[F],
           vals: openarray[F]): FFT_Status =
      if vals.len > desc.maxWidth:
        return FFTS_TooManyValues
      if not vals.len.uint64.isPowerOf2_vartime():
        return FFTS_SizeNotPowerOfTwo

      let rootz = desc.expandedRootsOfUnity
                      .toView()
                      .slice(0, desc.maxWidth-1, desc.maxWidth div vals.len)

      var voutput = output.toView()
      fft_internal(voutput, vals.toView(), rootz)
      return FFTS_Success
If using the for-loop approach, like the second pseudocode on Wikipedia:
    algorithm iterative-fft is
        input: Array a of n complex values where n is a power of 2.
        output: Array A the DFT of a.

        bit-reverse-copy(a, A)
        n ← a.length
        for s = 1 to log(n) do
            m ← 2^s
            ω_m ← exp(−2πi/m)
            for k = 0 to n-1 by m do
                ω ← 1
                for j = 0 to m/2 – 1 do
                    t ← ω A[k + j + m/2]
                    u ← A[k + j]
                    A[k + j] ← u + t
                    A[k + j + m/2] ← u – t
                    ω ← ω ω_m
        return A
We can use parallel-for loops. But an interface/concept for either spawning tasks or parallelizing loops is needed, and it then needs to be used for effective parallelization of the FFT.
from arraymancer.
The gap is large at the moment. And for large FFTs you really need multithreading.
fftr
1.423 ms 1.461 ms ±0.017 x1000 pow2 - 65536
244.258 ms 246.235 ms ±1.584 x21 prime - 746497
47.279 ms 47.748 ms ±0.308 x105 prime-power - 160801
1.915 ms 1.959 ms ±0.023 x1000 mult-of-power-of-2 - 20736
4.171 ms 4.251 ms ±0.037 x1000 small-comp-large-prime -
11.567 ms 11.741 ms ±0.067 x426 small-comp - 100000
fftw
0.285 ms 0.471 ms ±0.348 x1000 pow2 - 65536
55.563 ms 63.118 ms ±8.796 x79 prime - 746497
5.052 ms 5.345 ms ±0.471 x904 prime-power - 160801
0.114 ms 0.137 ms ±0.033 x1000 mult-of-power-of-2 - 20736
0.711 ms 0.754 ms ±0.025 x1000 small-comp-large-prime - 30270
0.529 ms 0.560 ms ±0.033 x1000 small-comp - 100000
Do you have similar measurements for impulse?
from arraymancer.
Not at the moment. It would be interesting to make some. impulse at the moment uses the same backend as numpy/scipy.
from arraymancer.
@mratsim, I tried to install impulse via nimble but it failed. Assuming we used it to implement the FFT in arraymancer, would you expect to embed it in arraymancer or to install it as a dependency? If the latter, I assume it would require impulse to be a proper nimble package, right?
from arraymancer.