
Comments (25)

mratsim commented on May 23, 2024

Tensor backend - 1. Mutable operations on slices

References:

Currently, slicing returns immutable tensors, so you need an extra var assignment to mutate the original tensor, which is quite burdensome.

I don't remember what the errors were; IIRC the tests gave wrong results and it was painful to debug.
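
A minimal sketch of the ergonomics, assuming current Arraymancer slicing semantics (zeros/ones/apply_inline are real Arraymancer procs; the exact copy-vs-view behavior of the slice is part of what this issue is about):

import arraymancer

var t = zeros[float](3, 3)

# Assigning a tensor into a slice works directly:
t[1, _] = ones[float](1, 3)

# But applying a mutating operation to a slice first requires binding it to a
# `var`, because the slice expression itself is returned as an immutable value:
var row = t[2, _]
row.apply_inline(x + 1.0)   # the extra `var` step is the burden described above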


mratsim commented on May 23, 2024

Tensor backend - 2. Nested parallelism

We currently cannot benefit from nested parallelism in Arraymancer: a parallel function calling another parallel function causes oversubscription due to OpenMP limitations. For example, the Frobenius inner product:

proc frobenius_inner_prod*[T](a, b: Tensor[T]): T =
  sum:
    map2_inline(a, b):
      x * y
This might also be one source of the woes described in "Slowness of reduction operations like sigmoid or softmax the more cores you have" below.

The solution is switching to Weave (or Constantine's threadpool) as the backend.
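
For reference, this is the kind of nested/recursive task parallelism Weave supports out of the box (adapted from the Weave README; a sketch unrelated to Arraymancer's code base):

import weave

proc fib(n: int): int =
  # Each recursive call may itself spawn tasks: nested parallelism is handled
  # by the work-stealing runtime instead of oversubscribing threads like OpenMP.
  if n < 2:
    return n
  let x = spawn fib(n - 1)
  let y = fib(n - 2)
  result = sync(x) + y

proc main() =
  init(Weave)
  echo fib(30)
  exit(Weave)

main()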

Related:


mratsim commented on May 23, 2024

Tensor backend - 3. Doc generation

Doc generation is a pain point; see the wishlist #488 (comment) and haxscramper/haxdoc#1.

Though having the docs auto-generated in CI is awesome: #556


mratsim commented on May 23, 2024

Tensor backend - 4. Versioning / releases

Reserved


mratsim commented on May 23, 2024

Tensor backend - 5. Slow transcendental functions (exp, log, sin, cos, tan, tanh, ...)

Background: #265 (comment)

While working on Laser I've looked into many areas and found huge slowness in the exp and log functions from the C standard library <math.h>, which notably bottlenecks sigmoid and softmax.

Benchmark.

On my machine with all optimisations, <math.h> can achieve about 160 million exponentials per second per thread, while the fastest implementations achieve about 1.3 billion (roughly 10x faster).
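
For context, a minimal single-threaded sketch of how such a throughput number can be measured with Nim's stdlib (the 1e-6 scaling and iteration count are arbitrary choices for illustration):

import std/[math, times]

proc expThroughput(n: int): float =
  ## Returns millions of exponentials per second for <math.h>'s `exp`.
  var acc = 0.0
  let start = cpuTime()
  for i in 0 ..< n:
    acc += exp(i.float64 * 1e-6)
  let elapsed = cpuTime() - start
  echo "checksum: ", acc          # keep the compiler from eliminating the loop
  result = n.float64 / elapsed / 1e6

echo expThroughput(10_000_000), " Mexp/s"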

PyTorch is not using the fastest implementation, but theirs can still achieve 900 million exponentials per second: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/avx_mathfun.h

I.e. I can't trust anyone :/.

There is a tradeoff between accuracy, which is very much needed for some applications (e.g. weather forecasting, where errors compound exponentially), and speed for others, typically deep learning, where quantizing to 4-bit is actually fine.

See 8-way accuracy/speed comparison for AVX2 impl of exponentiation: https://github.com/mratsim/laser/blob/master/benchmarks/vector_math/bench_exp_avx2.nim#L282-L470

Either this, nested parallelism, or both are a significant bottleneck for the Shakespeare char-rnn demo.


mratsim commented on May 23, 2024

Tensor backend - 6, 7, 8 OS-specific woes

On Windows, BLAS and Lapack are a pain to deploy. Even in CI I did not figure out how to install Lapack:

# TODO: install openblas or lapack
# - os: windows
# cpu: amd64
# - os: windows
# cpu: i386
#512 #465 #422

See similar woes here: https://forum.nim-lang.org/t/10812

On MacOS, Apple Clang doesn't support OpenMP and requires installing LLVM via Homebrew.

The solution to both would be to implement a pure Nim BLAS, backed by a pure Nim multithreading runtime like Weave.
The hardest primitive, matrix multiplication, has already been implemented as a benchmark for Weave with decent performance: even after 3 years of progress on OpenBLAS and Intel MKL, it still achieves 80%~90% of their perf: https://github.com/mratsim/weave/tree/b6255af/benchmarks/matmul_gemm_blas
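
For illustration, the starting point of such a pure-Nim BLAS is a plain GEMM kernel like the sketch below; the Weave benchmark linked above adds packing, cache/register tiling, SIMD and parallelism on top of this idea:

proc gemmNaive*(m, n, k: int, a, b: seq[float32], c: var seq[float32]) =
  ## C[m×n] += A[m×k] * B[k×n], all row-major.
  assert a.len == m*k and b.len == k*n and c.len == m*n
  for i in 0 ..< m:
    for l in 0 ..< k:
      let aVal = a[i*k + l]
      for j in 0 ..< n:
        c[i*n + j] += aVal * b[l*n + j]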

Lastly, on macOS, which has unified memory, it's costless to use the tensor cores for matrix multiplication.
Currently we're covered since we use Apple Accelerate, but if we use our own BLAS like Weave's, we'll have to go through the Apple AMX instructions: https://github.com/corsix/amx


mratsim commented on May 23, 2024

The need for untyped tensors

When I started Arraymancer, I was at first really excited about the possibility of encoding all size information in the type system, to have a robust library where every mismatch is a compile-time error.

I quickly backpedaled due to ergonomic issues; see this May 2017 discussion (time flies): andreaferretti/linear-algebra#5 (comment)

The main reason is that it prevents creating a sequence of closures mixing, say, Matrix[2, 3] -> Matrix[3, 3] and Matrix[3, 3] -> Matrix[3, 3]: each signature is a distinct static type.
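
A tiny illustration of the problem (this Matrix type is hypothetical, only to show why the closures cannot share a seq):

type
  Matrix[M, N: static int] = object
    data: array[M * N, float]

# A pipeline of layers needs a single element type for its seq...
var pipeline: seq[proc (x: Matrix[3, 3]): Matrix[3, 3]]

# ...but a first stage of type `proc (x: Matrix[2, 3]): Matrix[3, 3]`
# is a different type and cannot be stored in the same seq:
# all dimensions are fixed at compile time.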

And other problems surfaced later which confirmed that not putting dimensions in the type system was the better choice ergonomics-wise, for example squeeze, which deletes singleton dimensions from a tensor: its output rank depends on the runtime shape.

But even keeping just the element type in the tensor type is problematic ergonomically:

  • serialization/deserialization needs the type as input
  • libraries that build upon tensors for arbitrary inputs, like dataframes, will need to handle that themselves (https://github.com/SciNim/Datamancer/blob/f35119080428919ad51658c4bea958fbd2d54d39/src/datamancer/column.nim#L9-L35)
    type
      ColKind* = enum
        colNone, colFloat, colInt, colBool, colString, colObject, colConstant, colGeneric
      Column* = ref object
        len*: int
        case kind*: ColKind
        of colFloat: fCol*: Tensor[float]
        of colInt: iCol*: Tensor[int]
        of colBool: bCol*: Tensor[bool]
        of colString: sCol*: Tensor[string]
        of colObject: oCol*: Tensor[Value]
        of colConstant: cCol*: Value
        of colGeneric: discard # added when overwriting `Column` type
        #  gKiloGram: Tensor[KiloGram]
        #  gKiloGram²: Tensor[KiloGram²]
        #  # or
        #  case gKind: colGenericKind
        #  of kKiloGram: gKiloGram: Tensor[KiloGram]
        #  of kKiloGram²: gKiloGram²: Tensor[KiloGram²]
        of colNone: discard
    
      ColumnLike* = concept x
        x.len is int
        x.kind is ColKind
    
      BuiltInTypes* = float | char | int | bool | string | Value
      SupportedTypes* = SomeNumber | BuiltInTypes

The scientific computing world doesn't work with static types for serialization: when deserializing CSV, JSON, ONNX or TFRecords, you are expected to read the format from the file.
Similarly, Arraymancer NN layers/models would need to be completely type-erased for proper loading.
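
A sketch of what such a type-erased tensor could look like, mirroring the Datamancer Column approach above (the names here are hypothetical):

import arraymancer

type
  DTypeKind = enum dkFloat32, dkFloat64, dkInt64, dkBool
  AnyTensor = object
    case kind: DTypeKind                 # element type is a runtime tag...
    of dkFloat32: f32: Tensor[float32]   # ...so it can be read from a file header
    of dkFloat64: f64: Tensor[float64]
    of dkInt64:   i64: Tensor[int64]
    of dkBool:    b:   Tensor[bool]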

See also https://forum.nim-lang.org/t/10223#67808

Now, it's probable that some libraries depend on Arraymancer's support for static T, for example for dual numbers.

So a way forward to allow ergonomic serialization/deserialization (#163, #417) from common scientific formats would be to start a new library, with a restricted set of types in mind (all the types used in Numpy / PyTorch). The internal Arraymancer typed primitives can be reused, however.


mratsim commented on May 23, 2024

Reserved


mratsim commented on May 23, 2024

Neural network backend - 1. Nvidia Cuda

Cuda support has been broken since Nim v1.2:

The culprit is: nim-lang/Nim#16936

There are 3 ways to fix this:

  1. Create a nim.cfg config for NVCC, either in Nim (https://github.com/nim-lang/Nim/blob/devel/config/nim.cfg) or in Arraymancer. The main issue is that it switches the compiler to nvcc for everything. Then --passC flags need special handling: they all need to be prefixed by -Xcompiler -myflag, so it becomes hard to compose with other libraries.
    Note: we can directly link to the dll/.so without needing the header files from here
    cincludes:"/opt/cuda/include"
  2. Using a JIT like LLVM; the scaffolding code is available in Constantine:
  3. Using NVRTC (the Nvidia Runtime Compilation library). NVRTC is something I actually missed when implementing Arraymancer; it was introduced in Cuda 7 in 2015, and I started Arraymancer with Cuda 7, so there were no tutorials. NVRTC allows on-the-fly compilation of Cuda kernels, similar to how OpenCL works today in Arraymancer. The code can be provided as in-memory .cu files, i.e. we keep all the current interpolated C++ but remove the emit and nvcc (see the sketch after this list).
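
To make option 3 concrete, here is a rough sketch of the NVRTC flow with hand-written bindings. The declarations follow the real NVRTC C API, but error handling and the follow-up cuModuleLoadData/cuLaunchKernel steps are omitted; treat this as an assumption-laden sketch, not working Arraymancer code:

type
  NvrtcProgram = distinct pointer
  NvrtcResult = cint

proc nvrtcCreateProgram(prog: ptr NvrtcProgram, src, name: cstring, numHeaders: cint,
                        headers, includeNames: ptr cstring): NvrtcResult {.importc, header: "nvrtc.h".}
proc nvrtcCompileProgram(prog: NvrtcProgram, numOptions: cint,
                         options: ptr cstring): NvrtcResult {.importc, header: "nvrtc.h".}
proc nvrtcGetPTXSize(prog: NvrtcProgram, size: ptr csize_t): NvrtcResult {.importc, header: "nvrtc.h".}
proc nvrtcGetPTX(prog: NvrtcProgram, ptx: cstring): NvrtcResult {.importc, header: "nvrtc.h".}

const kernelSrc = """
extern "C" __global__ void axpy(float a, const float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}
"""

proc compileToPtx(src: cstring): string =
  # Compile an in-memory .cu source to PTX at runtime, no nvcc involved.
  # (Linking needs e.g. --passL:"-lnvrtc"; launching the PTX via the CUDA
  # driver API is omitted here.)
  var prog: NvrtcProgram
  discard nvrtcCreateProgram(prog.addr, src, "axpy.cu", 0, nil, nil)
  discard nvrtcCompileProgram(prog, 0, nil)
  var size: csize_t
  discard nvrtcGetPTXSize(prog, size.addr)
  result = newString(size.int)
  discard nvrtcGetPTX(prog, result.cstring)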


mratsim commented on May 23, 2024

Neural Network backend - 2. Implementation woes: CPU forward, CPU backward, GPU forward, GPU backward, all optimized

Implementing Neural Network layers is extremely time-consuming; you need to:

  1. Implement the forward pass
  2. Make it fast
  3. Implement the backward pass
  4. Make it fast
  5. Rinse and repeat on every accelerator/GPU backend

This requires several kinds of expertise in a team: linear algebra and high-performance computing. Both require a significant investment of time.

The only way to compete is by creating a compiler: only the forward pass would need to be implemented, and the rest is automatically derived.
The idea is very well captured in https://people.csail.mit.edu/tzumao/gradient_halide/ and has been discussed extensively in #347.
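
As a toy illustration of "write only the forward pass, derive the rest", here is forward-mode autodiff with dual numbers; an actual deep learning compiler (Lux, exprgrad) would instead do reverse-mode source-to-source differentiation as in the gradient Halide paper, but the principle of mechanically deriving the derivative code is the same:

type Dual = object
  val, der: float64

proc dual(x: float64, d = 0.0): Dual = Dual(val: x, der: d)

proc `+`(a, b: Dual): Dual = Dual(val: a.val + b.val, der: a.der + b.der)
proc `*`(a, b: Dual): Dual =
  Dual(val: a.val * b.val, der: a.der * b.val + a.val * b.der)

# Forward pass written once...
proc f(x: Dual): Dual = x * x * x + x

# ...derivative obtained without writing a backward pass:
let x = dual(2.0, 1.0)           # seed dx/dx = 1
echo f(x)                        # (val: 10.0, der: 13.0) since f'(x) = 3x² + 1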

This led to starting experiments on a deep learning compiler called Lux, in Laser:

Then I identified that we need a multithreading runtime to power this compiler, which led to https://github.com/mratsim/weave/

Then I got pulled by other projects.

Fortunately, there is another project with the same idea, https://github.com/can-lehmann/exprgrad, which went as far as generating OpenCL code.


mratsim commented on May 23, 2024

Neural network backend - 3. Ergonomic serialization and deserialization of models

This is a very important problem at the moment:

References:

The issue is that Models / Layers are statically typed, which means you need the end-user to properly type everything.
This is a significant ergonomic limitation when other frameworks can just load a .onnx file and be done.
It also prevents Arraymancer from being used as a library, as you would need recompilation.

The solution is to type-erase all layers and models, for example via inherited ref objects.
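
A minimal sketch of that direction, with hypothetical names: methods dispatch at runtime, so a model becomes a plain seq of type-erased layers that can be filled from a deserialized file:

import arraymancer

type
  Layer = ref object of RootObj
  DenseLayer = ref object of Layer
    weight, bias: Tensor[float32]

method forward(l: Layer, x: Tensor[float32]): Tensor[float32] {.base.} =
  raise newException(CatchableError, "forward not implemented")

method forward(l: DenseLayer, x: Tensor[float32]): Tensor[float32] =
  x * l.weight +. l.bias          # matmul, then broadcasted bias add

type Model = object
  layers: seq[Layer]              # heterogeneous layers behind one runtime type

proc forward(m: Model, x: Tensor[float32]): Tensor[float32] =
  result = x
  for l in m.layers:
    result = l.forward(result)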


mratsim commented on May 23, 2024

Neural network backend - 4. Slowness of reduction operations like sigmoid or softmax the more cores you have

Another key problem, which causes scaling issues: reduction operations are used everywhere in deep learning, since any loss function is a reduction.

See

Weave and Constantine's threadpool take this slowness into account; see the benches:

In summary, we need to allow nested parallelism and use loop tiling / loop blocking as well.
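
As a sketch of the loop-blocking part (the tile size is an arbitrary assumption): each tile's partial sum is what would become an independent task in Weave or Constantine's threadpool, instead of one long serial accumulation chain:

proc blockedSum(x: openArray[float32], tile = 4096): float64 =
  ## Reduce in independent tiles instead of a single serial accumulation.
  var partials: seq[float64]
  var i = 0
  while i < x.len:
    let hi = min(i + tile, x.len)
    var acc = 0.0
    for j in i ..< hi:            # this inner loop is what each task would run
      acc += x[j].float64
    partials.add acc
    i = hi
  for p in partials:              # combine the per-tile partial results
    result += p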


mratsim commented on May 23, 2024

Reserved


mratsim commented on May 23, 2024

Summary

In summary, here is how I see making further progress on Arraymancer, tensor libraries and deep learning in Nim.

  1. Create a compiler for tensor arithmetic which supports CPU, OpenCL, Nvidia backends.
  2. Use Weave or Constantine's threadpool instead of OpenMP for the CPU parallel runtime.
  3. Implement a pure Nim BLAS & Lapack replacement. This may or may not use the compiler.
  4. Port the Arraymancer primitives to that compiler, threadpool and BLAS/Lapack.
  5. Create a new library with type-erased tensors and (de)serializable models, focused on scientific computing interop (.csv, .onnx, .tfrecords) and deep learning.
  6. Deprecate the deep learning part of Arraymancer, keeping only Numpy/Scipy/Tensor, and redirect deep learning needs to that new library.


Niminem commented on May 23, 2024

@mratsim Appreciate all the work you've done for the community man, not just for this library. Some time in Q2 I'll be able to return to a hobby project and will look into improving Arraymancer.

I wonder if we can trim the need for various backends by leveraging WebGPU. It has first-class support for compute shaders, it's nearly as fast as Vulkan, it's portable to EVERY device imaginable, and it can be run natively (it isn't just for the browser). I think that reduces a lot of the workload. For the CPU backend, we'll 100% want to use something like Weave and a pure Nim BLAS implementation (how hard can that be? :) )


AngelEzquerra commented on May 23, 2024

For the tensor part of the library I also think that there are a few important features that are still missing. For example, we don’t support most of the tensor manipulation features (e.g. add, insert, roll…) and also some important signal processing algorithms (FFT, filtering and convolution).


hlclemson commented on May 23, 2024

For the tensor part of the library I also think that there are a few important features that are still missing. For example, we don’t support most of the tensor manipulation features (e.g. add, insert, roll…) and also some important signal processing algorithms (FFT, filtering and convolution).

Is it possible to use FFTW as a backend for the FFT implementation? I want to contribute to this project but I am not sure if the DFT algorithm needs to be written from scratch or whether an existing library can be used.


mratsim commented on May 23, 2024

Is it possible to use FFTW as a backend for the FFT implementation? I want to contribute to this project but I am not sure if the DFT algorithm needs to be written from scratch or whether an existing library can be used.

FFTW cannot be distributed with Arraymancer due to its GPL license, but it's available as a separate package: https://github.com/SciNim/nimfftw3

What can be distributed with Arraymancer is https://github.com/scinim/impulse

also some important signal processing algorithms (FFT, filtering and convolution).

@AngelEzquerra, see https://github.com/scinim/impulse for a low-level implementation.


AngelEzquerra commented on May 23, 2024

I didn't know about impulse. I can have a look, but what about @arnetheduck's nim-fftr (https://github.com/arnetheduck/nim-fftr)? It's MIT licensed and according to his benchmarks it's pretty fast. I've played a bit with it (and even made a couple of PRs to it) and it is pretty easy to use...


mratsim commented on May 23, 2024

The gap is large at the moment. And for large FFTs you really need multithreading.

fftr

   1.423 ms    1.461 ms  ±0.017  x1000 pow2 - 65536
 244.258 ms  246.235 ms  ±1.584    x21 prime - 746497
  47.279 ms   47.748 ms  ±0.308   x105 prime-power - 160801
   1.915 ms    1.959 ms  ±0.023  x1000 mult-of-power-of-2 - 20736
   4.171 ms    4.251 ms  ±0.037  x1000 small-comp-large-prime -
  11.567 ms   11.741 ms  ±0.067   x426 small-comp - 100000

fftw

   0.285 ms    0.471 ms  ±0.348  x1000 pow2 - 65536
  55.563 ms   63.118 ms  ±8.796    x79 prime - 746497
   5.052 ms    5.345 ms  ±0.471   x904 prime-power - 160801
   0.114 ms    0.137 ms  ±0.033  x1000 mult-of-power-of-2 - 20736
   0.711 ms    0.754 ms  ±0.025  x1000 small-comp-large-prime - 30270
   0.529 ms    0.560 ms  ±0.033  x1000 small-comp - 100000


arnetheduck commented on May 23, 2024

For prime powers, fftr uses Bluestein, which is quite inefficient, i.e. it's an algorithm implementation away from being on par with fftw (if anything, https://github.com/ejmahler/RustFFT is one of the fastest ones, beating fftw too). Re multithreading, that's something you'd set up outside of the core FFT algorithm, I suspect.


mratsim commented on May 23, 2024

re multithreading, that's something you'd set up outside of the core fft algorithm I suspect

For load-balancing, recursive divide-and-conquer algorithms are best because they ensure that there are plenty of tasks, so no CPU is starved. This is very easy with Cooley-Tukey as function calls: https://github.com/mratsim/constantine/blob/bc5faaaef8e270b6e4913a704e1132f96bfe7349/research/kzg/fft_fr.nim#L91-L150

i.e. you replace all fft_internal calls with spawn fft_internal

func simpleFT[F](
       output: var View[F],
       vals: View[F],
       rootsOfUnity: View[F]
     ) =
  # FFT is a recursive algorithm
  # This is the base-case using a O(n²) algorithm

  let L = output.len
  var last {.noInit.}, v {.noInit.}: F

  for i in 0 ..< L:
    last.prod(vals[0], rootsOfUnity[0])
    for j in 1 ..< L:
      v.prod(vals[j], rootsOfUnity[(i*j) mod L])
      last += v
    output[i] = last

func fft_internal[F](
       output: var View[F],
       vals: View[F],
       rootsOfUnity: View[F]
     ) =
  if output.len <= 4:
    simpleFT(output, vals, rootsOfUnity)
    return

  # Recursive Divide-and-Conquer
  let (evenVals, oddVals) = vals.splitAlternate()
  var (outLeft, outRight) = output.splitMiddle()
  let halfROI = rootsOfUnity.skipHalf()

  fft_internal(outLeft, evenVals, halfROI)
  fft_internal(outRight, oddVals, halfROI)

  let half = outLeft.len
  var y_times_root{.noinit.}: F

  for i in 0 ..< half:
    # FFT Butterfly
    y_times_root   .prod(output[i+half], rootsOfUnity[i])
    output[i+half] .diff(output[i], y_times_root)
    output[i]      += y_times_root

func fft*[F](
       desc: FFTDescriptor[F],
       output: var openarray[F],
       vals: openarray[F]): FFT_Status =
  if vals.len > desc.maxWidth:
    return FFTS_TooManyValues
  if not vals.len.uint64.isPowerOf2_vartime():
    return FFTS_SizeNotPowerOfTwo

  let rootz = desc.expandedRootsOfUnity
                  .toView()
                  .slice(0, desc.maxWidth-1, desc.maxWidth div vals.len)

  var voutput = output.toView()
  fft_internal(voutput, vals.toView(), rootz)
  return FFTS_Success

If using the for-loop approach like in Wikipedia's second pseudocode:

algorithm iterative-fft is
    input: Array a of n complex values where n is a power of 2.
    output: Array A the DFT of a.

    bit-reverse-copy(a, A)
    n ← a.length
    for s = 1 to log(n) do
        m ← 2^s
        ω_m ← exp(−2πi/m)
        for k = 0 to n-1 by m do
            ω ← 1
            for j = 0 to m/2 – 1 do
                t ← ω A[k + j + m/2]
                u ← A[k + j]
                A[k + j] ← u + t
                A[k + j + m/2] ← u – t
                ω ← ω ω_m

    return A

We can do parallel-for loops.

But an interface/concept for either spawning tasks or parallelizing loops is needed, and it then needs to be used for effective parallelization of the FFT.


AngelEzquerra commented on May 23, 2024

The gap is large at the moment. And for large FFTs you really need multithreading.

fftr

   1.423 ms    1.461 ms  ±0.017  x1000 pow2 - 65536
 244.258 ms  246.235 ms  ±1.584    x21 prime - 746497
  47.279 ms   47.748 ms  ±0.308   x105 prime-power - 160801
   1.915 ms    1.959 ms  ±0.023  x1000 mult-of-power-of-2 - 20736
   4.171 ms    4.251 ms  ±0.037  x1000 small-comp-large-prime -
  11.567 ms   11.741 ms  ±0.067   x426 small-comp - 100000

fftw

   0.285 ms    0.471 ms  ±0.348  x1000 pow2 - 65536
  55.563 ms   63.118 ms  ±8.796    x79 prime - 746497
   5.052 ms    5.345 ms  ±0.471   x904 prime-power - 160801
   0.114 ms    0.137 ms  ±0.033  x1000 mult-of-power-of-2 - 20736
   0.711 ms    0.754 ms  ±0.025  x1000 small-comp-large-prime - 30270
   0.529 ms    0.560 ms  ±0.033  x1000 small-comp - 100000

Do you have similar measurements for impulse?


mratsim commented on May 23, 2024

Not at the moment; it would be interesting to make some. impulse currently uses the same backend as numpy/scipy.


AngelEzquerra commented on May 23, 2024

@mratsim, I tried to install impulse via nimble but it failed. Assuming we used it to implement the FFT in arraymancer, would you expect to embed it in arraymancer or to install it as a dependency? If the latter, I assume it would require impulse to be a proper nimble package, right?

