masonprotter / Bumper.jl

Bring Your Own Stack

License: MIT License

Language: Julia 100.00%
Topics: arena-allocator, array, bump-allocator, julia, julia-language, performance, stack

Bumper.jl's People

Contributors

jipolanco, masonprotter, pallharaldsson


Bumper.jl's Issues

Performance for very small arrays

Hi!

I'm testing various custom CPU array implementations in Julia and comparing them with stack-allocated and heap-allocated arrays in C.

https://gist.github.com/mdmaas/d1b6b1a69a6b235143d7110237ff4ae8

The test first fills an array with the inverse squares of the integers from 1 to N, and then sums it.

This is how it looks for Bumper.jl:

using Bumper, LoopVectorization

@inline function sumArray_bumper(N)
    s = 0.0  # accumulate outside the block; also avoids shadowing `Base.sum`
    @no_escape begin
        smallarray = alloc(Float64, N)
        @turbo for i ∈ 1:N
            smallarray[i] = 1.0 / i^2
        end
        @turbo for i ∈ 1:N
            s += smallarray[i]
        end
    end
    return s
end

I am focusing on values of N from 3 to 100. For larger N, most implementations converge to similar timings (about 10% overhead relative to C), with the exception of regular Julia arrays, which are generally slower and thus require much larger N before the overhead is overshadowed by the actual use of memory.

My favourite method would be to use Bumper, as I think the API is great, but it is the slowest of all the alternatives to standard arrays I'm considering (manually pre-allocating a standard array, MallocArrays from StaticTools, and calling malloc in C). Standard arrays are, of course, slower than Bumper.

Am I doing something wrong? Do you think there could be a way to remove this overhead and approach the performance of, for example, pre-allocated regular arrays?
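
One untested suggestion for shaving overhead at tiny N (a sketch using the current macro-based API; sumArray_bumper_buf and BUF are hypothetical names): pass an explicit AllocBuffer to @no_escape, so each call skips the task-local default-buffer lookup.

using Bumper, LoopVectorization

const BUF = AllocBuffer(2^20)  # pre-created 1 MiB buffer

@inline function sumArray_bumper_buf(N, buf=BUF)
    s = 0.0
    @no_escape buf begin
        smallarray = @alloc(Float64, N)
        @turbo for i ∈ 1:N
            smallarray[i] = 1.0 / i^2
        end
        @turbo for i ∈ 1:N
            s += smallarray[i]
        end
    end
    return s
end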

Best,

Bug report: allocating custom abstract types

First of all: cool package and thanks for your work!

While I was working with the package I encountered an error starting with
"Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks."
so here is that bug report.

I wanted to allocate some memory for an array with an abstract eltype. Here is a minimal (not) working example:

using Bumper

abstract type MyType end

struct MyStruct <: MyType
    x::Int
end

Base.sizeof(::Type{MyType}) = sizeof(Int)

@no_escape begin
    foo_arr = @alloc(MyType, 10)
    println(foo_arr)
end

I suppose the answer might be "you cannot define sizeof for your abstract type and expect things to work", but I wanted to open this bug report anyway, as requested.
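
For what it's worth, the crash does not occur with a concrete isbits eltype, whose layout Bumper can compute without a custom sizeof. A minimal sketch of that working path (Concrete is a hypothetical concrete counterpart of MyStruct):

using Bumper

struct Concrete
    x::Int
end

@no_escape begin
    arr = @alloc(Concrete, 10)      # concrete isbits eltype: layout is known
    for i in eachindex(arr)
        arr[i] = Concrete(i)        # initialize before reading; @alloc memory is undef
    end
    println(sum(a.x for a in arr))  # only the scalar reduction escapes
end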

Here is the full stack trace:

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x15f21dd2f5e -- _show_default at .\show.jl:465
in expression starting at C:\Users\wolf.nederpel\hive\intsect\scripts\random_julia_test.jl:11
_show_default at .\show.jl:465
show_default at .\show.jl:462 [inlined]
show at .\show.jl:457 [inlined]
show_delim_array at .\show.jl:1346
show_delim_array at .\show.jl:1335 [inlined]
show_vector at .\arrayshow.jl:530
show_vector at .\arrayshow.jl:515 [inlined]
show at .\arrayshow.jl:486 [inlined]
print at .\strings\io.jl:35
print at .\strings\io.jl:46
println at .\strings\io.jl:75
unknown function (ip: 0000015f21dd4f1f)
println at .\coreio.jl:4
unknown function (ip: 0000015f21dd2bcb)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
do_call at C:/workdir/src\interpreter.c:126
eval_value at C:/workdir/src\interpreter.c:223
eval_body at C:/workdir/src\interpreter.c:489
jl_interpret_toplevel_thunk at C:/workdir/src\interpreter.c:775
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:934
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:877
ijl_toplevel_eval at C:/workdir/src\toplevel.c:943 [inlined]
ijl_toplevel_eval_in at C:/workdir/src\toplevel.c:985
eval at .\boot.jl:385 [inlined]
include_string at .\loading.jl:2070
_include at .\loading.jl:2130
include at .\client.jl:489
unknown function (ip: 0000015f21dc916b)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
do_call at C:/workdir/src\interpreter.c:126
eval_value at C:/workdir/src\interpreter.c:223
eval_stmt_value at C:/workdir/src\interpreter.c:174 [inlined]
eval_body at C:/workdir/src\interpreter.c:635
jl_interpret_toplevel_thunk at C:/workdir/src\interpreter.c:775
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:934
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:877
ijl_toplevel_eval at C:/workdir/src\toplevel.c:943 [inlined]
ijl_toplevel_eval_in at C:/workdir/src\toplevel.c:985
eval at .\boot.jl:385 [inlined]
eval_user_input at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:150
repl_backend_loop at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:246
#start_repl_backend#46 at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:231
start_repl_backend at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:228
#run_repl#59 at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:389
run_repl at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:375
jfptr_run_repl_95895.1 at C:\Users\wolf.nederpel\AppData\Local\Programs\Julia-1.10.0\lib\julia\sys.dll (unknown line)
#1013 at .\client.jl:432
jfptr_YY.1013_86694.1 at C:\Users\wolf.nederpel\AppData\Local\Programs\Julia-1.10.0\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
jl_f__call_latest at C:/workdir/src\builtins.c:812
#invokelatest#2 at .\essentials.jl:887 [inlined]
invokelatest at .\essentials.jl:884 [inlined]
run_main_repl at .\client.jl:416
exec_options at .\client.jl:333
_start at .\client.jl:552
jfptr__start_86719.1 at C:\Users\wolf.nederpel\AppData\Local\Programs\Julia-1.10.0\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
true_main at C:/workdir/src\jlapi.c:582
jl_repl_entrypoint at C:/workdir/src\jlapi.c:731
mainCRTStartup at C:/workdir/cli\loader_exe.c:58
BaseThreadInitThunk at C:\windows\System32\KERNEL32.DLL (unknown line)
RtlUserThreadStart at C:\windows\SYSTEM32\ntdll.dll (unknown line)
Allocations: 635700 (Pool: 634725; Big: 975); GC: 1

Integration tests for DynamicExpressions.jl?

I read in the README:

If you use Bumper.jl, please consider submitting a sample of your use-case so I can include it in the test suite.

Happy to share that I just added support for Bumper.jl in DynamicExpressions.jl, which means people can soon also use it for SymbolicRegression.jl and PySR.

My use-case is coded up in this file with the important part being:

function bumper_eval_tree_array(
    tree::AbstractExpressionNode{T},
    cX::AbstractMatrix{T},
    operators::OperatorEnum,
    ::Val{turbo},
) where {T,turbo}
    result = similar(cX, axes(cX, 2))
    n = size(cX, 2)
    all_ok = Ref(false)
    @no_escape begin
        _result_ok = tree_mapreduce(
            # Leaf nodes, we create an allocation and fill
            # it with the value of the leaf:
            leaf_node -> begin
                ar = @alloc(T, n)
                ok = if leaf_node.constant
                    v = leaf_node.val::T
                    ar .= v
                    isfinite(v)
                else
                    ar .= view(cX, leaf_node.feature, :)
                    true
                end
                ResultOk(ar, ok)
            end,
            # Branch nodes, we simply pass them to the evaluation kernel:
            branch_node -> branch_node,
            # In the evaluation kernel, we combine the branch nodes
            # with the arrays created by the leaf nodes:
            ((args::Vararg{Any,M}) where {M}) ->
                dispatch_kerns!(operators, args..., Val(turbo)),
            tree;
            break_sharing=Val(true),
        )
        x = _result_ok.x
        result .= x
        all_ok[] = _result_ok.ok
    end
    return (result, all_ok[])
end

Basically it's a recursive evaluation scheme for an arbitrary symbolic expression over a 2D array of data. Preliminary results show a massive performance gain with bump allocation! It's even faster than LoopVectorization (the user can also turn on both, though I don't see much additional improvement).

The way you can write an integration test is:

using Test
using DynamicExpressions: Node, OperatorEnum, eval_tree_array
using Bumper
using Random: MersenneTwister as RNG

operators = OperatorEnum(binary_operators=(+, -, *), unary_operators=(cos, exp))

x1 = Node{Float32}(feature=1)
x2 = Node{Float32}(feature=2)
x3 = Node{Float32}(feature=3)

tree = cos(x1 * 0.9 - 0.5) + x2 * exp(1.0 - x3 * x3)
# ^ This is a symbolic expression described as a type-stable binary tree

# Evaluate with Bumper:
X = randn(RNG(0), Float32, 3, 1000);

truth, no_nans_truth = eval_tree_array(tree, X, operators)
test, no_nans_test = eval_tree_array(tree, X, operators; bumper=true)

@test truth ≈ test

You could also randomly generate expressions if you want to use this as a way to stress-test the bump allocator. The code to generate trees is here, which lets you do:

tree = gen_random_tree_fixed_size(20, operators, 2, Float32)

Cheers,
Miles

P.S., any tips on how I'm using the bump allocation would be much appreciated! For example, I know exactly how large the allocation should be in advance. Can that help me get more performance at all?
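
One hedged thought on that P.S.: since the total size is known up front, a fixed-size AllocBuffer could be created once and passed explicitly to @no_escape, avoiding both slab growth and the task-local buffer lookup. A sketch, where total_bytes is an assumed upper bound on all @alloc calls in one evaluation:

using Bumper

total_bytes = 8 * 1024^2
buf = AllocBuffer(total_bytes)  # fixed-size buffer: no slab growth mid-evaluation

@no_escape buf begin
    ar = @alloc(Float32, 1000)  # same pattern as the leaf-node allocations above
    ar .= 1f0
    sum(ar)
end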

Add some EnzymeRules

Currently, Enzyme.jl's reverse-mode autodiff doesn't work correctly with Bumper.jl: if you give it a Duplicated buffer, it will accumulate results into the duplicated buffer with +=, making the answer depend on the state of the buffer at the start of the program.

It'd be good if we could set up some EnzymeRules to explicitly teach Enzyme how to handle Bumper.jl allocations and deallocations. I don't really know how to do this, though, so if anyone wants to take it on or work on it together, please do.

Make `AllocBuffer` just store a pointer made (by default) by `malloc`

I'm not sure there's much advantage to letting people wrap whatever type they like for this thing. Might be better to simply do:

mutable struct AllocBuffer
    ptr::Ptr{UInt8}
    length::Int
    offset::UInt
end

function AllocBuffer(length::Int; finalize=true)
    ptr = convert(Ptr{UInt8}, Libc.malloc(length))
    out = AllocBuffer(ptr, length, UInt(0))
    if finalize
        # free the malloc'd memory when the buffer is garbage collected
        finalizer(x -> Libc.free(x.ptr), out)
    end
    out
end

which'd make it more similar to SlabBuffer. This'd be a breaking change, so I'd like to do it before 1.0 if I do it.

Composing with distinct allocators

In ArrayAllocators.jl, I made some bindings for several allocation functions:

  1. posix_memalign
  2. VirtualAlloc2
  3. VirtualAllocEx
  4. numa_alloc_onnode
  5. numa_alloc_local

What would be a good way to compose ArrayAllocators.jl and Bumper.jl?
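
One hedged possibility, assuming the current design where AllocBuffer can wrap user-provided backing memory (see the issue above proposing to change exactly that): allocate the storage with ArrayAllocators.jl and hand it to Bumper.

using Bumper, ArrayAllocators

mem = Vector{UInt8}(MemAlign(64), 2^20)  # 64-byte-aligned backing memory
buf = AllocBuffer(mem)                   # assumes AllocBuffer accepts a backing vector

@no_escape buf begin
    v = @alloc(Float64, 128)
    v .= 1.0
    sum(v)
end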

Massive slowdown when running with `--check-bounds=no`

When running julia with `--check-bounds=no`, something goes wrong with Bumper. It should be noted in the docs.

The MWE is the example from the docs:

using Bumper
using BenchmarkTools
using StrideArrays

function f(x)
    # Set up a scope where memory may be allocated, and does not escape:
    @no_escape begin
        # Allocate a `PtrArray` (see StrideArraysCore.jl) using memory from the default buffer.
        y = @alloc(eltype(x), length(x))
        # Now do some stuff with that vector:
        y .= x .+ 1
        sum(y) # It's okay for the sum of y to escape the block, but references to y itself must not do so!
    end
end

@benchmark f(x) setup=(x = rand(1:10, 30))

Starting julia with `--check-bounds=auto`, I get this output:

BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  19.837 ns … 41.080 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.998 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.250 ns ±  1.138 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▇▆▅▄▃▂▁  ▁▁▁   ▁                                          ▂
  ████████████████████▇▇▆█▆▆▅▅▅▆▆▆█▇▆▇▆▆▅▄▅▅▃▅▅▄▅▄▂▂▃▃▄▃▄▅▃▃▄ █
  19.8 ns      Histogram: log(frequency) by time      24.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

With `--check-bounds=no` it is quite a bit slower, and it allocates:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  147.137 μs …  4.958 ms  ┊ GC (min … max): 0.00% … 95.54%
 Time  (median):     152.287 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   156.173 μs ± 87.330 μs  ┊ GC (mean ± σ):  1.91% ±  3.47%

         ▁▁▂▅▆█▆▄▄▂▂▂▂▁▂▁                                       
  ▂▁▃▄▆▇████████████████████▇▆▆▆▆▅▅▅▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  147 μs          Histogram: frequency by time          166 μs <

 Memory estimate: 49.56 KiB, allocs estimate: 1050.

Julia Version 1.12.0-DEV.606
Commit 6f569c7ba0* (2024-05-27 08:27 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen Threadripper PRO 5945WX 12-Cores
WORD_SIZE: 64
LLVM: libLLVM-17.0.6 (ORCJIT, znver3)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
JULIA_EDITOR = emacs -nw

Possibility of tranforming existing functions to use `Bumper.jl`

Is it in principle possible to have some sort of macro that would take a function and replace its inner calls to Vector{T}(undef, n) (and similar) with @alloc(T, n)? This seems possible to me once the calls are inlined and some escape analysis is applied, but I have a very limited understanding of the problem.
So the macro could do something like:

function mapsum(f, x)
    arr = Vector{Float64}(undef, length(x))
    arr .= f.(x)
    return sum(arr)
end

transforms into

function mapsum_bumpered(f, x)
    @no_escape begin
        arr = @alloc(Float64, length(x))
        arr .= f.(x)
        ans = sum(arr)
    end
    return ans
end
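
A very rough syntactic sketch of what such a macro could look like (a hypothetical @bumperize built on MacroTools, not part of Bumper.jl): it rewrites Vector{T}(undef, n) calls in the body to @alloc(T, n) and wraps the body in @no_escape. It does no inlining or escape analysis, and it breaks on an explicit return inside the body (which @no_escape forbids), so safety is entirely on the user.

using Bumper, MacroTools

macro bumperize(fdef)
    d = MacroTools.splitdef(fdef)
    # replace every `Vector{T}(undef, n)` with `@alloc(T, n)`
    body = MacroTools.postwalk(d[:body]) do ex
        MacroTools.@capture(ex, Vector{T_}(undef, n_)) || return ex
        :(@alloc($T, $n))
    end
    d[:body] = :(@no_escape begin $body end)
    esc(MacroTools.combinedef(d))
end

@bumperize function mapsum_auto(f, x)
    arr = Vector{Float64}(undef, length(x))
    arr .= f.(x)
    sum(arr)  # no explicit `return`, so @no_escape accepts the body
end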

Thanks!

Move into Julia and/or under JuliaLang, as stdlib?

Hi,

I believe your package has a good track record by now; it just works. Probably not many know of it.

Should it be added to Julia, so that e.g. the compiler/optimizer can use it? It seems we could compete with Mojo that way: Mojo deallocates as early as possible, even before variables go out of scope (unlike scope-based deallocation in languages like C++).

A first step would even be helpful on its own:

Phase 1.
Just move it unchanged; this gives more visibility (which could also be had by documenting it in Julia's docs). Julia itself wouldn't use it yet, but it could at any point, by using Bumper.jl as documented.

Phase 2.
This would also be up to the Julia people, and it is the main win of merging: make already-existing idiomatic Julia code, in or out of Julia itself, use Bumper.jl transparently.

I recall our discussion about dynamically adding to the buffer, though I can't find it. I see the buffer is now task-local (would it be per thread, or is that in effect what it is?). I had mentioned a problem with dynamically enlarging, so you backed away from it; I have since found a solution, but it seems redundant given changes you've already implemented. I see you now reserve 1/8th of physical memory, which seems way excessive, though I think that's the point: you never have to enlarge. You rely on virtual memory (RAM not actually used, just address space reserved, with the OS committing more of it transparently). So why 1/8th? Why not even larger, all of it, or smaller? I'm guessing with e.g. 8 threads you reserve all of physical memory, and with 16 a 2x overcommit (which is OK, at least on Linux).

I do not believe overcommitting works on Windows, however, so do you know of problems there if you have very many threads? Also, is running, say, 4 Julia processes at once, each with 8 threads, OK? I don't know about macOS, but it's likely similar to Linux. Before merging, such use would need to be confirmed OK, or the fraction lowered from 1/8th...

95d51c7

Undefined function

Precompiling project...
  ✗ Bumper
  0 dependencies successfully precompiled in 3 seconds. 57 already precompiled.

ERROR: The following 1 direct dependency failed to precompile:

Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e]

Failed to precompile Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e] to /home/lime/.julia/compiled/v1.8/Bumper/jl_ujDpjX.
ERROR: LoadError: UndefVarError: calc_strides_len not defined
Stacktrace:
 [1] include
   @ ./Base.jl:419 [inlined]
 [2] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
   @ Base ./loading.jl:1554
 [3] top-level scope
   @ stdin:1
in expression starting at /home/lime/.julia/packages/Bumper/rK9gd/src/Bumper.jl:1
in expression starting at stdin:1

MethodErrors caused by StrideArrays.jl overrides

MWE:

In a fresh REPL:

julia> using Bumper: @no_escape, @alloc

julia> using Random: randn!

julia> T = ComplexF32
ComplexF32 (alias for Complex{Float32})

julia> @no_escape begin
           ar = @alloc(T, 100)
           randn!(ar)
           @. ar = cos(ar)
           sum(ar)
       end
109.13606f0 + 4.8591895f0im

However, if I import StrideArrays, I get an error:

julia> using Bumper: @no_escape, @alloc

julia> using StrideArrays

julia> using Random: randn!

julia> T = ComplexF32
ComplexF32 (alias for Complex{Float32})

julia> @no_escape begin
           ar = @alloc(T, 100)
           randn!(ar)
           @. ar = cos(ar)
           sum(ar)
       end
ERROR: MethodError: no method matching vmaterialize!(::PtrArray{…}, ::Base.Broadcast.Broadcasted{…}, ::Val{…}, ::Val{…}, ::Val{…})

Closest candidates are:
  vmaterialize!(::Any, ::Any, ::Val{Mod}, ::Val{UNROLL}) where {Mod, UNROLL}
   @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:753
  vmaterialize!(::Union{LinearAlgebra.Adjoint{T, A}, LinearAlgebra.Transpose{T, A}}, ::BC, ::Val{Mod}, ::Val{UNROLL}, ::Val{dontbc}) where {T<:Union{Bool, Float16, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, SIMDTypes.Bit}, N, A<:AbstractArray{T, N}, BC<:Union{Base.Broadcast.Broadcasted, LoopVectorization.Product}, Mod, UNROLL, dontbc}
   @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:682
  vmaterialize!(::AbstractArray{T, N}, ::BC, ::Val{Mod}, ::Val{UNROLL}, ::Val{dontbc}) where {T<:Union{Bool, Float16, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, SIMDTypes.Bit}, N, BC<:Union{Base.Broadcast.Broadcasted, LoopVectorization.Product}, Mod, UNROLL, dontbc}
   @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:673
  ...

Stacktrace:
 [1] vmaterialize!
   @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:759 [inlined]
 [2] _materialize!
   @ StrideArrays ~/.julia/packages/StrideArrays/PeLtr/src/broadcast.jl:181 [inlined]
 [3] materialize!(dest::PtrArray{…}, bc::Base.Broadcast.Broadcasted{…})
   @ StrideArrays ~/.julia/packages/StrideArrays/PeLtr/src/broadcast.jl:188
 [4] macro expansion
   @ REPL[5]:4 [inlined]
 [5] macro expansion
   @ ~/.julia/packages/Bumper/eoK0g/src/internals.jl:74 [inlined]
 [6] top-level scope
   @ REPL[5]:1
Some type information was truncated. Use `show(err)` to see complete types.

I think maybe a fallback method should be used if it doesn't exist?

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Add/use slab-bump allocator?

The basic idea is, you have slabs of some size.
When you run out of memory, you allocate a new slab.

Examples:
llvm: https://llvm.org/doxygen/Allocator_8h_source.html
LoopModels: https://github.com/JuliaSIMD/LoopModels/blob/bumprealloc/include/Utilities/Allocators.hpp

LoopModels' is largely a copy of LLVM's, but supports either bump-up or bump-down.
LoopModels' slab size is constant, but LLVM's slabs grow.

A Julia struct itself could look like:

mutable struct BumpAlloc{Up,SlabSize}
    current::Ptr{Cvoid}
    slabend::Ptr{Cvoid}
    # you could try to get fancy and reduce the number of indirections
    # by having your own array type
    slabs::Vector{Ptr{Cvoid}}
    custom_slabs::Vector{Ptr{Cvoid}}
end
# should probably register a finalizer that `Libc.free`s all the pointers;
# optionally use a faster library like `mimalloc` instead of `Libc`

The custom_slabs are for objects too big for the SlabSize.
The point of keeping them separate was largely that in C++ there may exist faster free/delete functions that take the size (i.e. they might exist, and they might be faster).
Given that we don't have that here, we may as well fuse them, unless you find some allocator API that supports sizes.

Being able to grow lets you default to a much smaller slab size.
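
A minimal sketch of bump-up allocation against the BumpAlloc struct above (alignment handling omitted; newslab! and alloc_custom! are hypothetical helpers, the first mallocing a fresh slab, pushing it onto slabs, updating slabend, and returning its base pointer, the second servicing oversized objects via custom_slabs):

function bumpalloc!(b::BumpAlloc{true,SlabSize}, nbytes::Int) where {SlabSize}
    p = b.current
    if p + nbytes > b.slabend
        nbytes > SlabSize && return alloc_custom!(b, nbytes)  # big-object path
        p = newslab!(b)                                       # start a new slab
    end
    b.current = p + nbytes                                    # bump the pointer
    return p
end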

I was thinking about modifying SimpleChains to use something like this.

@no_escape is incompatible with Threads.@threads

julia> function f1()
           @no_escape begin
               y = @alloc(Int,10)
               Threads.@threads for val in y
                   println(val)
               end
           end
       end
ERROR: LoadError: The `return` keyword is not allowed to be used inside the `@no_escape` macro

I have to nest the for loop inside a function to trick it:

julia> function _f(y)
           Threads.@threads for val in y
               println(val)
           end
       end
_f (generic function with 1 method)
julia> function f2()
           @no_escape begin
               y = @alloc(Int,10)
               _f(y)
           end
       end
f2 (generic function with 1 method)

ldiv! does not accept PtrArray

First off, interesting package! I think my issue is more with StrideArrays and LinearAlgebra not meshing well, and Bumper is caught in the middle.

The error I'm getting comes from trying to use ldiv!, which requires a factorized matrix, but StrideArrays always tries to produce a PtrArray regardless of the function applied:

using Bumper, LinearAlgebra

X = rand(100,100)
y = rand(100)

function f(X,y)
    numObs, numFeatures = size(X)
    T = eltype(X)

    @no_escape begin
        Xfact = @alloc(T, numObs, numFeatures)
        b = @alloc(T, numFeatures)
        ŷ = @alloc(T, numObs)

        Xfact .= X
        qr!(Xfact)
        ldiv!(b,Xfact,y) # <-- ERROR: MethodError: no method matching ldiv!(::PtrArray{…}, ::PtrArray{…})
        mul!(ŷ,X,b)

        err = sum((yᵢ - ŷᵢ)^2 for (yᵢ, ŷᵢ) in zip(y,ŷ)) / numObs
    end

    return err
end

I'm guessing there's no easy way to avoid using PtrArrays. I can use X\y, but that of course allocates, which kind of defeats the purpose.
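
For what it's worth, a possible fix (an untested sketch, assuming LAPACK's qr! accepts a dense PtrArray): keep the factorization object that qr! returns and pass that to ldiv!, rather than the overwritten matrix.

Xfact .= X
F = qr!(Xfact)  # capture the QR factorization object
ldiv!(b, F, y)  # ldiv! dispatches on the Factorization, not the raw array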

@alloc detection too narrow

In the switchover from 0.3 to 0.4, when replacing the function-based alloc with the macro-based version, I always got the error that @alloc is not within a @no_escape block, even though it obviously was.

It turns out the problem was my usage of Bumper.@alloc instead of just @alloc. From a quick glance over the code, the replacement code looks explicitly for a bare @alloc. Perhaps this could be widened?
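
A minimal sketch of the spelling difference, assuming the 0.4 behavior described above:

using Bumper

@no_escape begin
    a = @alloc(Int, 4)           # works: the bare @alloc spelling is rewritten
    # b = Bumper.@alloc(Int, 4)  # errors: the qualified spelling is not detected
    a .= 1
    sum(a)
end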

StackOverflow with eigen

Dear developers,
I found that the following code gives rise to a stack overflow:

using Bumper
using LinearAlgebra

function trial(x)
    @no_escape begin
        T = @alloc(eltype(x), 2, 2)
        T .= 0
        T[1,1] = x
        T[2,2] = x
        eigval, eigvects = eigen(T)
        sum(eigval)
    end
end

julia> trial(2)

This generates the following error:

ERROR: StackOverflowError:
Stacktrace:
 [1] AbstractPtrArray
   @ ~/.julia/packages/StrideArraysCore/VyBzA/src/ptr_array.jl:199 [inlined]
 [2] AbstractPtrArray
   @ ~/.julia/packages/StrideArraysCore/VyBzA/src/ptr_array.jl:456 [inlined]
 [3] AbstractPtrArray
   @ ~/.julia/packages/StrideArraysCore/VyBzA/src/ptr_array.jl:481 [inlined]
 [4] view(A::StrideArraysCore.PtrArray{Int64, 2, (1, 2), Tuple{Int64, Int64}, Tuple{Nothing, Nothing}, Tuple{Static.StaticInt{1}, Static.StaticInt{1}}}, i::StepRange{Int64, Int64}) (repeats 79984 times)
   @ StrideArraysCore ~/.julia/packages/StrideArraysCore/VyBzA/src/stridearray.jl:263

Am I using Bumper in the wrong way? My understanding is that the memory allocated inside @no_escape should not escape the block. Here, the block returns only a scalar reduction of the allocated array, so the memory does not escape.

Is there another way to diagonalize a matrix allocated on the Bumper stack?

EDIT: the error occurs on the line that calls eigen(T).
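
One hedged workaround until the recursion is fixed: copy the bump-allocated array into a plain Matrix before handing it to eigen, so LinearAlgebra never sees the PtrArray (at the cost of one heap allocation). A sketch (trial_copy is a hypothetical variant of trial above):

using Bumper, LinearAlgebra

function trial_copy(x)
    @no_escape begin
        T = @alloc(eltype(x), 2, 2)
        T .= 0
        T[1,1] = x
        T[2,2] = x
        eigval = eigen(Matrix(T)).values  # Matrix(T) heap-copies the 2x2 array
        sum(eigval)
    end
end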

alloc_nothrow needs improvement, or eliminating

I see that you want a nothrow allocator for StaticCompiler.jl, but there are some problems.

It will overwrite memory if you're not careful. I'm thinking you may want to check whether the buffer is too small, and then find a way to just exit the program rather than throw. I think you can print something to stderr first and then exit(1); or is there some panic mechanism, similar to Go's?

While alloc_nothrow works in regular Julia (just not vice versa, which is why it exists), I think the functionality above could be folded into the regular alloc. If you really need the other malloc, could you use it in all cases? That means an extra dependency on the other package; maybe use Libc.malloc directly instead? You could use Libc.realloc, but then you'd need to pick the best growing strategy yourself, though you already have one.

I'm not sure what using Julia's regular Vector as the buffer buys you: it will be tracked by Julia's GC, probably a minimal slowdown but with no benefit, since you don't want your buffers reclaimed anyway. And as it's just an array of bytes, it can't contain pointers to other objects; or it may actually be possible to store them, but they would not be traced by the GC anyway.
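
A sketch of the check-then-exit idea (field names assumed to match the pointer-based AllocBuffer proposal earlier on this page; only libc calls, so it stays exception-free for static compilation):

function alloc_nothrow_checked(buf, nbytes::Int)
    if buf.offset + nbytes > buf.length
        # report and terminate instead of silently overwriting memory
        ccall(:puts, Cint, (Cstring,), "bump buffer exhausted")
        ccall(:exit, Cvoid, (Cint,), 1)
    end
    ptr = buf.ptr + buf.offset
    buf.offset += nbytes
    return ptr
end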

Slowdown when using `alloc!`

Hi,

I have the following example where I observe a 2x slowdown with Bumper.alloc!.

Could you please confirm that I'm using the package correctly?
Do you have ideas on how to fix this?

Thank you !

using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
        @no_escape custom_allocator begin
            work1(polys, custom_allocator)
        end
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator=nothing)
    res = 0
    for poly in polys
        new_poly = work2(poly, custom_allocator)
        res += sum(new_poly)
    end
    res
end

function work2(poly::Vector{T}, ::Nothing) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end

###

m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];

@btime work0(polys, use_custom_allocator=false)
#   6.461 ms (10001 allocations: 20.26 MiB)
# 0x0000e2e1c67cdb19

@btime work0(polys, use_custom_allocator=true)
#   14.154 ms (6 allocations: 608 bytes)
# 0x0000e2e1c67cdb19

Running on

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC) 
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

Precompilation error

I still get the precompilation error with the specified Octavian version and StrideArrays added. Is it on my end or Pkg-related?

[ Info: Precompiling Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e]
ERROR: LoadError: UndefVarError: calc_strides_len not defined
Stacktrace:
 [1] include
   @ ./Base.jl:419 [inlined]
 [2] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
    @ Base ./loading.jl:1554
 [3] top-level scope
   @ stdin:1
in expression starting at /Users/usr/.julia/packages/Bumper/rK9gd/src/Bumper.jl:1
in expression starting at stdin:1
ERROR: Failed to precompile Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e] to /Users/usr/.julia/compiled/v1.8/Bumper/jl_LpueaC.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:1707
  [3] compilecache
    @ ./loading.jl:1651 [inlined]
  [4] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:1337
  [5] _require_prelocked(uuidkey::Base.PkgId)
    @ Base ./loading.jl:1200
  [6] macro expansion
    @ ./loading.jl:1180 [inlined]
  [7] macro expansion
    @ ./lock.jl:223 [inlined]
  [8] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1144
  [9] eval
    @ ./boot.jl:368 [inlined]
 [10] eval
    @ ./Base.jl:65 [inlined]
 [11] repleval(m::Module, code::Expr, #unused#::String)
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:222
 [12] (::VSCodeServer.var"#107#109"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:186
 [13] with_logstate(f::Function, logstate::Any)
    @ Base.CoreLogging ./logging.jl:511
 [14] with_logger
    @ ./logging.jl:623 [inlined]
 [15] (::VSCodeServer.var"#106#108"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:187
 [16] #invokelatest#2
    @ ./essentials.jl:729 [inlined]
 [17] invokelatest(::Any)
    @ Base ./essentials.jl:726
 [18] macro expansion
    @ ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/eval.jl:34 [inlined]
 [19] (::VSCodeServer.var"#61#62")()
    @ VSCodeServer ./task.jl:484
