masonprotter / bumper.jl Goto Github PK
Bring Your Own Stack
License: MIT License
Hi!
I'm testing various custom CPU array implementations in Julia, and comparing them with stack-allocated and heap-allocated arrays in C.
https://gist.github.com/mdmaas/d1b6b1a69a6b235143d7110237ff4ae8
The test first allocates the inverse squares of integers from 1 to N, and then performs the sum.
This is how it looks for Bumper.jl:
@inline function sumArray_bumper(N)
    @no_escape begin
        smallarray = alloc(Float64, N)
        @turbo for i ∈ 1:N
            smallarray[i] = 1.0 / i^2
        end
        sum = 0.0
        @turbo for i ∈ 1:N
            sum += smallarray[i]
        end
    end
    return sum
end
I am focusing on values of N ranging from 3 to 100, since for larger N most implementations converge to similar timings (about 10% overhead relative to C), with the exception of regular Julia arrays, which are generally slower and thus require much larger N before the overhead is overshadowed by the actual use of memory.
My favourite method would be Bumper, as I think the API is great, but it is the slowest of all the alternatives to standard arrays I'm considering (manually pre-allocating a standard array, MallocArrays from StaticTools, and calling malloc in C). Standard arrays are, of course, slower than Bumper.
Am I doing something wrong? Do you think there could be a way to remove this overhead and approach the performance of, for example, pre-allocated regular arrays?
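For reference, one hedged variant worth benchmarking (this is a sketch assuming Bumper ≥ 0.4's macro API; `@turbo` is omitted for simplicity): pass an explicit `SlabBuffer` to `@no_escape` so the task-local buffer lookup happens once instead of inside the hot function.

```julia
using Bumper

const BUF = SlabBuffer()  # reused across calls; avoids the task-local lookup

@inline function sumArray_bumper2(N)
    @no_escape BUF begin
        smallarray = @alloc(Float64, N)
        for i in 1:N
            smallarray[i] = 1.0 / i^2
        end
        s = 0.0
        for i in 1:N
            s += smallarray[i]
        end
        s
    end
end
```

Whether this closes the gap with pre-allocated arrays would need to be measured; it only removes the buffer-lookup part of the overhead.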
Best,
First of all: cool package and thanks for your work!
While I was working with the package I encountered an error starting with
"Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks."
so here is that bug report.
I wanted to allocate some memory for an array with an abstract eltype. Here is a minimal (not) working example:
using Bumper

abstract type MyType end

struct MyStruct <: MyType
    x::Int
end

Base.sizeof(::Type{MyType}) = sizeof(Int)

@no_escape begin
    foo_arr = @alloc(MyType, 10)
    println(foo_arr)
end
I suppose the answer might be "you cannot define sizeof for your abstract type and expect things to work", but I wanted to open this bug report anyway, as requested.
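For comparison, here is a minimal sketch that works, assuming the eltype is a concrete isbits struct; the abstract version above hands `@alloc` a type with no fixed memory layout, which is presumably what triggers the fault:

```julia
using Bumper

struct MyConcrete
    x::Int
end

@no_escape begin
    arr = @alloc(MyConcrete, 10)   # concrete isbits eltype: layout is known
    for i in eachindex(arr)
        arr[i] = MyConcrete(i)
    end
    println(sum(s -> s.x, arr))    # sums 1 + 2 + ... + 10
end
```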
Here is the full stack trace:
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x15f21dd2f5e -- _show_default at .\show.jl:465
in expression starting at C:\Users\wolf.nederpel\hive\intsect\scripts\random_julia_test.jl:11
_show_default at .\show.jl:465
show_default at .\show.jl:462 [inlined]
show at .\show.jl:457 [inlined]
show_delim_array at .\show.jl:1346
show_delim_array at .\show.jl:1335 [inlined]
show_vector at .\arrayshow.jl:530
show_vector at .\arrayshow.jl:515 [inlined]
show at .\arrayshow.jl:486 [inlined]
print at .\strings\io.jl:35
print at .\strings\io.jl:46
println at .\strings\io.jl:75
unknown function (ip: 0000015f21dd4f1f)
println at .\coreio.jl:4
unknown function (ip: 0000015f21dd2bcb)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
do_call at C:/workdir/src\interpreter.c:126
eval_value at C:/workdir/src\interpreter.c:223
eval_body at C:/workdir/src\interpreter.c:489
jl_interpret_toplevel_thunk at C:/workdir/src\interpreter.c:775
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:934
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:877
ijl_toplevel_eval at C:/workdir/src\toplevel.c:943 [inlined]
ijl_toplevel_eval_in at C:/workdir/src\toplevel.c:985
eval at .\boot.jl:385 [inlined]
include_string at .\loading.jl:2070
_include at .\loading.jl:2130
include at .\client.jl:489
unknown function (ip: 0000015f21dc916b)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
do_call at C:/workdir/src\interpreter.c:126
eval_value at C:/workdir/src\interpreter.c:223
eval_stmt_value at C:/workdir/src\interpreter.c:174 [inlined]
eval_body at C:/workdir/src\interpreter.c:635
jl_interpret_toplevel_thunk at C:/workdir/src\interpreter.c:775
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:934
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:877
ijl_toplevel_eval at C:/workdir/src\toplevel.c:943 [inlined]
ijl_toplevel_eval_in at C:/workdir/src\toplevel.c:985
eval at .\boot.jl:385 [inlined]
eval_user_input at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:150
repl_backend_loop at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:246
#start_repl_backend#46 at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:231
start_repl_backend at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:228
#run_repl#59 at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:389
run_repl at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:375
jfptr_run_repl_95895.1 at C:\Users\wolf.nederpel\AppData\Local\Programs\Julia-1.10.0\lib\julia\sys.dll (unknown line)
#1013 at .\client.jl:432
jfptr_YY.1013_86694.1 at C:\Users\wolf.nederpel\AppData\Local\Programs\Julia-1.10.0\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
jl_f__call_latest at C:/workdir/src\builtins.c:812
#invokelatest#2 at .\essentials.jl:887 [inlined]
invokelatest at .\essentials.jl:884 [inlined]
run_main_repl at .\client.jl:416
exec_options at .\client.jl:333
_start at .\client.jl:552
jfptr__start_86719.1 at C:\Users\wolf.nederpel\AppData\Local\Programs\Julia-1.10.0\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
true_main at C:/workdir/src\jlapi.c:582
jl_repl_entrypoint at C:/workdir/src\jlapi.c:731
mainCRTStartup at C:/workdir/cli\loader_exe.c:58
BaseThreadInitThunk at C:\windows\System32\KERNEL32.DLL (unknown line)
RtlUserThreadStart at C:\windows\SYSTEM32\ntdll.dll (unknown line)
Allocations: 635700 (Pool: 634725; Big: 975); GC: 1
I read in the README:
If you use Bumper.jl, please consider submitting a sample of your use-case so I can include it in the test suite.
Happy to share that I just added support for Bumper.jl in DynamicExpressions.jl, which means people can soon also use it for SymbolicRegression.jl and PySR.
My use-case is coded up in this file with the important part being:
function bumper_eval_tree_array(
    tree::AbstractExpressionNode{T},
    cX::AbstractMatrix{T},
    operators::OperatorEnum,
    ::Val{turbo},
) where {T,turbo}
    result = similar(cX, axes(cX, 2))
    n = size(cX, 2)
    all_ok = Ref(false)
    @no_escape begin
        _result_ok = tree_mapreduce(
            # Leaf nodes: we create an allocation and fill
            # it with the value of the leaf:
            leaf_node -> begin
                ar = @alloc(T, n)
                ok = if leaf_node.constant
                    v = leaf_node.val::T
                    ar .= v
                    isfinite(v)
                else
                    ar .= view(cX, leaf_node.feature, :)
                    true
                end
                ResultOk(ar, ok)
            end,
            # Branch nodes: we simply pass them to the evaluation kernel:
            branch_node -> branch_node,
            # In the evaluation kernel, we combine the branch nodes
            # with the arrays created by the leaf nodes:
            ((args::Vararg{Any,M}) where {M}) ->
                dispatch_kerns!(operators, args..., Val(turbo)),
            tree;
            break_sharing=Val(true),
        )
        x = _result_ok.x
        result .= x
        all_ok[] = _result_ok.ok
    end
    return (result, all_ok[])
end
Basically it's a recursive evaluation scheme for an arbitrary symbolic expression over a 2D array of data. Preliminary results show a massive performance gain with bump allocation! It's even faster than LoopVectorization (the user can also turn on both, though I don't see much further improvement).
The way you can write an integration test is:
using DynamicExpressions: Node, OperatorEnum, eval_tree_array
using Bumper
using Test: @test
using Random: MersenneTwister as RNG

operators = OperatorEnum(binary_operators=(+, -, *), unary_operators=(cos, exp))
x1 = Node{Float32}(feature=1)
x2 = Node{Float32}(feature=2)
x3 = Node{Float32}(feature=3)
tree = cos(x1 * 0.9 - 0.5) + x2 * exp(1.0 - x3 * x3)
# ^ This is a symbolic expression described as a type-stable binary tree

# Evaluate with Bumper:
X = randn(RNG(0), Float32, 3, 1000);
truth, no_nans_truth = eval_tree_array(tree, X, operators)
test, no_nans_test = eval_tree_array(tree, X, operators; bumper=true)
@test truth ≈ test
You could also randomly generate expressions if you want to use this as a way to stress test the bump allocator. The code to generate trees is here,
which lets you do
tree = gen_random_tree_fixed_size(20, operators, 2, Float32)
Cheers,
Miles
P.S., any tips on how I'm using bump allocation would be much appreciated!! For example, I do know exactly how large each allocation should be in advance; can that help me get more perf at all?
Currently, Enzyme.jl's reverse mode autodiff doesn't work correctly with Bumper.jl, because if you give it a `Duplicated` buffer, it'll `+=`-accumulate results into the duplicated buffer, making the answer depend on the state of the buffer at the start of the program.
It'd be good if we could set up some EnzymeRules to explicitly teach Enzyme how to handle Bumper.jl allocations and deallocations. I don't really know how to do this though, so if anyone wants to take it on, or work on it together please do.
I'm not sure there's much advantage to letting people wrap whatever type they like for this thing. Might be better to simply do:
mutable struct AllocBuffer
    ptr::Ptr{UInt8}
    length::Int
    offset::UInt   # a full-width offset; a UInt8 could not address the buffer
end

function AllocBuffer(length::Int; finalize=true)
    ptr = malloc(length)
    out = AllocBuffer(ptr, length, UInt(0))
    if finalize
        finalizer(x -> free(x.ptr), out)
    end
    out
end
which'd make it more similar to `SlabBuffer`. This'd be a breaking change, so I'd like to do it before 1.0 if I do it.
In ArrayAllocators.jl, I made bindings for several allocation functions:
posix_memalign
VirtualAlloc2
VirtualAllocEx
numa_alloc_onnode
numa_alloc_local
What would be a good way to compose ArrayAllocators.jl and Bumper.jl?
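One possible composition, sketched under two assumptions: that ArrayAllocators' `MemAlign` allocator works as in its README, and that Bumper's `AllocBuffer` can be built around a pre-made byte vector as backing storage (if it can't, such a constructor would be the feature to request):

```julia
using Bumper, ArrayAllocators

# Allocate the slab itself via ArrayAllocators (here 64-byte aligned),
# then let Bumper bump-allocate inside it.
backing = Vector{UInt8}(MemAlign(64), 2^20)
buf = AllocBuffer(backing)

@no_escape buf begin
    v = @alloc(Float64, 100)
    v .= 1.0
    println(sum(v))
end
```

The same shape would apply to the NUMA allocators: build the backing array with `numa_alloc_onnode` and hand it to Bumper, so the bump allocations land on a chosen node.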
When running julia with `--check-bounds=no`, something goes wrong with Bumper. It should be noted in the docs.
The MWE is the example from the docs:
using Bumper
using BenchmarkTools
using StrideArrays
function f(x)
    # Set up a scope where memory may be allocated, and does not escape:
    @no_escape begin
        # Allocate a `PtrArray` (see StrideArraysCore.jl) using memory from the default buffer.
        y = @alloc(eltype(x), length(x))
        # Now do some stuff with that vector:
        y .= x .+ 1
        sum(y) # It's okay for the sum of y to escape the block, but references to y itself must not do so!
    end
end
@benchmark f(x) setup=(x = rand(1:10, 30))
Starting julia with `--check-bounds=auto` I get this output:
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
Range (min … max): 19.837 ns … 41.080 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 19.998 ns ┊ GC (median): 0.00%
Time (mean ± σ): 20.250 ns ± 1.138 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▇▆▅▄▃▂▁ ▁▁▁ ▁ ▂
████████████████████▇▇▆█▆▆▅▅▅▆▆▆█▇▆▇▆▆▅▄▅▅▃▅▅▄▅▄▂▂▃▃▄▃▄▅▃▃▄ █
19.8 ns Histogram: log(frequency) by time 24.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
With `--check-bounds=no` it is quite a bit slower, and allocating:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 147.137 μs … 4.958 ms ┊ GC (min … max): 0.00% … 95.54%
Time (median): 152.287 μs ┊ GC (median): 0.00%
Time (mean ± σ): 156.173 μs ± 87.330 μs ┊ GC (mean ± σ): 1.91% ± 3.47%
▁▁▂▅▆█▆▄▄▂▂▂▂▁▂▁
▂▁▃▄▆▇████████████████████▇▆▆▆▆▅▅▅▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
147 μs Histogram: frequency by time 166 μs <
Memory estimate: 49.56 KiB, allocs estimate: 1050.
Julia Version 1.12.0-DEV.606
Commit 6f569c7ba0* (2024-05-27 08:27 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen Threadripper PRO 5945WX 12-Cores
WORD_SIZE: 64
LLVM: libLLVM-17.0.6 (ORCJIT, znver3)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
JULIA_EDITOR = emacs -nw
Is it in principle possible to have some sort of a macro that would take a function and replace its inner calls to `Vector{T}(undef, n)` (and similar) with `@alloc(T, n)`? To me this appears possible after the calls are inlined and some escape analysis is applied, but I have a very limited understanding of the problem.
So the macro could do something like:
function mapsum(f, x)
    arr = Vector{Float64}(undef, length(x))
    arr .= f.(x)
    return sum(arr)
end
transforms into
function mapsum_bumpered(f, x)
    @no_escape begin
        arr = @alloc(Float64, length(x))
        arr .= f.(x)
        ans = sum(arr)
    end
    return ans
end
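A minimal sketch of such a macro, with heavy caveats: it is purely syntactic (no inlining, no escape analysis, so it is only safe when no allocated array outlives the body), and it does not handle `return` inside the rewritten body, which `@no_escape` forbids.

```julia
using Bumper

# Hypothetical @bumpered macro: syntactically replaces Vector{T}(undef, n)
# with @alloc(T, n) and wraps the function body in @no_escape.
macro bumpered(fdef)
    rewrite(ex) = ex
    function rewrite(ex::Expr)
        # Match the pattern Vector{T}(undef, n)
        if ex.head == :call && ex.args[1] isa Expr &&
           ex.args[1].head == :curly && ex.args[1].args[1] == :Vector &&
           length(ex.args) == 3 && ex.args[2] == :undef
            T, n = ex.args[1].args[2], ex.args[3]
            return :(@alloc($T, $n))
        end
        return Expr(ex.head, map(rewrite, ex.args)...)
    end
    body = fdef.args[2]
    fdef.args[2] = :(@no_escape begin $(rewrite(body)) end)
    return esc(fdef)
end

@bumpered function mapsum2(f, x)
    arr = Vector{Float64}(undef, length(x))
    arr .= f.(x)
    sum(arr)   # note: `return sum(arr)` would error inside @no_escape
end
```

A real implementation would also want to match `similar`, `zeros`, etc., and verify somehow that the rewritten allocations don't escape.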
Thanks!
Hi,
I believe your package has a good track record by now; it just works. Probably not many know of it.
Should it be added to Julia, so that e.g. the compiler/optimizer can use it? It seems we could compete with Mojo that way. It deallocates as fast as possible, even before variables go out of scope (unlike in languages like C++).
A first step would even be helpful on its own:
Phase 1.
Just move it over unchanged; this gives more visibility (which could also be had by documenting it in Julia's docs). Julia itself wouldn't use it at first, but it could at any point, by using Bumper.jl as documented.
Phase 2.
This would also be up to the Julia people, and it is the main win from merging: making already existing idiomatic Julia code, in or out of Julia itself, use Bumper.jl transparently.
I recall our discussion, though I can't find it, about dynamically adding to the buffer. I see it's now task-local (would it be per thread, or is that in effect what it is?). I mentioned a problem with dynamically enlarging, so you backed away from it; now I've found a solution, but it seems redundant given the changes I see you've already implemented. I see you now allocate 1/8th of physical memory, which seems way excessive; I think that's the point, so that you never have to enlarge. You rely on virtual memory (RAM not actually used, just virtual address space reserved, with the OS committing more of it transparently). So why 1/8th? Why not even larger, all of it, or smaller? I'm guessing that with e.g. 8 threads you allocate all of it, and with 16 threads a 2x overcommit (which is ok, at least on Linux).
I do not believe overcommitting works on Windows, however; do you know of problems if you have, e.g., very many threads? Also, say Julia runs with 8 threads and 4 such Julias run at once: is that ok? I don't know about macOS, but it's likely similar. Before merging, such use would need to be confirmed ok, or the 1/8th lowered...
Precompiling project...
✗ Bumper
0 dependencies successfully precompiled in 3 seconds. 57 already precompiled.
ERROR: The following 1 direct dependency failed to precompile:
Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e]
Failed to precompile Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e] to /home/lime/.julia/compiled/v1.8/Bumper/jl_ujDpjX.
ERROR: LoadError: UndefVarError: calc_strides_len not defined
Stacktrace:
[1] include
@ ./Base.jl:419 [inlined]
[2] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
@ Base ./loading.jl:1554
[3] top-level scope
@ stdin:1
in expression starting at /home/lime/.julia/packages/Bumper/rK9gd/src/Bumper.jl:1
in expression starting at stdin:1
MWE:
In a fresh REPL:
julia> using Bumper: @no_escape, @alloc
julia> using Random: randn!
julia> T = ComplexF32
ComplexF32 (alias for Complex{Float32})
julia> @no_escape begin
           ar = @alloc(T, 100)
           randn!(ar)
           @. ar = cos(ar)
           sum(ar)
       end
109.13606f0 + 4.8591895f0im
However, if I import StrideArrays, I get an error:
julia> using Bumper: @no_escape, @alloc
julia> using StrideArrays
julia> using Random: randn!
julia> T = ComplexF32
ComplexF32 (alias for Complex{Float32})
julia> @no_escape begin
           ar = @alloc(T, 100)
           randn!(ar)
           @. ar = cos(ar)
           sum(ar)
       end
ERROR: MethodError: no method matching vmaterialize!(::PtrArray{…}, ::Base.Broadcast.Broadcasted{…}, ::Val{…}, ::Val{…}, ::Val{…})
Closest candidates are:
vmaterialize!(::Any, ::Any, ::Val{Mod}, ::Val{UNROLL}) where {Mod, UNROLL}
@ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:753
vmaterialize!(::Union{LinearAlgebra.Adjoint{T, A}, LinearAlgebra.Transpose{T, A}}, ::BC, ::Val{Mod}, ::Val{UNROLL}, ::Val{dontbc}) where {T<:Union{Bool, Float16, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, SIMDTypes.Bit}, N, A<:AbstractArray{T, N}, BC<:Union{Base.Broadcast.Broadcasted, LoopVectorization.Product}, Mod, UNROLL, dontbc}
@ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:682
vmaterialize!(::AbstractArray{T, N}, ::BC, ::Val{Mod}, ::Val{UNROLL}, ::Val{dontbc}) where {T<:Union{Bool, Float16, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, SIMDTypes.Bit}, N, BC<:Union{Base.Broadcast.Broadcasted, LoopVectorization.Product}, Mod, UNROLL, dontbc}
@ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:673
...
Stacktrace:
[1] vmaterialize!
@ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:759 [inlined]
[2] _materialize!
@ StrideArrays ~/.julia/packages/StrideArrays/PeLtr/src/broadcast.jl:181 [inlined]
[3] materialize!(dest::PtrArray{…}, bc::Base.Broadcast.Broadcasted{…})
@ StrideArrays ~/.julia/packages/StrideArrays/PeLtr/src/broadcast.jl:188
[4] macro expansion
@ REPL[5]:4 [inlined]
[5] macro expansion
@ ~/.julia/packages/Bumper/eoK0g/src/internals.jl:74 [inlined]
[6] top-level scope
@ REPL[5]:1
Some type information was truncated. Use `show(err)` to see complete types.
I think maybe a fallback method should be used if the specialized one doesn't exist?
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers. Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment `TagBot fix` on this issue.
I'll open a PR within a few hours, please be patient!
The basic idea is, you have slabs of some size.
When you run out of memory, you allocate a new slab.
Examples:
llvm: https://llvm.org/doxygen/Allocator_8h_source.html
LoopModels: https://github.com/JuliaSIMD/LoopModels/blob/bumprealloc/include/Utilities/Allocators.hpp
LoopModels' is largely a copy of LLVM's, but supports either a bump-up or bump-down. LoopModels' slab size is constant, but LLVM's slabs grow.
A julia struct itself could look like
mutable struct BumpAlloc{Up,SlabSize}
    current::Ptr{Cvoid}
    slabend::Ptr{Cvoid}
    # you could try and get fancy and reduce the number of indirections by having your own array type
    slabs::Vector{Ptr{Cvoid}}
    custom_slabs::Vector{Ptr{Cvoid}}
end
# should probably register a finalizer that `Libc.free`s all the pointers
# optionally use a faster library like `mimalloc` instead of `Libc`
The `custom_slabs` are for objects too big for the `SlabSize`.
The point of keeping them separate was largely that in C++ there may be faster free/delete functions that take the size (i.e. they might exist, and they might be faster).
Given that we don't have that here, we may as well fuse them, unless you find some allocator API that supports sizes.
Being able to grow lets you default to a much smaller slab size.
I was thinking about modifying SimpleChains to use something like this.
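The slab scheme above can be sketched in a few lines (illustrative only, not Bumper's internals; bump-up only, with oversize requests getting a dedicated slab):

```julia
# Minimal slab bump allocator: bump a pointer within the current slab,
# and malloc a fresh slab when the request doesn't fit.
mutable struct SlabAlloc
    current::Ptr{Cvoid}
    slabend::Ptr{Cvoid}
    slabsize::Int
    slabs::Vector{Ptr{Cvoid}}
end

function SlabAlloc(slabsize::Int = 1 << 20)
    p = Libc.malloc(slabsize)
    a = SlabAlloc(p, p + slabsize, slabsize, [p])
    finalizer(x -> foreach(Libc.free, x.slabs), a)   # free all slabs on GC
end

function bump!(a::SlabAlloc, nbytes::Int)
    p = a.current
    if p + nbytes > a.slabend
        # Out of room: grab a fresh slab (oversized if the request is big).
        sz = max(nbytes, a.slabsize)
        s = Libc.malloc(sz)
        push!(a.slabs, s)
        a.current, a.slabend = s + nbytes, s + sz
        return s
    end
    a.current = p + nbytes
    return p
end
```

A real version would also align `nbytes` up to a multiple of the required alignment before bumping.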
julia> function f1()
           @no_escape begin
               y = @alloc(Int, 10)
               Threads.@threads for val in y
                   println(val)
               end
           end
       end
ERROR: LoadError: The `return` keyword is not allowed to be used inside the `@no_escape` macro
I have to nest the `for` loop inside a function to trick it:
julia> function _f(y)
           Threads.@threads for val in y
               println(val)
           end
       end
_f (generic function with 1 method)

julia> function f2()
           @no_escape begin
               y = @alloc(Int, 10)
               _f(y)
           end
       end
f2 (generic function with 1 method)
First off, interesting package! I think my issue is more with StrideArrays and LinearAlgebra not meshing well and Bumper is caught in the middle.
The error I'm getting comes from trying to use `ldiv!`, which requires a factorized matrix, but StrideArrays always tries to produce a PtrArray regardless of the function applied:
X = rand(100, 100)
y = rand(100)

function f(X, y)
    numObs, numFeatures = size(X)
    T = eltype(X)
    @no_escape begin
        Xfact = @alloc(T, numObs, numFeatures)
        b = @alloc(T, numFeatures)
        ŷ = @alloc(T, numObs)
        Xfact .= X
        qr!(Xfact)
        ldiv!(b, Xfact, y) # <-- ERROR: MethodError: no method matching ldiv!(::PtrArray{…}, ::PtrArray{…})
        mul!(ŷ, X, b)
        err = sum((yᵢ - ŷᵢ)^2 for (yᵢ, ŷᵢ) in zip(y, ŷ)) / numObs
    end
    return err
end
I'm guessing there's no easy way to avoid using `PtrArray`s. I can use `X\y`, but this of course allocates, which kind of defeats the purpose.
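One hedged workaround sketch: wrap the bump-allocated memory in a plain `Matrix` via `unsafe_wrap`, so LinearAlgebra dispatches to its own `qr!`/`ldiv!` methods rather than StrideArrays' overloads. The wrapper aliases the buffer, so it must not escape the `@no_escape` block either; whether this keeps the performance benefit would need measuring.

```julia
using Bumper, LinearAlgebra

function solve_qr(X, y)
    numObs, numFeatures = size(X)
    T = eltype(X)
    b = Vector{T}(undef, numFeatures)   # result must outlive the block
    @no_escape begin
        raw = @alloc(T, numObs, numFeatures)
        # Alias the bump allocation as an ordinary Matrix (no copy):
        Xfact = unsafe_wrap(Array, pointer(raw), (numObs, numFeatures))
        Xfact .= X
        F = qr!(Xfact)     # LinearAlgebra's qr!, returns a factorization
        ldiv!(b, F, y)     # solve into b using the factorization object
    end
    return b
end
```

Note this also fixes capturing the factorization: `qr!` returns the factorization object, and it is that object, not the overwritten matrix, that `ldiv!` accepts.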
In the switchover from 0.3 to 0.4, when replacing the function-based `alloc` with the macro-based version, I always got the error that `@alloc` is not within a `@no_escape` block, even though it obviously was. Turns out the problem was my usage of `Bumper.@alloc` instead of just `@alloc`. From a quick glance over the code, the replacement code is looking explicitly for a literal `@alloc`. Perhaps this could be widened?
Dear developers,
I found that the following code gives rise to a stack overflow:
using Bumper
using LinearAlgebra

function trial(x)
    @no_escape begin
        T = @alloc(eltype(x), 2, 2)
        T .= 0
        T[1, 1] = x
        T[2, 2] = x
        eigval, eigvects = eigen(T)
        sum(eigval)
    end
end
julia> trial(2)
Generates the following error:
ERROR: StackOverflowError:
Stacktrace:
[1] AbstractPtrArray
@ ~/.julia/packages/StrideArraysCore/VyBzA/src/ptr_array.jl:199 [inlined]
[2] AbstractPtrArray
@ ~/.julia/packages/StrideArraysCore/VyBzA/src/ptr_array.jl:456 [inlined]
[3] AbstractPtrArray
@ ~/.julia/packages/StrideArraysCore/VyBzA/src/ptr_array.jl:481 [inlined]
[4] view(A::StrideArraysCore.PtrArray{Int64, 2, (1, 2), Tuple{Int64, Int64}, Tuple{Nothing, Nothing}, Tuple{Static.StaticInt{1}, Static.StaticInt{1}}}, i::StepRange{Int64, Int64}) (repeats 79984 times)
@ StrideArraysCore ~/.julia/packages/StrideArraysCore/VyBzA/src/stridearray.jl:263
Am I using Bumper in the wrong way? My understanding is that the memory allocated inside `@no_escape` should not escape the block. Here, the block returns a scalar reduction of the allocated array, so the memory does not escape.
Is there another way to diagonalize a matrix allocated on the Bumper stack?
EDIT: the error occurs in the line that calls `eigen(T)`.
I see that you want nothrow for StaticCompiler.jl, but there are some problems.
It will overwrite memory if you're not careful. I'm thinking you may want to check if the buffer is too small, and then there might be a way to just exit the program? I think you can print something on stderr first and then exit(1), or is there some PANIC, similar to Go's?
While alloc_nothrow works in regular Julia (just not vice versa, which is why it exists), I think the functionality above could be folded into the regular alloc. If you really need to use the other Malloc, could you use it in all cases? That means an extra dependency on the other package; or maybe rather use Libc.malloc directly? You can use Libc.realloc, and then you need a good growing strategy yourself, but you already have one.
I'm not sure what using Julia's regular Vector buys you; it will then be tracked by Julia's GC, probably a minimal slowdown, with no benefit, since you don't want your buffers reclaimed anyway. And it's just an array of bytes, so it can't contain pointers to other objects. Or actually that may be possible, but then they would not be considered by the GC anyway.
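The check-and-exit idea could look roughly like this; it is a sketch against the `AllocBuffer` fields proposed earlier in this thread (`ptr`, `length`, `offset` are assumptions, as is using plain `exit`):

```julia
# Instead of throwing (which a nothrow/static build can't handle), report
# on stderr and terminate when the bump allocation would overflow.
function alloc_bytes_nothrow(buf, nbytes::Int)
    if buf.offset + nbytes > buf.length
        print(stderr, "Bumper: buffer overflow, exiting\n")
        exit(1)   # a static build might instead ccall(:exit, ...) directly
    end
    p = buf.ptr + buf.offset
    buf.offset += nbytes
    return p
end
```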
Hi,
I have the following example where I observe a 2x slowdown with `Bumper.alloc!`.
Could you please confirm that I use the package correctly?
Do you have ideas on how to fix this?
Thank you !
using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
        @no_escape custom_allocator begin
            work1(polys, custom_allocator)
        end
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator=nothing)
    res = 0
    for poly in polys
        new_poly = work2(poly, custom_allocator)
        res += sum(new_poly)
    end
    res
end

function work2(poly::Vector{T}, ::Nothing) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end
###
m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];
@btime work0(polys, use_custom_allocator=false)
# 6.461 ms (10001 allocations: 20.26 MiB)
# 0x0000e2e1c67cdb19
@btime work0(polys, use_custom_allocator=true)
# 14.154 ms (6 allocations: 608 bytes)
# 0x0000e2e1c67cdb19
Running on
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
We should provide an easy API for setting a checkpoint and returning to it that doesn't require indented blocks.
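One possible shape for such an API (the names here are hypothetical, not an existing Bumper interface): record the buffer's position, allocate freely, then roll back explicitly, with no begin/end block required.

```julia
using Bumper

buf = default_buffer()
cp = Bumper.checkpoint_save(buf)       # hypothetical: record buffer position
v = Bumper.alloc!(buf, Float64, 100)   # allocate without @no_escape
v .= 2.0
total = sum(v)
Bumper.checkpoint_restore!(cp)         # hypothetical: free back to cp
```

The pair would give the same lifetime guarantees as `@no_escape` when used correctly, at the cost of making "allocation must not outlive the checkpoint" the caller's responsibility.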
I still get the precompilation error with all the Octavian-specified versions and StrideArrays added. Is it on my end or pkg-related?
[ Info: Precompiling Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e]
ERROR: LoadError: UndefVarError: calc_strides_len not defined
Stacktrace:
[1] include
@ ./Base.jl:419 [inlined]
[2] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_de
ps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing) @ Base ./loading.jl:1554
[3] top-level scope
@ stdin:1
in expression starting at /Users/usr/.julia/packages/Bumper/rK9gd/src/Bumper.jl:1
in expression starting at stdin:1
ERROR: Failed to precompile Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e] to /Users/usr/.julia/compiled/v1.8/Bumper/jl_LpueaC.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
@ Base ./loading.jl:1707
[3] compilecache
@ ./loading.jl:1651 [inlined]
[4] _require(pkg::Base.PkgId)
@ Base ./loading.jl:1337
[5] _require_prelocked(uuidkey::Base.PkgId)
@ Base ./loading.jl:1200
[6] macro expansion
@ ./loading.jl:1180 [inlined]
[7] macro expansion
@ ./lock.jl:223 [inlined]
[8] require(into::Module, mod::Symbol)
@ Base ./loading.jl:1144
[9] eval
@ ./boot.jl:368 [inlined]
[10] eval
@ ./Base.jl:65 [inlined]
[11] repleval(m::Module, code::Expr, #unused#::String)
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:222
[12] (::VSCodeServer.var"#107#109"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:186
[13] with_logstate(f::Function, logstate::Any)
@ Base.CoreLogging ./logging.jl:511
[14] with_logger
@ ./logging.jl:623 [inlined]
[15] (::VSCodeServer.var"#106#108"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
@ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:187
[16] #invokelatest#2
@ ./essentials.jl:729 [inlined]
[17] invokelatest(::Any)
@ Base ./essentials.jl:726
[18] macro expansion
@ ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/eval.jl:34 [inlined]
[19] (::VSCodeServer.var"#61#62")()
@ VSCodeServer ./task.jl:484