meggart / DiskArrays.jl (License: Other)
Writing with GDAL to a non-chunked file like ".rst" from a chunked file like ".tif" gives the dreaded "Chunks do not align in dimension 2" error. This is an unfortunate effect of the recent changes; I'm not sure of the exact reason.
Indexing with a vector can be incredibly slow with the current implementation of batchgetindex.
From this comment by @Alexander-Barth
#131 (comment)
NetCDF.jl + DiskArray
julia> using NetCDF
julia> v = NetCDF.open("http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0","water_temp");
julia> size(v)
(4500, 4251, 40, 14969)
julia> v1 = @time v[1000,1000,1,1:10];
1.478774 seconds (3.71 M allocations: 250.669 MiB, 3.42% gc time, 59.32% compilation time)
julia> v1 = @time v[1000,1000,1,1:10];
0.467219 seconds (45 allocations: 1.781 KiB)
julia> v1 = @time v[1000,1000,1,collect(1:10)];
739.847348 seconds (1.32 M allocations: 819.944 MiB, 0.03% gc time, 0.10% compilation time)
From this comment by @Alexander-Barth
https://github.com/meggart/DiskArrays.jl/issues/131#issuecomment-1902759762
NCDatasets 0.12 without DiskArray
julia> using NCDatasets
julia> ds = NCDataset("http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0"); v = ds["water_temp"];
julia> v1 = @time v[1000,1000,1,1:10];
0.726571 seconds (134.15 k allocations: 8.849 MiB, 14.90% compilation time)
julia> v1 = @time v[1000,1000,1,1:10];
0.218793 seconds (62 allocations: 1.984 KiB)
julia> v1 = @time v[1000,1000,1,collect(1:10)];
0.879592 seconds (982.55 k allocations: 66.829 MiB, 9.49% gc time, 65.79% compilation time)
So 12 minutes versus about 1 second for this example. Yes, I know that I can bypass DiskArrays by providing my own batchgetindex. But I don't think that DiskArrays should make the current assumptions if it aims to be generic (if this is the goal).
Can we have the current assumptions (i.e. that reading a full chunk is similarly fast to reading a subset of a chunk) documented?
Concerning the API, another thing I am wondering is whether we need a function named batchgetindex at all. Why not have getindex simply pass the indices to DiskArrays.readblock! and let DiskArrays.readblock! figure out how to best load the data? DiskArrays.readblock! is already specific to the storage format.
I do not have a reproducible example at hand, but there is a problem with using the default Array constructor on chunked AbstractDiskArrays. Adding the following method fixed the code:
function Array(a::AbstractDiskArray)
aout = similar(a)
aout .= a
aout
end
but it is worth investigating and adding some unit tests when fixing this.
We could define a DiskGenerator object with a method for collect that properly uses eachindex to populate the in-memory array in the right order.
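A minimal sketch of that idea (the DiskGenerator name and its collect method are hypothetical; sketched here with eachchunk rather than eachindex so that each chunk is read from disk only once, assuming eachchunk yields tuples of index ranges):

```julia
using DiskArrays

# Hypothetical sketch: wrap a disk array so that collect fills the
# output chunk by chunk, in storage order.
struct DiskGenerator{A}
    parent::A
end

function Base.collect(g::DiskGenerator)
    a = g.parent
    out = Array{eltype(a)}(undef, size(a))
    for c in eachchunk(a)   # c is a tuple of index ranges
        out[c...] = a[c...] # one disk read per chunk
    end
    return out
end
```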
While debugging an issue in NCDatasets using DiskArrays v0.3.20 ( Alexander-Barth/NCDatasets.jl#230 )
I found some behavior of NetCDF.jl with DiskArray with variables of unlimited dimensions that I could not understand.
While the following code with a 2D array (one unlimited) works without error:
import NetCDF
fname2 = tempname()
time_dim = NetCDF.NcDim("time",Int[],unlimited=true);
lon_dim = NetCDF.NcDim("lon",1:3);
v = NetCDF.NcVar("obs",[lon_dim,time_dim]);
NetCDF.create(fname2,v) do nc
nc["obs"][:,1:3] = rand(3,3)
end;
The following code with a single unlimited dimension fails:
import NetCDF
var = collect(1:4)
fname2 = tempname()
time_dim = NetCDF.NcDim("time",Int[],unlimited=true);
v = NetCDF.NcVar("var",[time_dim]);
NetCDF.create(fname2,v) do nc
nc["var"][1:4] = var
end;
with the following error message:
ERROR: BoundsError: attempt to access 0-element CartesianIndices{1, Tuple{Base.OneTo{Int64}}} at index [1:4]
Stacktrace:
[1] throw_boundserror(A::CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, I::Tuple{UnitRange{Int64}})
@ Base ./abstractarray.jl:744
[2] checkbounds
@ ./abstractarray.jl:709 [inlined]
[3] getindex
@ ./multidimensional.jl:358 [inlined]
[4] view
@ ./multidimensional.jl:372 [inlined]
[5] interpret_indices_disk
@ ~/.julia/packages/DiskArrays/VLHvZ/src/diskarray.jl:137 [inlined]
[6] setindex_disk!(a::NcVar{Float64, 1, 6}, v::UnitRange{Int64}, i::UnitRange{Int64})
@ DiskArrays ~/.julia/packages/DiskArrays/VLHvZ/src/diskarray.jl:56
[7] setindex!(a::NcVar{Float64, 1, 6}, v::UnitRange{Int64}, i::UnitRange{Int64})
@ DiskArrays ~/.julia/packages/DiskArrays/VLHvZ/src/diskarray.jl:222
[8] (::var"#15#16")(nc::NcFile)
@ Main ./REPL[54]:2
[9] create(::var"#15#16", ::String, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ NetCDF ~/.julia/packages/NetCDF/7hOe9/src/NetCDF.jl:1022
[10] create(::Function, ::String, ::NcVar{Float64, 1, 6})
@ NetCDF ~/.julia/packages/NetCDF/7hOe9/src/NetCDF.jl:1019
[11] top-level scope
@ REPL[54]:1
NetCDF.jl and the equivalent NCDatasets.jl code both fail at the same line:
https://github.com/meggart/DiskArrays.jl/blob/v0.3.20/src/diskarray.jl#L137
Ref:
#53
This is because subsetchunks is specifically only written for reverse (Lines 43 to 50 in d0da1e0). Should we just subset it as if there were no step? Will that work?
Edit: it doesn't.
Maybe the batch PR can work for this as well? Views like this are very useful for reducing the size of (very) large rasters for plotting. Rasters.jl has been using this prior to 0.3.
Using the setup code from the tests (below), this broadcast of a (2, 2)-chunked array with a vector seems broken:
julia> a_disk = _DiskArray([1 10; 2 20; 3 30; 4 40], chunksize=(2, 2))
Disk Array with size 4 x 2
julia> a_disk .* [1, 2, 3, 4] |> collect
4×2 Matrix{Int64}:
1 9
4 16
10 90
40 160
julia> [1 10; 2 20; 3 30; 4 40] .* [1, 2, 3, 4]
4×2 Matrix{Int64}:
1 10
4 40
9 90
16 160
We get the chunks as columns. The same problem occurs for a Tuple, I found this trying to add Tuple support in #39.
Setup code:
using DiskArrays
using DiskArrays: ReshapedDiskArray, PermutedDiskArray
struct _DiskArray{T,N,A<:AbstractArray{T,N}} <: AbstractDiskArray{T,N}
getindex_count::Ref{Int}
setindex_count::Ref{Int}
parent::A
chunksize::NTuple{N,Int}
end
_DiskArray(a;chunksize=size(a)) = _DiskArray(Ref(0),Ref(0),a,chunksize)
Base.size(a::_DiskArray) = size(a.parent)
DiskArrays.haschunks(::_DiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(a::_DiskArray) = DiskArrays.GridChunks(a,a.chunksize)
getindex_count(a::_DiskArray) = a.getindex_count[]
setindex_count(a::_DiskArray) = a.setindex_count[]
trueparent(a::_DiskArray) = a.parent
getindex_count(a::ReshapedDiskArray) = getindex_count(a.parent)
setindex_count(a::ReshapedDiskArray) = setindex_count(a.parent)
trueparent(a::ReshapedDiskArray) = trueparent(a.parent)
getindex_count(a::PermutedDiskArray) = getindex_count(a.a.parent)
setindex_count(a::PermutedDiskArray) = setindex_count(a.a.parent)
trueparent(a::PermutedDiskArray{T,N,<:PermutedDimsArray{T,N,perm,iperm}}) where {T,N,perm,iperm} = permutedims(trueparent(a.a.parent),perm)
function DiskArrays.readblock!(a::_DiskArray,aout,i::AbstractUnitRange...)
ndims(a) == length(i) || error("Number of indices is not correct")
all(r->isa(r,AbstractUnitRange),i) || error("Not all indices are unit ranges")
#println("reading from indices ", join(string.(i)," "))
a.getindex_count[] += 1
aout .= a.parent[i...]
end
function DiskArrays.writeblock!(a::_DiskArray,v,i::AbstractUnitRange...)
ndims(a) == length(i) || error("Number of indices is not correct")
all(r->isa(r,AbstractUnitRange),i) || error("Not all indices are unit ranges")
#println("Writing to indices ", join(string.(i)," "))
a.setindex_count[] += 1
view(a.parent,i...) .= v
end
I had some ideas for possible additions:
It might be cool for something like arr = SomeDiskArray{Float64}(file_handle, undef, 100, 10) to be part of the interface. This would make it possible to treat a DiskArray almost exactly like an Array; functions like map would "just" work. Maybe even go so far as to allow arr = SomeDiskArray{Float64}(undef, 100, 10), where the file object is automatically created in the current working dir with some temp name.
If a DiskArray implements the resize! method, we can have an automatic append! defined for any DiskArray.
For concatenation, we can define a cat!(out, A..., dims=dims) function. If out can be resized, it can be any size; if it cannot, the size of out must be the sum of the sizes of all A. Maybe use a trait to specify whether or not a DiskArray can be resized. If we have initialisation defined, we can even have cat work like it does for normal arrays: it'll just initialise a new array and write into it.
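For the resize!/append! idea, a rough sketch (assuming a resize! method exists for the concrete DiskArray; none of these methods are part of the current interface):

```julia
using DiskArrays

# Hypothetical sketch: a generic append! for any one-dimensional
# DiskArray that implements resize!.
function Base.append!(a::AbstractDiskArray{T,1}, values) where {T}
    n = length(a)
    resize!(a, n + length(values))   # assumed to grow the on-disk variable
    a[n+1:n+length(values)] = values # write the new tail
    return a
end
```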
Batch indexing with batches of tuples is not type-stable at the moment.
julia> @code_warntype c.data[[(1, 1), (1, 2)], :, :]
MethodInstance for getindex(::DiskArrayTools.DiskArrayStack{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 3}, 1}, ::Vector{Tuple{Int64, Int64}}, ::Colon, ::Colon)
from getindex(a::DiskArrays.AbstractDiskArray, i...) in DiskArrays at /home/gkraemer/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:174
Arguments
#self#::Core.Const(getindex)
a::DiskArrayTools.DiskArrayStack{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 3}, 1}
i::Tuple{Vector{Tuple{Int64, Int64}}, Colon, Colon}
Body::Any
1 ─ %1 = Core.tuple(a)::Tuple{DiskArrayTools.DiskArrayStack{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 3}, 1}}
│ %2 = Core._apply_iterate(Base.iterate, DiskArrays.getindex_disk, %1, i)::Any
└── return %2
In NCDatasets 0.12.17 I had an optimization converting var[:,:,[1,2]] into var[:,:,1:2], and var[:,:,[1,2,10,11]] into two load operations v[:,:,1:2] and v[:,:,10:11]. This was implemented here:
https://github.com/Alexander-Barth/NCDatasets.jl/blob/v0.12.17/src/cfvariable.jl#L359-L436
Now a call to v[:,:,[1,2]] loads every element individually (which is quite slow). In fact, the relevant code is still in NCDatasets 0.13 but is no longer called. Maybe this functionality should go into DiskArrays?
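The range-merging optimisation could be sketched like this (a hypothetical helper, not existing DiskArrays API):

```julia
# Split a sorted vector of indices into contiguous UnitRanges, so that
# [1, 2, 10, 11] becomes [1:2, 10:11] and each range needs one read.
function to_ranges(inds::AbstractVector{<:Integer})
    ranges = UnitRange{Int}[]
    isempty(inds) && return ranges
    start = stop = inds[1]
    for i in @view inds[2:end]
        if i == stop + 1
            stop = i     # extend the current run
        else
            push!(ranges, start:stop)
            start = stop = i
        end
    end
    push!(ranges, start:stop)
    return ranges
end
```

For example, to_ranges([1, 2, 10, 11]) returns [1:2, 10:11], and each of those ranges can be passed to a single readblock! call.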
Can we work around this? This being an error is a problem for using chunked broadcast as a default.
I'm not totally across the underlying mechanisms yet, but I can imagine two possibilities:
Then the error could be a warning.
Equality of an array and a view of a DiskArray based Raster
has broken with 0.3 when the array is Unchunked
:
using Rasters
url = "https://download.osgeo.org/geotiff/samples/gdal_eg/cea.tif"
path = download(url)
A = Raster(path)
view(A, 1, 2:3, 1) == [0x00, 0x6b]
Gives the error:
ERROR: MethodError: Cannot `convert` an object of type Tuple{UnitRange{Int64}} to an object of type Int64
Closest candidates are:
convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at /opt/julia/share/julia/base/twiceprecision.jl:262
convert(::Type{T}, ::AbstractChar) where T<:Number at /opt/julia/share/julia/base/char.jl:185
convert(::Type{T}, ::CartesianIndex{1}) where T<:Number at /opt/julia/share/julia/base/multidimensional.jl:136
...
Stacktrace:
[1] DiskArrays.RegularChunks(cs::Tuple{UnitRange{Int64}}, offset::Int64, s::Int64)
@ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:17
[2] (::DiskArrays.var"#34#37")(s::Int64, cs::Tuple{UnitRange{Int64}}, of::Int64)
@ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:101
[3] (::Base.var"#4#5"{DiskArrays.var"#34#37"})(a::Tuple{Int64, Tuple{UnitRange{Int64}}, Int64})
@ Base ./generator.jl:36
[4] iterate
@ ./generator.jl:47 [inlined]
[5] collect(itr::Base.Generator{Base.Iterators.Zip{Tuple{Tuple{Int64}, DiskArrays.GridChunks{1}, Tuple{Int64}}}, Base.var"#4#5"{DiskArrays.var"#34#37"}})
@ Base ./array.jl:724
[6] map
@ ./abstractarray.jl:2948 [inlined]
[7] DiskArrays.GridChunks(a::Tuple{Int64}, chunksize::DiskArrays.GridChunks{1}; offset::Tuple{Int64})
@ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:100
[8] #GridChunks#28
@ ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:98 [inlined]
[9] GridChunks
@ ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:98 [inlined]
[10] eachchunk_view(#unused#::DiskArrays.Unchunked, a::SubArray{UInt8, 1, FileArray{GDALfile, UInt8, 3, Nothing, DiskArrays.GridChunks{3}, DiskArrays.Unchunked}, Tuple{Int64, UnitRange{Int64}, Int64}, false})
@ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/subarrays.jl:35
[11] eachchunk(a::DiskArrays.SubDiskArray{UInt8, 1})
@ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/subarrays.jl:26
[12] iterate(a::DiskArrays.SubDiskArray{UInt8, 1})
@ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/iterator.jl:4
[13] iterate
@ ~/.julia/dev/DimensionalData/src/array/array.jl:66 [inlined]
[14] _zip_iterate_some
@ ./iterators.jl:358 [inlined]
[15] _zip_iterate_all
@ ./iterators.jl:350 [inlined]
[16] iterate
@ ./iterators.jl:340 [inlined]
[17] ==(A::Raster{UInt8, 1, Tuple{Y{Projected{Float64, LinRange{Float64, Int64}, ReverseOrdered, Regular{Float64}, Intervals{Start}, Metadata{GDALfile, Dict{Any, Any}}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing, Y{Colon}}}}, Tuple{X{Projected{Float64, LinRange{Float64, Int64}, ForwardOrdered, Regular{Float64}, Intervals{Start}, Metadata{GDALfile, Dict{Any, Any}}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing, X{Colon}}}, Band{Categorical{Int64, UnitRange{Int64}, ForwardOrdered, NoMetadata}}}, DiskArrays.SubDiskArray{UInt8, 1}, Symbol, Metadata{GDALfile, Dict{Symbol, Any}}, Nothing}, B::Vector{UInt8})
@ Base ./abstractarray.jl:2539
[18] top-level scope
@ REPL[125]:1
@meggart if you know what is going on with this stacktrace that would be helpful
As pointed out in #15, broadcast can do pretty much everything. But one issue is that you have to call collect and allocate an array. In GeoData.jl I'm using broadcast to iterate over arrays, e.g. to calculate the ranges of non-nodata values for a trim method, so the operation is chunked. But I still need to call collect on the broadcast object, which allocates.
It would be good if there was a way to iterate over the object with chunking that didn't allocate. Maybe that's possible already?
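One workaround under the current interface is to loop over eachchunk manually, so that only one chunk is in memory at a time (a sketch, not existing API; the reduction here is just an example):

```julia
using DiskArrays

# Count elements matching a predicate without materialising the whole
# array: only one chunk is resident at a time.
function chunked_count(pred, a::AbstractDiskArray)
    n = 0
    for c in eachchunk(a)      # c is a tuple of index ranges
        n += count(pred, a[c...])  # reads a single chunk
    end
    return n
end
```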
Would ChunkCachedArrays.jl be a better name?
The links of the documentation badges go to 404 errors
This doesn't occur when the DiskArray is shown on its own. This may be from alternate methods of show being called for an array of arrays. We may need to define show without a MIME type?
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml
to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix
on this issue.
I'll open a PR within a few hours, please be patient!
As a follow-up of the discussion (Alexander-Barth/NCDatasets.jl#79), I would like to know if unlimited dimensions (i.e. growable) could be supported by DiskArrays. I hope it is OK for you that I make a separate issue here. It is easier for me to track what is already done and what still needs to be discussed.
The last proposition by Fabian has been:
Maybe it would help if I added a keyword argument checkbounds=false to the DiskArrays setindex implementation. So when you call setindex! in your DimensionallArrays code you can explicitly enforce bounds checking. In addition we can still discuss what the default behavior for any DiskArray should be...
I think this would be a good idea.
One could also annotate the type DiskArray
with an additional parameter whether its size is fixed or not:
using Base: setindex!, size
struct FixedSize; end
struct MyArray{T,N,TFS} <: AbstractArray{T,N}; end
Base.size(A::MyArray) = (10,)
function fixedsize(::MyArray{T,N,TFS}) where {T,N,TFS}
return TFS == FixedSize
end
function Base.setindex!(A::MyArray,value,i::Int)
if fixedsize(A) && (i > 10)
error("you can't do that: the array has a fixed size")
end
println("ok ",i)
end
# growable array
A = MyArray{Int,1,Nothing}()
A[5] = 1
A[15] = 1
# fixed-size array
Af = MyArray{Int,1,FixedSize}()
Af[5] = 1
Af[15] = 1 # error
It is decided during creation of a DiskArray whether the array can grow or not, and all subsequent calls can use A[index] = value rather than setindex!(A,value,index,checkbounds=true/false). The compiler should be able to specialize for fixed and growable arrays (there could even be two definitions of setindex! for the two cases).
In regular Julia arrays, it is possible to index an N-dimensional array with more than N indices (see the documentation). When using an AbstractVector for the additional indices, this fails with a DiskArray, which is inconsistent with the behaviour of Julia arrays:
Julia arrays:
julia> A = [8,6,7];
julia> A[2, [1]] # or A[2, 1:1]
1-element Vector{Int64}:
6
_DiskArray from the tests:
julia> DA = _DiskArray(A);
julia> DA[2, [1]] # or DA[2, 1:1]
ERROR: BoundsError: attempt to access 3-element _DiskArray{Int64, 1, Vector{Int64}} at index [2, [1]]
Stacktrace:
[1] #7
@ ~/.julia/dev/DiskArrays/src/diskarray.jl:92 [inlined]
[2] foreach
@ ./abstractarray.jl:2712 [inlined]
[3] interpret_indices_disk(A::_DiskArray{Int64, 1, Vector{Int64}}, r::Tuple{Int64, Vector{Int64}})
@ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:91
[4] getindex_disk(::_DiskArray{Int64, 1, Vector{Int64}}, ::Int64, ::Vararg{Any})
@ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:29
[5] getindex(::_DiskArray{Int64, 1, Vector{Int64}}, ::Int64, ::Vector{Int64})
@ Main ~/.julia/dev/DiskArrays/src/diskarray.jl:178
[6] top-level scope
@ REPL[56]:1
Note that there's no issue when indexing with an integer (DA[2, 1] works fine).
This issue originated from here
cc @rafaqz
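One way to handle this in interpret_indices_disk would be to strip trailing indices beyond ndims before dispatching to readblock!; a sketch (strip_trailing is a hypothetical helper, not existing API):

```julia
# Strip indices past ndims(A), checking that each trailing index only
# selects the first element of a virtual singleton dimension (so 1,
# [1] and 1:1 are all accepted, anything else is out of bounds).
function strip_trailing(A, I...)
    n = ndims(A)
    for i in I[n+1:end]
        all(==(1), i) || throw(BoundsError(A, I))
    end
    return I[1:n]
end
```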
Failing here for a broadcast between 2 and 3 dimensional arrays:
https://github.com/meggart/DiskArrays.jl/blob/master/src/ops.jl#L60
ERROR: BoundsError
Stacktrace:
[1] getindex
@ ./number.jl:96 [inlined]
[2] #112
@ ~/.julia/dev/DiskArrays/src/ops.jl:61 [inlined]
[3] map(f::DiskArrays.var"#112#120", t::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}, Int64})
@ Base ./tuple.jl:215
[4] common_chunks(::Tuple{Int64, Int64, Int64}, ::GeoData.FileArray{GeoData.GDALfile, Float32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, ::Vararg{Any, N} where N)
@ DiskArrays ~/.julia/dev/DiskArrays/src/ops.jl:61
[5] eachchunk(a::DiskArrays.BroadcastDiskArray{Bool, 3, Broadcasted{DiskArrays.ChunkStyle{3}, Nothing, typeof(==), Tuple{GeoData.FileArray{GeoData.GDALfile, Float32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, GeoData.FileArray{SMAPfile, Float32, 2, String, Tuple{Int64, Int64}, Symbol, Nothing}}}})
@ DiskArrays ~/.julia/dev/DiskArrays/src/ops.jl:22
[6] iterate(a::DiskArrays.BroadcastDiskArray{Bool, 3, Broadcasted{DiskArrays.ChunkStyle{3}, Nothing, typeof(==), Tuple{GeoData.FileArray{GeoData.GDALfile, Float32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, GeoData.FileArray{SMAPfile, Float32, 2, String, Tuple{Int64, Int64}, Symbol, Nothing}}}})
@ DiskArrays ~/.julia/dev/DiskArrays/src/iterator.jl:4
[7] _all
Because at that point:
tt = ((3856, 0), (1624, 0), 1)
I'm not quite sure what the solution is.
I'm implementing a DiskArrays.jl interface in GeoData.jl. But it doesn't keep files open - they are opened when data is required, and closed again afterwards.
The issue is passing opened files to DiskArrays to apply chunked read/write e.g. during broadcast, and closing the file again afterwards. Using closures or some kind of callback function that runs after all chunks are processed should help here - but I'm not sure exactly where that should go.
Linear indexing doesn't seem to work with an AbstractDiskArray. Is that intentional, or do I need to add something to the interface? Here FileArray <: AbstractDiskArray:
julia> A[1]
ERROR: BoundsError: attempt to access 267×178×1 GeoData.FileArray{GeoData.GDALfile, Int32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}} at index [1]
Stacktrace:
[1] interpret_indices_disk
@ ~/.julia/packages/DiskArrays/QATDH/src/DiskArrays.jl:118 [inlined]
[2] getindex_disk(a::GeoData.FileArray{GeoData.GDALfile, Int32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, i::Int64)
@ DiskArrays ~/.julia/packages/DiskArrays/QATDH/src/DiskArrays.jl:58
[3] getindex
@ ~/.julia/packages/DiskArrays/QATDH/src/DiskArrays.jl:186 [inlined]
[4] getindex(A::GeoArray{Int32, 3, Tuple{X{LinRange{Float64}, Projected{Ordered{ForwardIndex, ForwardArray, ForwardRelation}, Regular{Float64}, Intervals{Start}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing}, Metadata{GeoData.GDALfile, Dict{Any, Any}}}, Y{LinRange{Float64}, Projected{Ordered{ReverseIndex, ReverseArray, ForwardRelation}, Regular{Float64}, Intervals{Start}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing}, Metadata{GeoData.GDALfile, Dict{Any, Any}}}, Band{UnitRange{Int64}, Categorical{Ordered{ForwardIndex, ForwardArray, ForwardRelation}}, NoMetadata}}, Tuple{}, GeoData.FileArray{GeoData.GDALfile, Int32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, Symbol, Metadata{GeoData.GDALfile, Dict{Symbol, Any}}, Float64}, i::Int64)
@ DimensionalData ~/.julia/dev/DimensionalData/src/indexing.jl:22
[5] top-level scope
@ REPL[56]:1
According to the documentation of cat it should be possible to call it with an iterable of dimensions so that "the positions along these dimensions are increased simultaneously for each input array, filling with zero elsewhere. This allows one to construct block-diagonal matrices as cat(matrices...; dims=(1,2)), and their higher-dimensional analogues."
We should try to adhere to this interface and I am wondering whether we would need to have a new type ConcatBlockDiskArray to overload the writeblock and readblock functions.
This part of the cat interface is used for example in DimensionalData. Therefore we can't currently use cat of a DiskArray backed AbstractDimArray.
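For the plain (single-dimension) case, the read side of such a ConcatBlockDiskArray could be sketched as follows (hypothetical and 1-D only; it does not cover the block-diagonal zero-fill of dims=(1,2)):

```julia
# Fill aout with the elements of range r taken from the lazy
# concatenation of parents, splitting r at each parent's offset.
function concat_read!(aout, parents, r::AbstractUnitRange)
    offset = 0
    pos = 1
    for p in parents
        sub = intersect(r, offset+1:offset+length(p))  # part of r inside p
        if !isempty(sub)
            aout[pos:pos+length(sub)-1] = p[sub .- offset]
            pos += length(sub)
        end
        offset += length(p)
    end
    return aout
end
```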
This is a low-priority issue for me, but when DiskArrays are used for in-memory arrays, it might be useful to adapt the default chunk size for transposed arrays:
julia> DiskArrays.default_chunk_size[] = 1
1
julia> eachchunk(ones(4000,4000))
1×130 DiskArrays.GridChunks{2, Tuple{DiskArrays.RegularChunks, DiskArrays.RegularChunks}}:
(1:4000, 1:31) (1:4000, 32:62) (1:4000, 63:93) … (1:4000, 4000:4000)
julia> eachchunk(ones(4000,4000)')
1×130 DiskArrays.GridChunks{2, Tuple{DiskArrays.RegularChunks, DiskArrays.RegularChunks}}:
(1:4000, 1:31) (1:4000, 32:62) (1:4000, 63:93) … (1:4000, 4000:4000)
For a transposed array I would expect the first chunk to be (1:31, 1:4000). I guess one can use stride to check this.
I am not sure if OffsetArray is supposed to be supported, but currently all indices from eachchunk are one-based:
using DiskArrays, OffsetArrays
DiskArrays.default_chunk_size[] = 1
OA = OffsetArray(ones(4000,4000),(-1,-1));
c = eachchunk(OA);
OA[last(c)...]
# ERROR: BoundsError: attempt to access 4000×4000 OffsetArray(::Matrix{Float64}, 0:3999, 0:3999) with eltype Float64 with indices 0:3999×0:3999 at index [1:4000, 4000:4000]
None of these issues affect me currently in any production code, but I thought it might be useful to notify you about them.
I am using DiskArrays v0.3.22.
Hi @rafaqz,
as discussed on discourse you showed some interest in defining something like an AbstractDiskArray. I am not sure if you have started something like this already, but since I am trying to adapt the indexing behavior of NetCDF.jl and Zarr.jl and want to avoid double work, I started this package, which might be merged with your efforts. The basic idea is that all a user has to provide is a function that reads a cartesian subrange from an array that is on disk, and this package takes care of the remaining index details, translating trailing and missing singleton dimensions etc.
Currently this supports getindex only, but you might get an idea of what such a package could be. Your opinion would be highly appreciated.
Next I would continue to remove the getindex implementations from Zarr and NetCDF and instead use this interface here, so that any indexing improvement would help both packages.
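To illustrate, a minimal implementation against this interface might look like the following sketch (the wrapped in-memory Array stands in for on-disk storage; without further methods the default Unchunked behavior applies):

```julia
using DiskArrays

# The implementer only provides size and a contiguous-range read;
# all other indexing is derived from readblock!.
struct RangeOnlyArray{T,N} <: AbstractDiskArray{T,N}
    data::Array{T,N}   # stands in for the on-disk storage
end

Base.size(a::RangeOnlyArray) = size(a.data)

function DiskArrays.readblock!(a::RangeOnlyArray, aout, r::AbstractUnitRange...)
    aout .= a.data[r...]   # a single contiguous read
end
```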
Here a minimal working example:
using Pkg
Pkg.activate("/Net/Groups/BGI/people/dpabon/nfdi4earth_oemc")
using Distributed
println(myid()," ", DEPOT_PATH)
addprocs(10)
@everywhere begin
using Pkg
Pkg.activate("/Net/Groups/BGI/people/dpabon/nfdi4earth_oemc")
using YAXArrays
using Zarr
function simple_function(out1, cube_in1, cube_in2)
#=
out1 .= 1
=#
end
end
lcc_final_cube = open_dataset("/Net/Groups/BGI/scratch/dpabon/lcc_final_cube_2002_2011_8d-0.083deg.zarr")
lcc_final_cube = Cube(lcc_final_cube)
lcc_classes = ["Evergreen_Broadleaf_Forests",
"Deciduous_Broadleaf_Forests",
"Mixed_Forest",
"Savannas",
"Grasslands",
"Croplands"]
lcc_small = lcc_final_cube[lon =(-78, -68), lat = (-5, 15), time = 2002:2003, classes = lcc_classes]
lst_small = open_dataset("/Net/Groups/BGI/scratch/dpabon/lst_small.zarr")
lst_small = Cube(lst_small)
renameaxis!(lcc_small, "lon" => getAxis("lon", lst_small))
renameaxis!(lcc_small, "lat" => getAxis("lat", lst_small))
indims_cube_1 = InDims("time")
indims_cube_2 = InDims("time", "classes")
out_1_dims = OutDims("time", CategoricalAxis("summary_stat", ["rsquared"]))
mapCube(simple_function, (lst_small, lcc_small), indims = (indims_cube_1, indims_cube_2), outdims = out_1_dims)
produces:
ERROR: On worker 2:
BoundsError: attempt to access 20×6×120×45 Array{Float32, 4} at index [1:20, 1, 1225:1296, 901:945]
Stacktrace:
[1] throw_boundserror
@ ./abstractarray.jl:703
[2] checkbounds
@ ./abstractarray.jl:668 [inlined]
[3] _setindex!
@ ./multidimensional.jl:929 [inlined]
[4] setindex!
@ ./abstractarray.jl:1344 [inlined]
[5] filldata!
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:127
[6] #disk_getindex_batch!#38
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:113
[7] disk_getindex_batch
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:121
[8] batchgetindex
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:72
[9] getindex_disk
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:28
[10] getindex
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:177
[11] readblock!
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrayTools/WsUY6/src/DiskArrayTools.jl:250
[12] readblock!
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/subarray.jl:25
[13] getindex_disk
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:31
[14] getindex
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:177
[15] updatear
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:639
[16] #78
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:571
[17] #59
@ ./tuple.jl:556 [inlined]
[18] BottomRF
@ ./reduce.jl:81 [inlined]
[19] _foldl_impl
@ ./reduce.jl:62
[20] foldl_impl
@ ./reduce.jl:48 [inlined]
[21] mapfoldl_impl
@ ./reduce.jl:44 [inlined]
[22] #mapfoldl#259
@ ./reduce.jl:170 [inlined]
[23] #foldl#260
@ ./reduce.jl:193 [inlined]
[24] foreach
@ ./tuple.jl:556 [inlined]
[25] updatears
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:568
[26] updateinars
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:656
[27] #107
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:700
[28] fnew
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:665
[29] #56
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1016
[30] #invokelatest#2
@ ./essentials.jl:729
[31] invokelatest
@ ./essentials.jl:726
[32] #110
@ /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:285
[33] run_work_thunk
@ /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:70
[34] macro expansion
@ /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:285 [inlined]
[35] #109
@ ./task.jl:484
Stacktrace:
[1] (::Base.var"#939#941")(x::Task)
@ Base ./asyncmap.jl:177
[2] foreach(f::Base.var"#939#941", itr::Vector{Any})
@ Base ./abstractarray.jl:2774
[3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::DiskArrays.GridChunks{2})
@ Base ./asyncmap.jl:177
[4] wrap_n_exec_twice
@ ./asyncmap.jl:153 [inlined]
[5] #async_usemap#924
@ ./asyncmap.jl:103 [inlined]
[6] #asyncmap#923
@ ./asyncmap.jl:81 [inlined]
[7] pmap(f::Function, p::WorkerPool, c::DiskArrays.GridChunks{2}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
@ Distributed /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:126
[8] pmap(f::Function, p::WorkerPool, c::DiskArrays.GridChunks{2})
@ Distributed /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:99
[9] macro expansion
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1015 [inlined]
[10] macro expansion
@ ./task.jl:454 [inlined]
[11] macro expansion
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1014 [inlined]
[12] macro expansion
@ ./task.jl:454 [inlined]
[13] progress_map(::Function, ::Vararg{Any}; mapfun::typeof(pmap), progress::ProgressMeter.Progress, channel_bufflen::Int64, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ ProgressMeter /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1007
[14] #progress_pmap#60
@ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1032 [inlined]
[15] pmap_with_data(f::Function, p::WorkerPool, c::DiskArrays.GridChunks{2}; initfunc::Function, progress::ProgressMeter.Progress, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:668
[16] pmap_with_data(f::Function, c::DiskArrays.GridChunks{2}; initfunc::Function, kwargs::Base.Pairs{Symbol, ProgressMeter.Progress, Tuple{Symbol}, NamedTuple{(:progress,), Tuple{ProgressMeter.Progress}}})
@ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:673
[17] runLoop(dc::YAXArrays.DAT.DATConfig{2, 1}, showprog::Bool)
@ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:698
[18] mapCube(::typeof(simple_function), ::Tuple{YAXArray{Union{Missing, Float32}, 3, DiskArrayTools.CFDiskArray{Float32, 3, Float32, DiskArrays.SubDiskArray{Float32, 3}}, Vector{RangeAxis}}, YAXArray{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 4}, Vector{CubeAxis}}}; max_cache::Float64, indims::Tuple{InDims, InDims}, outdims::OutDims, inplace::Bool, ispar::Bool, debug::Bool, include_loopvars::Bool, showprog::Bool, irregular_loopranges::Bool, nthreads::Dict{Int64, Int64}, loopchunksize::Dict{Any, Any}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:475
[19] top-level scope
@ ~/dpabon/nfdi4earth_oemc/bin/julia/exploratory_analysis/minimal_example_batch_issue.jl:50
What is the most efficient way to access discrete columns of a zarr array? This follows on from the conversation:
here
Let's say I have a Zarr array that looks like this:
z = rand(100,100,1000)
and let's pretend that z is a chunked Zarr array that lives on S3. Now if I want to get a series of discrete columns from z:
z[3,5,:]
z[5,9,:]
z[10,6,:]
Is there a way I can do this that is efficient and does not read in the same chunk of data more than once?
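One way to avoid re-reading chunks is to group the requested (row, column) pairs by the chunk that contains them, then read each chunk once and extract all columns that fall inside it. Below is a minimal sketch of the grouping step in plain Julia; the `chunksize` tuple is an assumed spatial chunking of the first two dimensions, not anything queried from Zarr:

```julia
# Sketch: group requested (i, j) column positions by the chunk that contains
# them, so each chunk needs to be fetched only once. `chunksize` is an assumed
# chunking of the first two (spatial) dimensions.
function group_by_chunk(indices, chunksize)
    groups = Dict{Tuple{Int,Int},Vector{Tuple{Int,Int}}}()
    for (i, j) in indices
        key = (cld(i, chunksize[1]), cld(j, chunksize[2]))  # chunk coordinates
        push!(get!(groups, key, Tuple{Int,Int}[]), (i, j))
    end
    return groups
end

# With 4x4 chunks the three requested columns land in three different chunks:
groups = group_by_chunk([(3, 5), (5, 9), (10, 6)], (4, 4))
```

A reader built on this would then loop over `groups`, load each chunk once, and slice out the columns belonging to it.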
Currently both of these work on DiskArrays because of the AbstractArray
subtype, but do not get the AbstractDiskArray behavior. So one would have to define getindex
, setindex!
and eachchunk
in a similar way as is done in subarrays.jl for ReshapedArray{<:Any,<:Any,<:AbstractDiskArray}
and PermutedDimsArray{<:Any,<:Any,<:AbstractDiskArray}
As already discussed in #2, before registering this package I would like to decide on the org where the repo should be hosted. It was suggested that JuliaArrays would be a good fit, but it is not obvious who to contact, so I am pinging @mbauman, who was involved in the initial discourse discussion, to ask whether this package would meet the criteria to be included in JuliaArrays.
When I want to convert my DiskArray into a normal in-memory Array, the conversion with Array is very slow compared to getindex access.
Is this something that we could fix in DiskArrays or is there a better method to do the conversion?
julia> @time palfreqdata = palfreq[:,:]
0.072899 seconds (157 allocations: 166.524 MiB)
5820×7500 Matrix{Float32}:
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 … 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
⋮ ⋱ ⋮
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
julia> @time Array(palfreq.data)
9.528337 seconds (218.25 M allocations: 14.635 GiB, 21.90% gc time)
5820×7500 Matrix{Float32}:
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 … 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
⋮ ⋱ ⋮
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32 1.0f32
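The slowdown likely comes from the elementwise iteration fallback; copying chunk-by-chunk avoids it. Here is a hedged sketch of the access pattern in plain Julia — `eachchunk` is the real DiskArrays API, but to keep the example self-contained we emulate the chunk grid on an ordinary matrix with an assumed chunk size:

```julia
# Sketch: materialize an array chunk-by-chunk instead of element-by-element.
# Chunk iteration is emulated with Iterators.partition on a plain matrix;
# a real version would loop over eachchunk(a) from DiskArrays.
function chunkwise_collect(a::AbstractMatrix, chunksize::Tuple{Int,Int})
    out = similar(a)
    for jr in Iterators.partition(axes(a, 2), chunksize[2]),
        ir in Iterators.partition(axes(a, 1), chunksize[1])
        # one bulk read per chunk, one bulk write into the output
        out[ir, jr] = a[ir, jr]
    end
    return out
end

a = reshape(1:90, 10, 9)
chunkwise_collect(a, (5, 3)) == a   # true
</imports>
```

The point is simply that each chunk is touched exactly once, in storage order, with a single bulk read per chunk.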
We can likely get around this problem by skipping batchgetindex
for small indexing operations. We may also need to facilitate easy backend customisation of this behaviour.
Originally posted by @Alexander-Barth in #131 (comment)
In the worst-case scenario (reading a time series from spatially chunked data), one would read a single value from each chunk:
Test with NCDatasets 0.12.17:
using BenchmarkTools
using NCDatasets
sz = (1024,1024,64)
ds = NCDataset("/tmp/tmp.nc","c")
defDim(ds,"lon",sz[1])
defDim(ds,"lat",sz[2])
defDim(ds,"time",sz[3])
defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)
ds["data"][:] = randn(sz);
ds["datac"][:] = randn(sz);
@btime ds["data"][1,1,1];
@btime ds["data"][:,:,1];
@btime ds["datac"][1,1,1];
@btime ds["datac"][:,:,1];
output:
julia> @btime ds["data"][1,1,1];
15.069 μs (34 allocations: 1.62 KiB)
julia> @btime ds["data"][:,:,1];
1.014 ms (35 allocations: 8.00 MiB)
julia> @btime ds["datac"][1,1,1];
13.518 μs (34 allocations: 1.62 KiB)
julia> @btime ds["datac"][:,:,1];
470.025 μs (35 allocations: 8.00 MiB)
So reading an entire chunk (if you do not need it) is 67 times slower for uncompressed data and 36 times slower for compressed data. But we read the same data multiple times with @btime.
The netCDF library likely caches the data. Some applications might indeed read the same data multiple times, but this is less typical. If the data is read for the first time, the differences are not as large:
sz = (1024,1024,64)
isfile("/tmp/tmp3.nc") && rm("/tmp/tmp3.nc")
ds = NCDataset("/tmp/tmp3.nc","c")
defDim(ds,"lon",sz[1])
defDim(ds,"lat",sz[2])
defDim(ds,"time",sz[3])
defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)
ds["data"][:] = randn(sz);
ds["datac"][:] = randn(sz);
close(ds)
ds = NCDataset("/tmp/tmp3.nc")
# warm-up
ds["data"][1,1,1];
ds["data"][:,:,2];
@time ds["data"][1,1,3];
@time ds["data"][:,:,4];
@time ds["datac"][1,1,3];
@time ds["datac"][:,:,4];
output
0.001231 seconds (34 allocations: 1.625 KiB)
0.001682 seconds (35 allocations: 8.002 MiB)
0.031734 seconds (34 allocations: 1.625 KiB)
0.038516 seconds (35 allocations: 8.002 MiB)
So a difference of about 20 % in this case.
For:
@time ds["data"][512,512,3];
@time ds["data"][:,:,4];
@time ds["datac"][512,512,3];
@time ds["datac"][:,:,4];
I get (loading the first time)
0.026237 seconds (36 allocations: 1.656 KiB)
0.045077 seconds (35 allocations: 8.002 MiB)
0.033074 seconds (36 allocations: 1.656 KiB)
0.038558 seconds (35 allocations: 8.002 MiB)
If the data is in the libnetcdf chunk cache (using @btime
), I get:
julia> @btime ds["data"][512,512,3];
15.549 μs (36 allocations: 1.66 KiB)
julia> @btime ds["data"][:,:,4];
421.606 μs (35 allocations: 8.00 MiB)
julia> @btime ds["datac"][512,512,3];
15.361 μs (36 allocations: 1.66 KiB)
julia> @btime ds["datac"][:,:,4];
417.527 μs (35 allocations: 8.00 MiB)
Feel free to make other benchmarks with other indices.
Another use case is loading data over the network via OPeNDAP or HTTP byte-range requests, where you certainly do not want to transfer more data than necessary.
It will be more efficient to cache the chunk currently in use when iterating over the data with a for
loop, because each read has a non-negligible overhead (e.g. HDF5.jl).
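A minimal sketch of that chunk-caching idea, in plain Julia: the wrapper remembers the id of the last chunk it read and only goes back to the backend when iteration crosses a chunk boundary. The backend read is simulated by slicing a plain vector; all names here are hypothetical, not DiskArrays API:

```julia
# Sketch: cache the most recently read chunk so sequential element access
# hits the (expensive) backend only once per chunk.
mutable struct ChunkCached{T,A<:AbstractVector{T}}
    data::A          # stands in for the on-disk backend
    chunklen::Int    # assumed chunk length
    cached_id::Int   # id of the chunk currently in the cache (0 = none)
    cache::Vector{T}
end
ChunkCached(data::AbstractVector{T}, chunklen::Int) where {T} =
    ChunkCached{T,typeof(data)}(data, chunklen, 0, T[])

function Base.getindex(c::ChunkCached, i::Int)
    id = cld(i, c.chunklen)
    if id != c.cached_id                      # refill only on a chunk boundary
        lo = (id - 1) * c.chunklen + 1
        hi = min(id * c.chunklen, length(c.data))
        c.cache = c.data[lo:hi]               # simulated expensive chunk read
        c.cached_id = id
    end
    return c.cache[i - (id - 1) * c.chunklen]
end

c = ChunkCached(collect(1:10), 4)
[c[i] for i in 1:10] == collect(1:10)   # true
```

With chunk length 4, iterating over all 10 elements triggers only three backend reads instead of ten.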
@meggart instead of using Flatten.jl, could we have a mechanism for propagating the engine trait between DiskArray layers? It might be more robust.
So they can all just forward to the parent disk object, but e.g BroadcastDiskArray might need to store the trait somewhere.
Hey,
this is probably old news for you, but I just discovered BlockArrays.jl. If I understand correctly, BlockArrays.jl implements a lot of, if not all, the features we need for DiskArrays.jl.
Does this make DiskArrays.jl obsolete?
Sorry if I'm going through the same thought process you already covered long ago, but maybe this can help.
(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
[4c88cf16] Aqua v0.5.5
[3c3547ce] DiskArrays v0.3.6
julia> using DiskArrays, Aqua
julia> Aqua.test_all(DiskArrays)
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
13 ambiguities found
Ambiguity #1
copyto!(dest::AbstractMatrix, Rdest::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}, src::SparseArrays.AbstractSparseMatrixCSC{T}, Rsrc::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}) where T in SparseArrays at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SparseArrays/src/sparsematrix.jl:481
copyto!(dest::DiskArrays.AbstractDiskArray, Rdest::CartesianIndices, src::AbstractArray, Rsrc::CartesianIndices) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:8
Possible fix, define
copyto!(::DiskArrays.AbstractDiskArray{T, 2} where T, ::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}, ::SparseArrays.AbstractSparseMatrixCSC{T}, ::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}) where T
Ambiguity #2
reverse(a::DiskArrays.AbstractDiskArray, dims) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:23
reverse(A::AbstractVector, start::Integer) in Base at array.jl:1803
Possible fix, define
reverse(::DiskArrays.AbstractDiskArray{T, 1} where T, ::Integer)
Ambiguity #3
copyto!(dest::AbstractArray, D::SuiteSparse.CHOLMOD.Dense) in SuiteSparse.CHOLMOD at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SuiteSparse/src/cholmod.jl:830
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5
Possible fix, define
copyto!(::DiskArrays.AbstractDiskArray, ::SuiteSparse.CHOLMOD.Dense)
Ambiguity #4
copyto!(A::AbstractMatrix, B::SparseArrays.AbstractSparseMatrixCSC) in SparseArrays at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SparseArrays/src/sparsematrix.jl:458
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5
Possible fix, define
copyto!(::DiskArrays.AbstractDiskArray{T, 2} where T, ::SparseArrays.AbstractSparseMatrixCSC)
Ambiguity #5
copyto!(dest::AbstractMatrix, src::LinearAlgebra.AbstractQ) in LinearAlgebra at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/LinearAlgebra/src/qr.jl:599
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5
Possible fix, define
copyto!(::DiskArrays.AbstractDiskArray{T, 2} where T, ::LinearAlgebra.AbstractQ)
Ambiguity #6
commonlength(::Tuple{}, b) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:154
commonlength(a, ::Tuple{}) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:155
Possible fix, define
commonlength(::Tuple{}, ::Tuple{})
Ambiguity #7
_commonlength(a1, b1, ::Tuple{}, b) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:157
_commonlength(a1, b1, a, ::Tuple{}) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:158
Possible fix, define
_commonlength(::Any, ::Any, ::Tuple{}, ::Tuple{})
Ambiguity #8
copyto!(dest::AbstractArray, source::DiskArrays.AbstractDiskArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:6
copyto!(dest::PermutedDimsArray, src::AbstractArray) in Base.PermutedDimsArrays at permuteddimsarray.jl:215
Possible fix, define
copyto!(::PermutedDimsArray, ::DiskArrays.AbstractDiskArray)
Ambiguity #9
copyto!(dest::AbstractArray{T}, D::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64} in SuiteSparse.CHOLMOD at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SuiteSparse/src/cholmod.jl:828
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5
Possible fix, define
copyto!(::DiskArrays.AbstractDiskArray{T}, ::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64}
Ambiguity #10
copyto!(A::SparseArrays.SparseVector, B::AbstractVector) in SparseArrays at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SparseArrays/src/sparsevector.jl:501
copyto!(dest::AbstractArray, source::DiskArrays.AbstractDiskArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:6
Possible fix, define
copyto!(::SparseArrays.SparseVector, ::DiskArrays.AbstractDiskArray{T, 1} where T)
Ambiguity #11
copyto!(dest::AbstractArray, source::DiskArrays.AbstractDiskArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:6
copyto!(dest::PermutedDimsArray{T, N}, src::AbstractArray{T, N}) where {T, N} in Base.PermutedDimsArrays at permuteddimsarray.jl:211
Possible fix, define
copyto!(::PermutedDimsArray{T, N}, ::DiskArrays.AbstractDiskArray{T, N}) where {T, N}
Ambiguity #12
_reshape(parent::DiskArrays.AbstractDiskArray, dims::Tuple{Vararg{Int64, var"#164#N"}}) where var"#164#N" in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/reshape.jl:79
_reshape(v::AbstractVector, dims::Tuple{Int64}) in Base at reshapedarray.jl:168
Possible fix, define
_reshape(::DiskArrays.AbstractDiskArray{T, 1} where T, ::Tuple{Int64})
Ambiguity #13
copyto!(dest::AbstractMatrix{T}, D::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64} in SuiteSparse.CHOLMOD at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SuiteSparse/src/cholmod.jl:829
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5
Possible fix, define
copyto!(::DiskArrays.AbstractDiskArray{T, 2}, ::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64}
Method ambiguity: Test Failed at /Users/yeesian/.julia/packages/Aqua/HWLbM/src/ambiguities.jl:117
Expression: success(pipeline(cmd; stdout = stdout, stderr = stderr))
Stacktrace:
[1] macro expansion
@ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
[2] _test_ambiguities(packages::Vector{Base.PkgId}; color::Nothing, exclude::Vector{Any}, detect_ambiguities_options::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Aqua ~/.julia/packages/Aqua/HWLbM/src/ambiguities.jl:117
Test Summary: | Fail Total Time
Method ambiguity | 1 1 1.7s
ERROR: Some tests did not pass: 0 passed, 1 failed, 0 errored, 0 broken.
zarr[end:-1:1, :, :]
not working
There is probably a lastindex implementation missing; it should be added as a new indexing method.
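For context, `zarr[end:-1:1, :, :]` lowers to a call of `lastindex(zarr, 1)`, so a per-dimension `lastindex` defined in terms of `size` is likely all that is missing. A sketch on a toy wrapper type (not the real AbstractDiskArray, so the example stays self-contained):

```julia
# Sketch: `a[end:-1:1, :, :]` needs lastindex(a, d); defining it via size
# is usually sufficient. ToyDiskArray is a stand-in for a disk array type.
struct ToyDiskArray{T,N}
    data::Array{T,N}
end
Base.size(a::ToyDiskArray) = size(a.data)
Base.lastindex(a::ToyDiskArray, d::Int) = size(a)[d]

a = ToyDiskArray(rand(3, 4, 5))
lastindex(a, 1)   # 3
```

DiskArrays may already inherit a suitable default from AbstractArray, so the actual fix could instead be in how the resulting reverse-step range is handled by the indexing machinery.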
zip
is used in array equality, and breaks chunking order when arrays with different chunks are used.
We should at minimum fix it so that the chunks of the first disk array found are used for all others (and for regular arrays). This can be achieved with a RechunkDiskArray
wrapper that simply forces a different chunk pattern. It may be slow in some cases but will at least be correct.
Later we could optimise the combined iteration pattern of all diskarrays passed to zip
.
I noticed that the dimensions get moved around in unintuitive ways
dc = Zarr.zopen(path2zarr)
size(dc["v"])
(834, 834, 84396)
r = [1 4 6 1]
c = [7 8 9 3]
cartind = CartesianIndex.(r,c)
size(dc["v"][cartind, :])
(1, 84396, 4)
cartind = CartesianIndex.(r',c')
size(dc["v"][cartind, :])
(4, 84396, 1)
This is different behavior than for non-DiskArrays:
z = rand(100,100,1000)
cartind = CartesianIndex.(r,c)
size(z[cartind,:])
(1, 4, 1000)
cartind = CartesianIndex.(r',c')
size(z[cartind,:])
(4, 1, 1000)
Hi, I noticed bad performance when using PermutedDimsArray
with ArchGDAL, and the issue is probably located here. See yeesian/ArchGDAL.jl#225, which explains the problem. However, I don't see an easy fix like the one proposed over there, as it runs into a stack overflow for AbstractDiskArrays.
ArchGDAL is currently failing tests on 1.9/nightly with "type Stateful has no field taken".
That seems to be due to the .taken
here:
Lines 53 to 60 in 68c8150
This field was removed in JuliaLang/julia#45924.
This is the full error message from ArchGDAL.
Complex IO: Error During Test at /home/runner/work/ArchGDAL.jl/ArchGDAL.jl/test/test_rasterio.jl:294
Test threw exception
Expression: AG.getband(ds_copy, 1) == a
type Stateful has no field taken
Stacktrace:
[1] getproperty
@ ./Base.jl:37 [inlined]
[2] iterate(a::ArchGDAL.IRasterBand{ComplexF32}, i::Tuple{OffsetArrays.OffsetMatrix{ComplexF32, Matrix{ComplexF32}}, DiskArrays.BlockedIndices{DiskArrays.GridChunks{2}}, Tuple{Base.Iterators.Stateful{DiskArrays.GridChunks{2}, Union{Nothing, Tuple{Any, Tuple{Any, Vararg{Any}}}}, Int64}, Base.Iterators.Stateful{CartesianIndices{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}, Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}, Int64}}})
@ DiskArrays ~/.julia/packages/DiskArrays/VGEpq/src/iterator.jl:53
[3] _zip_iterate_some
@ ./iterators.jl:424 [inlined]
[4] _zip_iterate_all
@ ./iterators.jl:416 [inlined]
[5] iterate
@ ./iterators.jl:407 [inlined]
[6] ==(A::ArchGDAL.IRasterBand{ComplexF32}, B::Matrix{ComplexF32})
@ Base ./abstractarray.jl:2919
[7] eval_test(evaluated::Expr, quoted::Expr, source::LineNumberNode, negate::Bool)
@ Test /opt/hostedtoolcache/julia/nightly/x64/share/julia/stdlib/v1.10/Test/src/Test.jl:320
[8] macro expansion
@ /opt/hostedtoolcache/julia/nightly/x64/share/julia/stdlib/v1.10/Test/src/Test.jl:477 [inlined]
[9] (::var"#391#393"{Matrix{ComplexF32}})(ds_copy::ArchGDAL.Dataset)
@ Main ~/work/ArchGDAL.jl/ArchGDAL.jl/test/test_rasterio.jl:294
It's not directly clear to me what the right fix is. Is that calling length()
on the iterator?
For GridChunks, we could use TiledIteration.jl instead of implementing everything here. In general, I wonder if TiledIteration.jl couldn't become the interface module for all tiled array iteration (if it isn't already).
I have an implementation of iteration using TiledIteration.jl in HDF5Arrays.jl, so I could make a PR, if there's interest.
If we limit ourselves to tiles (GridChunks
in DiskArrays.jl), we can also use tiled indexing to make a lot of the code implementing broadcasting, reduce operations, etc. cleaner. Further, the code could be moved to TiledIteration.jl, so non-DiskArrays can benefit from the tools here, too. See this proposal for tiled indexing
Hi Fabian, I have some issues with the integration of NCDatasets/DiskArrays and I would like to understand a bit better how DiskArrays works independently of NCDatasets. I noticed that in the example in your README.md, RangeArray
is not defined.
julia> a = RangeArray(4,5,1)
ERROR: UndefVarError: RangeArray not defined
Maybe it should be something like a = PseudoDiskArray(randn(4,5,1))
if I understand the example.
Would it be possible to add a ConstructionBase.jl dependency here? (It's tiny.)
The reason is DiskArrays.SubDiskArray
and probably other objects here can't be reconstructed from their fields alone - they need type information as well. ConstructionBase.jl defines constructorof
to handle this. Then @set
from Setfield.jl and reconstruct
from Flatten.jl (and BangBang.jl, Accessors.jl etc) can update their fields with new objects.
My use case is opening GeoData.FileArray with write access through nested DiskArrays.SubDiskArray
and whatever else is wrapping it.
As you don't know the number of wrappers or what they are it's hard to do this manually. Flatten.modify
can just do it automatically and rewrap the whole stack however deep it is. But only if ConstructionBase.constructorof
works for all the wrappers.
The other option is to put it in ConstructionBaseExtras.jl. But using Requires.jl and a wrapper package is a half-baked measure without dependency bounds, so I'd like to ask for a direct dep first.
Currently cat
is very slow, and may even give the wrong answer?
It could be more efficient to allocate the final array and then broadcast into views of it, instead of copying each disk array to memory separately.
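The allocate-then-write strategy can be sketched in plain Julia (ordinary matrices stand in for disk arrays; `cat_via_views` and its signature are illustrative, not a proposed API):

```julia
# Sketch of the proposed cat strategy: allocate the output once, then copy
# each input into the matching view, so no per-input temporary Array is made.
function cat_via_views(arrays::AbstractMatrix...; dims::Int)
    T = promote_type(map(eltype, arrays)...)
    outsize = collect(size(first(arrays)))
    outsize[dims] = sum(size(a, dims) for a in arrays)
    out = Array{T}(undef, outsize...)
    offset = 0
    for a in arrays
        # select the slab of `out` that this input occupies along `dims`
        idx = ntuple(d -> d == dims ? (offset+1:offset+size(a, dims)) : Colon(),
                     ndims(out))
        view(out, idx...) .= a       # bulk copy into place
        offset += size(a, dims)
    end
    return out
end

A, B = ones(2, 3), zeros(2, 3)
cat_via_views(A, B; dims=1) == vcat(A, B)   # true
```

For actual disk arrays the broadcast into each view could further proceed chunk-by-chunk, which keeps peak memory at one chunk per input.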
Iteration seems to be broken with chunking, so when fallbacks are used by base methods you just get the wrong answer, without an error. So e.g. Array(diskarray)
currently uses iteration, and can give the wrong result:
julia> a = collect(reshape(1:90,10,9))
10×9 Matrix{Int64}:
1 11 21 31 41 51 61 71 81
2 12 22 32 42 52 62 72 82
3 13 23 33 43 53 63 73 83
4 14 24 34 44 54 64 74 84
5 15 25 35 45 55 65 75 85
6 16 26 36 46 56 66 76 86
7 17 27 37 47 57 67 77 87
8 18 28 38 48 58 68 78 88
9 19 29 39 49 59 69 79 89
10 20 30 40 50 60 70 80 90
julia> a_disk = _DiskArray(a, chunksize=(5,3))
Disk Array with size 10 x 9
julia> Array(a_disk)
10×9 Matrix{Int64}:
1 21 16 31 51 46 61 81 76
2 22 17 32 52 47 62 82 77
3 23 18 33 53 48 63 83 78
4 24 19 34 54 49 64 84 79
5 25 20 35 55 50 65 85 80
11 6 26 41 36 56 71 66 86
12 7 27 42 37 57 72 67 87
13 8 28 43 38 58 73 68 88
14 9 29 44 39 59 74 69 89
15 10 30 45 40 60 75 70 90
#37 may be a partial solution to this. But maybe also good to use other methods as much as possible so the iteration fallback isn't used.
It would be good to dispatch on the DiskArray implementation for the lazy reverse when one specifies the dimension by a keyword argument.
It dispatches directly if the dimension is the second positional argument, but the keyword argument is dispatched into the default Base implementation.
a::DiskArray
julia> reverse(a, dims=1) # Non lazy
10000×10000 Matrix{UInt8}:
0x00 0x00 … 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 … 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
⋮ ⋱
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 … 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
julia> reverse(a, 1) # Lazy
Disk Array with size 10000 x 10000
Should it be nothing
, or just the entire array, or should it throw a method error? Or something else?
I think the general indexing functionality for this package is there now, so one could use this package as a dependency for NetCDF, HDF5 or Zarr without losing functionality. Here we can collect a list of the packages that might profit from DiskArrays and link to the respective PRs:
@visr @rafaqz @Alexander-Barth any suggestions and feedback would be highly appreciated
It seems possible to add a reducing function application wrapper that will take e.g. the mean/sum etc along an axis lazily, similar to how lazy broadcasting works.
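A minimal sketch of such a wrapper: store the reducing function, the parent, and the reduced dimension, and only compute on materialization. All names here (`LazyReduce`, `lazysum`, `materialize`) are hypothetical, not DiskArrays API:

```julia
# Sketch of a lazy reduction wrapper, analogous to lazy broadcasting.
struct LazyReduce{F,A}
    f::F        # reducing function, e.g. sum or Statistics.mean
    parent::A   # the wrapped (disk) array
    dims::Int   # dimension being reduced
end
lazysum(a; dims) = LazyReduce(sum, a, dims)

# Materializing applies the reduction; a real implementation would instead
# walk eachchunk(parent) and combine per-chunk partial results, so each
# chunk is read only once.
materialize(r::LazyReduce) = dropdims(r.f(r.parent; dims=r.dims); dims=r.dims)

a = reshape(1:6, 2, 3)
materialize(lazysum(a; dims=2))   # [9, 12]
```

Because the wrapper is itself array-like, it could in principle be wrapped again or broadcast over before anything is read from disk.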