diskarrays.jl's People

Contributors

alexander-barth, felixcremer, gdkrmr, github-actions[bot], juliatagbot, meggart, mtfishman, rafaqz, tcarion, visr, vtjnash, wkearn

diskarrays.jl's Issues

Very slow performance indexing into netcdf with a vector

Indexing with a vector can be incredibly slow with the current implementation of batchgetindex.

From this comment by @Alexander-Barth
#131 (comment)

NetCDF.jl + DiskArray

julia> using NetCDF

julia> v = NetCDF.open("http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0","water_temp");

julia> size(v)
(4500, 4251, 40, 14969)

julia> v1 = @time v[1000,1000,1,1:10];
  1.478774 seconds (3.71 M allocations: 250.669 MiB, 3.42% gc time, 59.32% compilation time)

julia> v1 = @time v[1000,1000,1,1:10];
  0.467219 seconds (45 allocations: 1.781 KiB)

julia> v1 = @time v[1000,1000,1,collect(1:10)];
739.847348 seconds (1.32 M allocations: 819.944 MiB, 0.03% gc time, 0.10% compilation time)
From this comment by @Alexander-Barth 
https://github.com/meggart/DiskArrays.jl/issues/131#issuecomment-1902759762

NCDatasets 0.12 without DiskArray

julia> using NCDatasets

julia> ds = NCDataset("http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0"); v = ds["water_temp"];

julia> v1 = @time v[1000,1000,1,1:10];
  0.726571 seconds (134.15 k allocations: 8.849 MiB, 14.90% compilation time)

julia> v1 = @time v[1000,1000,1,1:10];
  0.218793 seconds (62 allocations: 1.984 KiB)

julia> v1 = @time v[1000,1000,1,collect(1:10)];
  0.879592 seconds (982.55 k allocations: 66.829 MiB, 9.49% gc time, 65.79% compilation time)

So 12 minutes versus 1 second for this example. Yes, I know that I can bypass DiskArrays by providing my own batchgetindex. But I don't think that DiskArrays should make the current assumptions if it aims to be generic (if this is the goal).

Can we have the current assumptions (i.e. that reading a whole chunk is about as fast as reading a subset of a chunk) documented?

Concerning the API, another thing I am wondering is whether we need a function named batchgetindex at all. Why not have getindex simply pass the indices to DiskArrays.readblock! and let DiskArrays.readblock! figure out how best to load the data? DiskArrays.readblock! is already specific to the storage format.
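A minimal sketch of the suggested call path, purely for illustration (the current DiskArrays.readblock! only accepts AbstractUnitRange indices, so this assumes a backend extended to accept vectors; getindex_via_readblock is a hypothetical name):

using DiskArrays

# Hypothetical sketch: forward the requested indices directly to the backend and let
# it choose the most efficient access pattern, instead of batching per chunk.
function getindex_via_readblock(a::AbstractDiskArray, i...)
    out = Array{eltype(a)}(undef, map(length, i)...)
    DiskArrays.readblock!(a, out, i...)  # assumes the backend accepts vector indices
    return out
end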

Wrong behavior of default Array constructor

I do not have a reproducible example at hand, but there is a problem with using the default Array constructor on chunked AbstractDiskArrays. Adding the following method fixed the code:

function Array(a::AbstractDiskArray)
    aout = similar(a)
    aout .= a
    aout
end

but it is worth investigating and adding some unit tests when fixing this.
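A minimal unit-test sketch along those lines, assuming the _DiskArray test wrapper defined further down in this document and the Test standard library:

using Test, DiskArrays

# Assumes the _DiskArray test wrapper (defined in the broadcast issue below) is available.
@testset "Array constructor on a chunked disk array" begin
    data = reshape(collect(1.0:12.0), 3, 4)
    a_disk = _DiskArray(data, chunksize=(2, 2))
    @test Array(a_disk) == data
    @test Array(a_disk) isa Matrix{Float64}
end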

fix generators

We could define a DiskGenerator object with a method for collect that properly uses eachindex to populate the in-memory array in the right order.
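A rough sketch of what such a wrapper could look like (names and details are illustrative, not the eventual implementation):

using DiskArrays

# Illustrative sketch of a DiskGenerator that collects chunk by chunk rather than
# element by element.
struct DiskGenerator{F,A}
    f::F
    a::A
end

function Base.collect(g::DiskGenerator)
    T = Base.promote_op(g.f, eltype(g.a))
    out = similar(Array{T}, axes(g.a))
    for c in DiskArrays.eachchunk(g.a)  # one readblock! per chunk, in storage order
        out[c...] .= g.f.(g.a[c...])
    end
    return out
end

# A Base.Generator over a disk array could then be rerouted to this wrapper before
# collecting, e.g. collect(Base.Generator(f, a)) -> collect(DiskGenerator(f, a)).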

1D variable with unlimited dimensions

While debugging an issue in NCDatasets using DiskArrays v0.3.20 ( Alexander-Barth/NCDatasets.jl#230 )
I found some behavior of NetCDF.jl with DiskArrays for variables with unlimited dimensions that I could not understand.
While the following code with a 2D array (one unlimited) works without error:

import NetCDF
fname2 = tempname()
time_dim = NetCDF.NcDim("time",Int[],unlimited=true);
lon_dim = NetCDF.NcDim("lon",1:3);
v = NetCDF.NcVar("obs",[lon_dim,time_dim]);
NetCDF.create(fname2,v) do nc
   nc["obs"][:,1:3] = rand(3,3)
end;

The following code with a single unlimited dimension fails:

import NetCDF
var = collect(1:4)
fname2 = tempname()
time_dim = NetCDF.NcDim("time",Int[],unlimited=true);
v = NetCDF.NcVar("var",[time_dim]);
NetCDF.create(fname2,v) do nc
    nc["var"][1:4] = var
end;

with the following error message:

ERROR: BoundsError: attempt to access 0-element CartesianIndices{1, Tuple{Base.OneTo{Int64}}} at index [1:4]                                                       
Stacktrace:
  [1] throw_boundserror(A::CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, I::Tuple{UnitRange{Int64}})                                                                      
    @ Base ./abstractarray.jl:744
  [2] checkbounds
    @ ./abstractarray.jl:709 [inlined]
  [3] getindex
    @ ./multidimensional.jl:358 [inlined]
  [4] view
    @ ./multidimensional.jl:372 [inlined]
  [5] interpret_indices_disk
    @ ~/.julia/packages/DiskArrays/VLHvZ/src/diskarray.jl:137 [inlined]
  [6] setindex_disk!(a::NcVar{Float64, 1, 6}, v::UnitRange{Int64}, i::UnitRange{Int64})
    @ DiskArrays ~/.julia/packages/DiskArrays/VLHvZ/src/diskarray.jl:56
  [7] setindex!(a::NcVar{Float64, 1, 6}, v::UnitRange{Int64}, i::UnitRange{Int64})
    @ DiskArrays ~/.julia/packages/DiskArrays/VLHvZ/src/diskarray.jl:222
  [8] (::var"#15#16")(nc::NcFile)
    @ Main ./REPL[54]:2
  [9] create(::var"#15#16", ::String, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})                                     
    @ NetCDF ~/.julia/packages/NetCDF/7hOe9/src/NetCDF.jl:1022
 [10] create(::Function, ::String, ::NcVar{Float64, 1, 6})
    @ NetCDF ~/.julia/packages/NetCDF/7hOe9/src/NetCDF.jl:1019
 [11] top-level scope
    @ REPL[54]:1

NetCDF.jl and the equivalent NCDatasets.jl code both fail at the same line:

https://github.com/meggart/DiskArrays.jl/blob/v0.3.20/src/diskarray.jl#L137

Ref:
#53

`view` with an AbstractRange (not AbstractUnitRange) is broken

This is because subsetchunks is specifically only written for reverse

function subsetchunks(r::RegularChunks, subs::AbstractRange)
    # This is a method only to make "reverse" work and should error for all other cases
    if step(subs) == -1 && first(subs) == r.s && last(subs) == 1
        lastlen = length(last(r))
        newoffset = r.cs - lastlen
        return RegularChunks(r.cs, newoffset, r.s)
    end
end

Should we just subset it as if there were no step? Will that work?

Edit: it doesn't.

Maybe the batch PR can work for this as well? Views like this are very useful for reducing the size of (very) large rasters for plotting. Rasters.jl was using this prior to 0.3.

Chunked broadcast with objects of smaller dimensions is broken.

Using the setup code from the tests (below), this broadcast between an array chunked as (2, 2) and a vector seems broken:

julia> a_disk = _DiskArray([1 10; 2 20; 3 30; 4 40], chunksize=(2, 2))
Disk Array with size 4 x 2

julia> a_disk .* [1, 2, 3, 4] |> collect
4×2 Matrix{Int64}:
  1    9
  4   16
 10   90
 40  160

julia> [1 10; 2 20; 3 30; 4 40] .* [1, 2, 3, 4]
4×2 Matrix{Int64}:
  1   10
  4   40
  9   90
 16  160

We get the chunks as columns. The same problem occurs for a Tuple; I found this while trying to add Tuple support in #39.

Setup code:

using DiskArrays
using DiskArrays: ReshapedDiskArray, PermutedDiskArray
struct _DiskArray{T,N,A<:AbstractArray{T,N}} <: AbstractDiskArray{T,N}
  getindex_count::Ref{Int}
  setindex_count::Ref{Int}
  parent::A
  chunksize::NTuple{N,Int}
end
_DiskArray(a;chunksize=size(a)) = _DiskArray(Ref(0),Ref(0),a,chunksize)
Base.size(a::_DiskArray) = size(a.parent)
DiskArrays.haschunks(::_DiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(a::_DiskArray) = DiskArrays.GridChunks(a,a.chunksize)
getindex_count(a::_DiskArray) = a.getindex_count[]
setindex_count(a::_DiskArray) = a.setindex_count[]
trueparent(a::_DiskArray) = a.parent
getindex_count(a::ReshapedDiskArray) = getindex_count(a.parent)
setindex_count(a::ReshapedDiskArray) = setindex_count(a.parent)
trueparent(a::ReshapedDiskArray) = trueparent(a.parent)
getindex_count(a::PermutedDiskArray) = getindex_count(a.a.parent)
setindex_count(a::PermutedDiskArray) = setindex_count(a.a.parent)
trueparent(a::PermutedDiskArray{T,N,<:PermutedDimsArray{T,N,perm,iperm}}) where {T,N,perm,iperm} = permutedims(trueparent(a.a.parent),perm)
function DiskArrays.readblock!(a::_DiskArray,aout,i::AbstractUnitRange...)
  ndims(a) == length(i) || error("Number of indices is not correct")
  all(r->isa(r,AbstractUnitRange),i) || error("Not all indices are unit ranges")
  #println("reading from indices ", join(string.(i)," "))
  a.getindex_count[] += 1
  aout .= a.parent[i...]
end
function DiskArrays.writeblock!(a::_DiskArray,v,i::AbstractUnitRange...)
  ndims(a) == length(i) || error("Number of indices is not correct")
  all(r->isa(r,AbstractUnitRange),i) || error("Not all indices are unit ranges")
  #println("Writing to indices ", join(string.(i)," "))
  a.setindex_count[] += 1
  view(a.parent,i...) .= v
end

Add Initialisation, Base.resize!, Base.append! and concatenation to interface

I had some ideas for possible additions:

It might be cool for something like arr = SomeDiskArray{Float64}(file_handle, undef, 100, 10) to be part of the interface. This would make it possible to treat a DiskArray almost exactly like an Array. Functions like map would "just" work. Maybe even go so far as to allow arr = SomeDiskArray{Float64}(undef, 100, 10), where the file object is automatically created in the current working directory with some temp name.

If a DiskArray implements the resize! method, we can have an automatic append! defined for any DiskArray.
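A sketch of that automatic append!, assuming a 1-D disk array whose backend implements Base.resize! (illustrative only):

using DiskArrays

# Illustrative sketch: grow the on-disk array, then write the new values into the tail.
function Base.append!(a::AbstractDiskArray{T,1}, values::AbstractVector) where {T}
    oldlen = length(a)
    resize!(a, oldlen + length(values))              # assumes the backend supports this
    a[(oldlen + 1):(oldlen + length(values))] = values
    return a
end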

For concatenation, we can define a cat!(out, A...; dims=dims) function. If out can be resized, it can be any size; if it cannot, the size of out must be the sum of the sizes of all A. Maybe use a trait to specify whether or not a DiskArray can be resized. If we have initialisation defined, we can even have cat work like it does for normal arrays: it would just initialise a new array and write into it.

Batch getindex is not typesafe

Batch getindex with batches of indices is not type-stable at the moment:

julia> @code_warntype c.data[[(1, 1), (1, 2)], :, :]
MethodInstance for getindex(::DiskArrayTools.DiskArrayStack{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 3}, 1}, ::Vector{Tuple{Int64, Int64}}, ::Colon, ::Colon)
  from getindex(a::DiskArrays.AbstractDiskArray, i...) in DiskArrays at /home/gkraemer/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:174
Arguments
  #self#::Core.Const(getindex)
  a::DiskArrayTools.DiskArrayStack{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 3}, 1}
  i::Tuple{Vector{Tuple{Int64, Int64}}, Colon, Colon}
Body::Any
1 ─ %1 = Core.tuple(a)::Tuple{DiskArrayTools.DiskArrayStack{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 3}, 1}}
│   %2 = Core._apply_iterate(Base.iterate, DiskArrays.getindex_disk, %1, i)::Any
└──      return %2

optimization for var[:,:,[1,2]], wrong result with indexing with two vectors and the type-stability issue

In NCDatasets 0.12.17 I had an optimization converting var[:,:,[1,2]] into var[:,:,1:2], and var[:,:,[1,2,10,11]] into two load operations, v[:,:,1:2] and v[:,:,10:11]. This was implemented here:

https://github.com/Alexander-Barth/NCDatasets.jl/blob/v0.12.17/src/cfvariable.jl#L359-L436

Now a call to v[:,:,[1,2]] loads every element individually (which is quite slow).

In fact, the relevant code is still in NCDatasets 0.13 but is no longer called.

Maybe this functionality should go into DiskArrays?
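The core of that optimisation is grouping a sorted index vector into contiguous ranges so that each range becomes a single read. A standalone sketch of that step (not the NCDatasets code itself):

# Sketch only: split a sorted vector of indices into contiguous UnitRanges, e.g.
# [1, 2, 10, 11] -> [1:2, 10:11]; each range can then be served by one readblock! call.
function contiguous_ranges(inds::AbstractVector{<:Integer})
    isempty(inds) && return UnitRange{Int}[]
    ranges = UnitRange{Int}[]
    start = prev = first(inds)
    for i in Iterators.drop(inds, 1)
        if i == prev + 1
            prev = i
        else
            push!(ranges, start:prev)
            start = prev = i
        end
    end
    push!(ranges, start:prev)
    return ranges
end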

Chunks do not align in dimension 1

Can we work around this? This being an error is a problem for using chunked broadcast as a default.

I'm not totally across the underlying mechanisms yet, but I can imagine 2 possibilities:

  • force chunk alignment by loading multiple chunks to a temporary chunk of the right size
  • load the whole array to memory and ignore chunking, especially if the array is small

Then the error could be a warning.

Changes to iteration over an `Unchunked` SubDiskArray?

Equality of an array and a view of a DiskArray-based Raster broke with 0.3 when the array is Unchunked:

using Rasters
url = "https://download.osgeo.org/geotiff/samples/gdal_eg/cea.tif"
path = download(url)
A = Raster(path)
view(A, 1, 2:3, 1) == [0x00, 0x6b]

Gives the error:

ERROR: MethodError: Cannot `convert` an object of type Tuple{UnitRange{Int64}} to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at /opt/julia/share/julia/base/twiceprecision.jl:262
  convert(::Type{T}, ::AbstractChar) where T<:Number at /opt/julia/share/julia/base/char.jl:185
  convert(::Type{T}, ::CartesianIndex{1}) where T<:Number at /opt/julia/share/julia/base/multidimensional.jl:136
  ...

Stacktrace:
  [1] DiskArrays.RegularChunks(cs::Tuple{UnitRange{Int64}}, offset::Int64, s::Int64)
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:17
  [2] (::DiskArrays.var"#34#37")(s::Int64, cs::Tuple{UnitRange{Int64}}, of::Int64)
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:101
  [3] (::Base.var"#4#5"{DiskArrays.var"#34#37"})(a::Tuple{Int64, Tuple{UnitRange{Int64}}, Int64})
    @ Base ./generator.jl:36
  [4] iterate
    @ ./generator.jl:47 [inlined]
  [5] collect(itr::Base.Generator{Base.Iterators.Zip{Tuple{Tuple{Int64}, DiskArrays.GridChunks{1}, Tuple{Int64}}}, Base.var"#4#5"{DiskArrays.var"#34#37"}})
    @ Base ./array.jl:724
  [6] map
    @ ./abstractarray.jl:2948 [inlined]
  [7] DiskArrays.GridChunks(a::Tuple{Int64}, chunksize::DiskArrays.GridChunks{1}; offset::Tuple{Int64})
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:100
  [8] #GridChunks#28
    @ ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:98 [inlined]
  [9] GridChunks
    @ ~/.julia/packages/DiskArrays/7xJDq/src/chunks.jl:98 [inlined]
 [10] eachchunk_view(#unused#::DiskArrays.Unchunked, a::SubArray{UInt8, 1, FileArray{GDALfile, UInt8, 3, Nothing, DiskArrays.GridChunks{3}, DiskArrays.Unchunked}, Tuple{Int64, UnitRange{Int64}, Int64}, false})
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/subarrays.jl:35
 [11] eachchunk(a::DiskArrays.SubDiskArray{UInt8, 1})
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/subarrays.jl:26
 [12] iterate(a::DiskArrays.SubDiskArray{UInt8, 1})
    @ DiskArrays ~/.julia/packages/DiskArrays/7xJDq/src/iterator.jl:4
 [13] iterate
    @ ~/.julia/dev/DimensionalData/src/array/array.jl:66 [inlined]
 [14] _zip_iterate_some
    @ ./iterators.jl:358 [inlined]
 [15] _zip_iterate_all
    @ ./iterators.jl:350 [inlined]
 [16] iterate
    @ ./iterators.jl:340 [inlined]
 [17] ==(A::Raster{UInt8, 1, Tuple{Y{Projected{Float64, LinRange{Float64, Int64}, ReverseOrdered, Regular{Float64}, Intervals{Start}, Metadata{GDALfile, Dict{Any, Any}}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing, Y{Colon}}}}, Tuple{X{Projected{Float64, LinRange{Float64, Int64}, ForwardOrdered, Regular{Float64}, Intervals{Start}, Metadata{GDALfile, Dict{Any, Any}}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing, X{Colon}}}, Band{Categorical{Int64, UnitRange{Int64}, ForwardOrdered, NoMetadata}}}, DiskArrays.SubDiskArray{UInt8, 1}, Symbol, Metadata{GDALfile, Dict{Symbol, Any}}, Nothing}, B::Vector{UInt8})
    @ Base ./abstractarray.jl:2539
 [18] top-level scope
    @ REPL[125]:1

@meggart if you know what is going on with this stacktrace that would be helpful

Implement chunked `foreach`/non allocating chunked iteration.

As pointed out in #15, broadcast can do pretty much everything. But one issue is that you have to call collect and allocate an array.

In GeoData.jl I'm using broadcast to iterate over arrays, e.g. to calculate the ranges of non-nodata values for a trim method, so that the operation is chunked. But I still need to call collect on the broadcast object, which allocates.

It would be good if there was a way to iterate over the object with chunking that didn't allocate. Maybe that's possible already?
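A rough sketch of what this could look like (foreach_chunked is a hypothetical name, not an existing DiskArrays function); it only ever holds one chunk in memory at a time:

using DiskArrays

# Hypothetical sketch: visit every element chunk by chunk without materialising the
# whole array; only one chunk-sized buffer is allocated per outer iteration.
function foreach_chunked(f, a::AbstractDiskArray)
    for c in eachchunk(a)   # tuples of index ranges covering the array
        block = a[c...]     # one readblock! per chunk
        foreach(f, block)
    end
    return nothing
end

# Usage example: count values above a threshold without loading the full array.
# n = Ref(0); foreach_chunked(x -> x > 0.5 && (n[] += 1), a)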

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

unlimited dimensions in NetCDF

As a follow-up to the discussion (Alexander-Barth/NCDatasets.jl#79), I would like to know if unlimited dimensions (i.e. growable) could be supported by DiskArrays. I hope it is OK for you that I make a separate issue here. It is easier for me to track what is already done and what still needs to be discussed.

The last proposition by Fabian has been:

Maybe it would help if I added a keyword argument checkbounds=false to the DiskArrays setindex implementation. So when you call setindex! in your DimensionallArrays code you can explicitly enforce bounds checking. In addition we can still discuss what the default behavior for any DiskArray should be...

I think this would be a good idea.

One could also annotate the type DiskArray with an additional parameter whether its size is fixed or not:

using Base: setindex!, size

struct FixedSize; end
struct MyArray{T,N,TFS} <: AbstractArray{T,N}; end

Base.size(A::MyArray) = (10,)

function fixedsize(::MyArray{T,N,TFS}) where {T,N,TFS}
    return TFS == FixedSize
end

function Base.setindex!(A::MyArray,value,i::Int)
    if fixedsize(A) && (i > 10)
        error("you can't do that: the array has a fixed size")
    end
    println("ok ",i)
end

# growable array
A = MyArray{Int,1,Nothing}()
A[5] = 1
A[15] = 1

# fixed-size array
Af = MyArray{Int,1,FixedSize}()
Af[5] = 1
Af[15] = 1 # error

It is decided during creation of a DiskArray if the array can grow or not, and all subsequent calls can use A[index] = value rather than setindex!(A,value,index,checkbounds=true/false). The compiler should be able to specialize for fixed-size and growable arrays (there could even be two definitions of setindex! for the two cases).

CC @rafaqz, @Balinus, @visr

`BoundsError` when extra-indexing with `AbstractVector`

In a regular Julia array, it is possible to index an N-dimensional array with more than N indices (see the documentation). When using an AbstractVector for the additional indices, this fails with a DiskArray, which is inconsistent with the behaviour of Julia arrays:

Julia arrays:

julia> A = [8,6,7];
julia> A[2, [1]] # or A[2, 1:1]
1-element Vector{Int64}:
 6

_DiskArray from the tests:

julia> DA = _DiskArray(A);
julia> DA[2, [1]] # or DA[2, 1:1]
ERROR: BoundsError: attempt to access 3-element _DiskArray{Int64, 1, Vector{Int64}} at index [2, [1]]
Stacktrace:
 [1] #7
   @ ~/.julia/dev/DiskArrays/src/diskarray.jl:92 [inlined]
 [2] foreach
   @ ./abstractarray.jl:2712 [inlined]
 [3] interpret_indices_disk(A::_DiskArray{Int64, 1, Vector{Int64}}, r::Tuple{Int64, Vector{Int64}})
   @ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:91
 [4] getindex_disk(::_DiskArray{Int64, 1, Vector{Int64}}, ::Int64, ::Vararg{Any})
   @ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:29
 [5] getindex(::_DiskArray{Int64, 1, Vector{Int64}}, ::Int64, ::Vector{Int64})
   @ Main ~/.julia/dev/DiskArrays/src/diskarray.jl:178
 [6] top-level scope
   @ REPL[56]:1

Note that there's no issue when indexing with an integer (DA[2, 1] works fine)

This issue originated from here

cc @rafaqz

common_chunks fails in broadcast with mixed ndims

Failing here for a broadcast between 2- and 3-dimensional arrays:

https://github.com/meggart/DiskArrays.jl/blob/master/src/ops.jl#L60

ERROR: BoundsError
Stacktrace:
  [1] getindex
    @ ./number.jl:96 [inlined]
  [2] #112
    @ ~/.julia/dev/DiskArrays/src/ops.jl:61 [inlined]
  [3] map(f::DiskArrays.var"#112#120", t::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}, Int64})
    @ Base ./tuple.jl:215
  [4] common_chunks(::Tuple{Int64, Int64, Int64}, ::GeoData.FileArray{GeoData.GDALfile, Float32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, ::Vararg{Any, N} where N)
    @ DiskArrays ~/.julia/dev/DiskArrays/src/ops.jl:61
  [5] eachchunk(a::DiskArrays.BroadcastDiskArray{Bool, 3, Broadcasted{DiskArrays.ChunkStyle{3}, Nothing, typeof(==), Tuple{GeoData.FileArray{GeoData.GDALfile, Float32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, GeoData.FileArray{SMAPfile, Float32, 2, String, Tuple{Int64, Int64}, Symbol, Nothing}}}})
    @ DiskArrays ~/.julia/dev/DiskArrays/src/ops.jl:22
  [6] iterate(a::DiskArrays.BroadcastDiskArray{Bool, 3, Broadcasted{DiskArrays.ChunkStyle{3}, Nothing, typeof(==), Tuple{GeoData.FileArray{GeoData.GDALfile, Float32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, GeoData.FileArray{SMAPfile, Float32, 2, String, Tuple{Int64, Int64}, Symbol, Nothing}}}})
    @ DiskArrays ~/.julia/dev/DiskArrays/src/iterator.jl:4
  [7] _all

Because at that point:

tt = ((3856, 0), (1624, 0), 1)

I'm not quite sure what the solution is.

DiskArrays interface for files that are not yet open

I'm implementing a DiskArrays.jl interface in GeoData.jl. But it doesn't keep files open - they are opened when data is required, and closed again afterwards.

The issue is passing opened files to DiskArrays to apply chunked read/write e.g. during broadcast, and closing the file again afterwards. Using closures or some kind of callback function that runs after all chunks are processed should help here - but I'm not sure exactly where that should go.
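A rough sketch of one pattern under these constraints (RawFileArray is a hypothetical type over a raw binary file of Float64s, not GeoData.jl's actual design): the file is opened on demand inside readblock! and closed when the do-block exits. What the issue is really asking for is a hook so the open could instead wrap the whole chunk loop of a broadcast, rather than happening once per block as here.

using DiskArrays

# Hypothetical sketch: no file handle is kept between reads.
struct RawFileArray{N} <: AbstractDiskArray{Float64,N}
    filename::String
    size::NTuple{N,Int}
end
Base.size(a::RawFileArray) = a.size
DiskArrays.haschunks(::RawFileArray) = DiskArrays.Unchunked()

function DiskArrays.readblock!(a::RawFileArray, aout, i::AbstractUnitRange...)
    open(a.filename, "r") do io                      # open only when data is required
        for (n, c) in enumerate(CartesianIndices(i))
            offset = LinearIndices(a.size)[c] - 1    # element position in the file
            seek(io, offset * sizeof(Float64))
            aout[n] = read(io, Float64)              # a real backend would read whole runs
        end
    end                                              # the file is closed here
    return aout
end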

Linear indexing?

Linear indexing doesn't seem to work with an AbstractDiskArray. Is that intentional or do I need to add something to the interface? Here, FileArray <: AbstractDiskArray.

julia> A[1]
ERROR: BoundsError: attempt to access 267×178×1 GeoData.FileArray{GeoData.GDALfile, Int32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}} at index [1]
Stacktrace:
 [1] interpret_indices_disk
   @ ~/.julia/packages/DiskArrays/QATDH/src/DiskArrays.jl:118 [inlined]
 [2] getindex_disk(a::GeoData.FileArray{GeoData.GDALfile, Int32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, i::Int64)
   @ DiskArrays ~/.julia/packages/DiskArrays/QATDH/src/DiskArrays.jl:58
 [3] getindex
   @ ~/.julia/packages/DiskArrays/QATDH/src/DiskArrays.jl:186 [inlined]
 [4] getindex(A::GeoArray{Int32, 3, Tuple{X{LinRange{Float64}, Projected{Ordered{ForwardIndex, ForwardArray, ForwardRelation}, Regular{Float64}, Intervals{Start}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing}, Metadata{GeoData.GDALfile, Dict{Any, Any}}}, Y{LinRange{Float64}, Projected{Ordered{ReverseIndex, ReverseArray, ForwardRelation}, Regular{Float64}, Intervals{Start}, WellKnownText{GeoFormatTypes.CRS, String}, Nothing}, Metadata{GeoData.GDALfile, Dict{Any, Any}}}, Band{UnitRange{Int64}, Categorical{Ordered{ForwardIndex, ForwardArray, ForwardRelation}}, NoMetadata}}, Tuple{}, GeoData.FileArray{GeoData.GDALfile, Int32, 3, String, Tuple{Int64, Int64, Int64}, Nothing, Tuple{Int64, Int64, Int64}}, Symbol, Metadata{GeoData.GDALfile, Dict{Symbol, Any}}, Float64}, i::Int64)
   @ DimensionalData ~/.julia/dev/DimensionalData/src/indexing.jl:22
 [5] top-level scope
   @ REPL[56]:1

`cat` should work for iterators of Ints

According to the documentation of cat it should be possible to call it with an iterable of dimensions so that "the positions along these dimensions are increased simultaneously for each input array, filling with zero elsewhere. This allows one to construct block-diagonal matrices as cat(matrices...; dims=(1,2)), and their higher-dimensional analogues."

We should try to adhere to this interface and I am wondering whether we would need to have a new type ConcatBlockDiskArray to overload the writeblock and readblock functions.
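For reference, this is the in-memory behaviour that a DiskArray-backed cat would need to reproduce:

julia> cat([1 2; 3 4], [5 6; 7 8]; dims=(1, 2))
4×4 Matrix{Int64}:
 1  2  0  0
 3  4  0  0
 0  0  5  6
 0  0  7  8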

This part of the cat interface is used for example in DimensionalData. Therefore we can't currently use cat on a DiskArray-backed AbstractDimArray.

eachchunk fall-back for transposed or offset arrays?

This is a low-priority issue for me, but when DiskArrays is used with in-memory arrays, it might be useful to adapt the default chunk size for transposed arrays:

julia> DiskArrays.default_chunk_size[] = 1
1

julia> eachchunk(ones(4000,4000))
1×130 DiskArrays.GridChunks{2, Tuple{DiskArrays.RegularChunks, DiskArrays.RegularChunks}}:
 (1:4000, 1:31)  (1:4000, 32:62)  (1:4000, 63:93)  …  (1:4000, 4000:4000)

julia> eachchunk(ones(4000,4000)')
1×130 DiskArrays.GridChunks{2, Tuple{DiskArrays.RegularChunks, DiskArrays.RegularChunks}}:
 (1:4000, 1:31)  (1:4000, 32:62)  (1:4000, 63:93)  …  (1:4000, 4000:4000)

For a transposed array I would expect the first chunk to be (1:31, 1:4000). I guess one can use stride to check this.

I am not sure if OffsetArrays are supposed to be supported, but currently all indices from eachchunk are one-based:

using DiskArrays, OffsetArrays
DiskArrays.default_chunk_size[] = 1
OA = OffsetArray(ones(4000,4000),(-1,-1));
c = eachchunk(OA);
OA[last(c)...]
# ERROR: BoundsError: attempt to access 4000×4000 OffsetArray(::Matrix{Float64}, 0:3999, 0:3999) with eltype Float64 with indices 0:3999×0:3999 at index [1:4000, 4000:4000]

None of these issues affect me currently in any production code, but I thought it might be useful to notify you about them.

I am using DiskArrays v0.3.22

About this package

Hi @rafaqz ,

as discussed on Discourse, you showed some interest in defining something like an AbstractDiskArray. I am not sure if you have started something like this already, but since I am trying to adapt the indexing behavior of NetCDF.jl and Zarr.jl and want to avoid double work, I started this package, which might be merged with your efforts. The basic idea is that all a user has to provide is a function that reads a cartesian subrange from an array that is on disk, and this package takes care of the remaining index details, translating trailing and missing singleton dimensions etc.

Currently this supports getindex only, but you might get an idea of what my vision for such a package could be. Your opinion would be highly appreciated.

Next I would continue to remove the getindex implementations from Zarr and NetCDF and instead use this interface here, so that any indexing improvement would help both packages.

Index error when using mapCube and Distributed

Here is a minimal working example:

using Pkg

Pkg.activate("/Net/Groups/BGI/people/dpabon/nfdi4earth_oemc")

using Distributed
println(myid()," ", DEPOT_PATH)

addprocs(10)
@everywhere begin
using Pkg

Pkg.activate("/Net/Groups/BGI/people/dpabon/nfdi4earth_oemc")
 
using YAXArrays
using Zarr
function simple_function(out1, cube_in1, cube_in2)
    #=
    out1 .= 1
    =#
end
end

lcc_final_cube = open_dataset("/Net/Groups/BGI/scratch/dpabon/lcc_final_cube_2002_2011_8d-0.083deg.zarr")
lcc_final_cube = Cube(lcc_final_cube)


lcc_classes = ["Evergreen_Broadleaf_Forests", 
    "Deciduous_Broadleaf_Forests",
    "Mixed_Forest",
    "Savannas", 
    "Grasslands",
    "Croplands"]


lcc_small = lcc_final_cube[lon =(-78, -68), lat = (-5, 15), time = 2002:2003, classes = lcc_classes]



lst_small = open_dataset("/Net/Groups/BGI/scratch/dpabon/lst_small.zarr")
lst_small = Cube(lst_small)

renameaxis!(lcc_small, "lon" => getAxis("lon", lst_small))
renameaxis!(lcc_small, "lat" => getAxis("lat", lst_small))

 

indims_cube_1 = InDims("time")

indims_cube_2 = InDims("time", "classes")

out_1_dims = OutDims("time", CategoricalAxis("summary_stat", ["rsquared"]))
    

mapCube(simple_function, (lst_small, lcc_small), indims = (indims_cube_1, indims_cube_2), outdims = out_1_dims)

produce:

ERROR: On worker 2:
BoundsError: attempt to access 20×6×120×45 Array{Float32, 4} at index [1:20, 1, 1225:1296, 901:945]
Stacktrace:
  [1] throw_boundserror
    @ ./abstractarray.jl:703
  [2] checkbounds
    @ ./abstractarray.jl:668 [inlined]
  [3] _setindex!
    @ ./multidimensional.jl:929 [inlined]
  [4] setindex!
    @ ./abstractarray.jl:1344 [inlined]
  [5] filldata!
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:127
  [6] #disk_getindex_batch!#38
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:113
  [7] disk_getindex_batch
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:121
  [8] batchgetindex
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/batchgetindex.jl:72
  [9] getindex_disk
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:28
 [10] getindex
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:177
 [11] readblock!
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrayTools/WsUY6/src/DiskArrayTools.jl:250
 [12] readblock!
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/subarray.jl:25
 [13] getindex_disk
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:31
 [14] getindex
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/DiskArrays/f8PI0/src/diskarray.jl:177
 [15] updatear
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:639
 [16] #78
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:571
 [17] #59
    @ ./tuple.jl:556 [inlined]
 [18] BottomRF
    @ ./reduce.jl:81 [inlined]
 [19] _foldl_impl
    @ ./reduce.jl:62
 [20] foldl_impl
    @ ./reduce.jl:48 [inlined]
 [21] mapfoldl_impl
    @ ./reduce.jl:44 [inlined]
 [22] #mapfoldl#259
    @ ./reduce.jl:170 [inlined]
 [23] #foldl#260
    @ ./reduce.jl:193 [inlined]
 [24] foreach
    @ ./tuple.jl:556 [inlined]
 [25] updatears
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:568
 [26] updateinars
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:656
 [27] #107
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:700
 [28] fnew
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/Fe7F8/src/DAT/DAT.jl:665
 [29] #56
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1016
 [30] #invokelatest#2
    @ ./essentials.jl:729
 [31] invokelatest
    @ ./essentials.jl:726
 [32] #110
    @ /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:285
 [33] run_work_thunk
    @ /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:70
 [34] macro expansion
    @ /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:285 [inlined]
 [35] #109
    @ ./task.jl:484
Stacktrace:
  [1] (::Base.var"#939#941")(x::Task)
    @ Base ./asyncmap.jl:177
  [2] foreach(f::Base.var"#939#941", itr::Vector{Any})
    @ Base ./abstractarray.jl:2774
  [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::DiskArrays.GridChunks{2})
    @ Base ./asyncmap.jl:177
  [4] wrap_n_exec_twice
    @ ./asyncmap.jl:153 [inlined]
  [5] #async_usemap#924
    @ ./asyncmap.jl:103 [inlined]
  [6] #asyncmap#923
    @ ./asyncmap.jl:81 [inlined]
  [7] pmap(f::Function, p::WorkerPool, c::DiskArrays.GridChunks{2}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
    @ Distributed /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:126
  [8] pmap(f::Function, p::WorkerPool, c::DiskArrays.GridChunks{2})
    @ Distributed /Net/Groups/BGI/people/dpabon/bin/julia-1.8.5-linux-x86_64/julia-1.8.5/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:99
  [9] macro expansion
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1015 [inlined]
 [10] macro expansion
    @ ./task.jl:454 [inlined]
 [11] macro expansion
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1014 [inlined]
 [12] macro expansion
    @ ./task.jl:454 [inlined]
 [13] progress_map(::Function, ::Vararg{Any}; mapfun::typeof(pmap), progress::ProgressMeter.Progress, channel_bufflen::Int64, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ ProgressMeter /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1007
 [14] #progress_pmap#60
    @ /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/ProgressMeter/sN2xr/src/ProgressMeter.jl:1032 [inlined]
 [15] pmap_with_data(f::Function, p::WorkerPool, c::DiskArrays.GridChunks{2}; initfunc::Function, progress::ProgressMeter.Progress, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:668
 [16] pmap_with_data(f::Function, c::DiskArrays.GridChunks{2}; initfunc::Function, kwargs::Base.Pairs{Symbol, ProgressMeter.Progress, Tuple{Symbol}, NamedTuple{(:progress,), Tuple{ProgressMeter.Progress}}})
    @ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:673
 [17] runLoop(dc::YAXArrays.DAT.DATConfig{2, 1}, showprog::Bool)
    @ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:698
 [18] mapCube(::typeof(simple_function), ::Tuple{YAXArray{Union{Missing, Float32}, 3, DiskArrayTools.CFDiskArray{Float32, 3, Float32, DiskArrays.SubDiskArray{Float32, 3}}, Vector{RangeAxis}}, YAXArray{Union{Missing, Float32}, 4, DiskArrays.SubDiskArray{Union{Missing, Float32}, 4}, Vector{CubeAxis}}}; max_cache::Float64, indims::Tuple{InDims, InDims}, outdims::OutDims, inplace::Bool, ispar::Bool, debug::Bool, include_loopvars::Bool, showprog::Bool, irregular_loopranges::Bool, nthreads::Dict{Int64, Int64}, loopchunksize::Dict{Any, Any}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ YAXArrays.DAT /Net/Groups/BGI/people/dpabon/bin/julia_packages/packages/YAXArrays/au5n4/src/DAT/DAT.jl:475
 [19] top-level scope
    @ ~/dpabon/nfdi4earth_oemc/bin/julia/exploratory_analysis/minimal_example_batch_issue.jl:50

What is the most efficient way to access discrete columns of a chunked array?

What is the most efficient way to access discrete columns of a zarr array? This follows on from the conversation:
here

Let's say I have a Zarr array that looks like this:

z = rand(100,100,1000)

and let's pretend that z is a chunked Zarr array that lives on S3

Now if I want to get a series of discrete columns from z:

z[3,5,:]
z[5,9,:]
z[10,6,:]

Is there a way I can do this that is efficient and does not read in the same chunk of data more than once?
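One way to avoid reading the same chunk more than once is to group the requested columns by the spatial chunk they fall in, read each of those chunks once, and slice the columns out of the in-memory block. A rough sketch, assuming the array is chunked only in the first two dimensions and the chunk size (cx, cy) is known (read_columns is a hypothetical helper, not an existing function):

# Sketch only: each spatial chunk that contains a requested column is read exactly once.
# `cols` is a vector of (i, j) index pairs into the first two dimensions of `z`.
function read_columns(z, cols::Vector{Tuple{Int,Int}}, cx::Int, cy::Int)
    out = Dict{Tuple{Int,Int},Vector{eltype(z)}}()
    bychunk = Dict{Tuple{Int,Int},Vector{Tuple{Int,Int}}}()
    for (i, j) in cols                          # group requests by the chunk they fall in
        push!(get!(bychunk, (cld(i, cx), cld(j, cy)), Tuple{Int,Int}[]), (i, j))
    end
    for ((ci, cj), hits) in bychunk
        r1 = ((ci - 1) * cx + 1):min(ci * cx, size(z, 1))
        r2 = ((cj - 1) * cy + 1):min(cj * cy, size(z, 2))
        block = z[r1, r2, :]                    # one read covering all hits in this chunk
        for (i, j) in hits
            out[(i, j)] = block[i - first(r1) + 1, j - first(r2) + 1, :]
        end
    end
    return out
end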

Support permutedims and reshape

Currently both of these work on DiskArrays because of the AbstractArray subtype, but they do not get the AbstractDiskArray behavior. So one would have to define getindex, setindex! and eachchunk in a similar way as is done in subarrays.jl, for ReshapedArray{<:Any,<:Any,<:AbstractDiskArray} and PermutedDimsArray{<:Any,<:Any,<:AbstractDiskArray}.

Finding an org for this package

As already discussed in #2, before registering this package I would like to decide on the org where the repo should be hosted. It was suggested that JuliaArrays would be a good fit. It is not obvious who to contact, so I am pinging @mbauman, who was involved in the initial Discourse discussion, to ask whether this package would meet the criteria to be included in JuliaArrays.

Array conversion much slower than access via [:,:]

When I want to convert my DiskArray into a normal in-memory Array, the conversion with Array is very slow compared to getindex access.
Is this something that we could fix in DiskArrays or is there a better method to do the conversion?

julia> @time palfreqdata = palfreq[:,:]
  0.072899 seconds (157 allocations: 166.524 MiB)
5820×7500 Matrix{Float32}:
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32  …  1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 ⋮                                       ⋱
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32

julia> @time Array(palfreq.data)
  9.528337 seconds (218.25 M allocations: 14.635 GiB, 21.90% gc time)
5820×7500 Matrix{Float32}:
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32  …  1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 ⋮                                       ⋱
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32
 1.0f32  1.0f32  1.0f32  1.0f32  1.0f32     1.0f32  1.0f32  1.0f32  1.0f32  1.0f32

Don't load whole chunk when you don't need to, e.g. with NetCDF

We can likely get around this problem by skipping batchgetindex for small indexing operations. We may also need to facilitate easy backend customisation of this behaviour.

Originally posted by @Alexander-Barth in #131 (comment)

In the worst-case scenario (reading a time series for spatial chunked data), one would read a single value from each chunk:

Test with NCDatasets 0.12.17:

using BenchmarkTools
using NCDatasets

sz = (1024,1024,64)
ds = NCDataset("/tmp/tmp.nc","c")
defDim(ds,"lon",sz[1])
defDim(ds,"lat",sz[2])
defDim(ds,"time",sz[3])

defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)

ds["data"][:] = randn(sz);
ds["datac"][:] = randn(sz);

@btime ds["data"][1,1,1];
@btime ds["data"][:,:,1];

@btime ds["datac"][1,1,1];
@btime ds["datac"][:,:,1];

output:

julia> @btime ds["data"][1,1,1];
  15.069 μs (34 allocations: 1.62 KiB)

julia> @btime ds["data"][:,:,1];
  1.014 ms (35 allocations: 8.00 MiB)

julia> @btime ds["datac"][1,1,1];
  13.518 μs (34 allocations: 1.62 KiB)

julia> @btime ds["datac"][:,:,1];
  470.025 μs (35 allocations: 8.00 MiB)

So reading an entire chunk (if you do not need it) is 67 times slower for uncompressed data and 36 times slower for compressed data. But we read the same data multiple times in @btime.
The netCDF library likely caches the data. Some applications might indeed read the same data multiple times, but this is less typical. If the data is read for the first time, the differences are not as large:

sz = (1024,1024,64)
isfile("/tmp/tmp3.nc") && rm("/tmp/tmp3.nc")
ds = NCDataset("/tmp/tmp3.nc","c")
defDim(ds,"lon",sz[1])
defDim(ds,"lat",sz[2])
defDim(ds,"time",sz[3])

defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)

ds["data"][:] = randn(sz);
ds["datac"][:] = randn(sz);
close(ds)

ds = NCDataset("/tmp/tmp3.nc")
# warm-up
ds["data"][1,1,1];
ds["data"][:,:,2];

@time ds["data"][1,1,3];
@time ds["data"][:,:,4];

@time ds["datac"][1,1,3];
@time ds["datac"][:,:,4];

output

  0.001231 seconds (34 allocations: 1.625 KiB)
  0.001682 seconds (35 allocations: 8.002 MiB)
  0.031734 seconds (34 allocations: 1.625 KiB)
  0.038516 seconds (35 allocations: 8.002 MiB)

So a difference of about 20 % in this case.

For:

@time ds["data"][512,512,3];
@time ds["data"][:,:,4];

@time ds["datac"][512,512,3];
@time ds["datac"][:,:,4];

I get (loading the first time)

  0.026237 seconds (36 allocations: 1.656 KiB)
  0.045077 seconds (35 allocations: 8.002 MiB)
  0.033074 seconds (36 allocations: 1.656 KiB)
  0.038558 seconds (35 allocations: 8.002 MiB)

If the data is in the libnetcdf chunk cache (using @btime), I get:

julia> @btime ds["data"][512,512,3];
  15.549 μs (36 allocations: 1.66 KiB)

julia> @btime ds["data"][:,:,4];
  421.606 μs (35 allocations: 8.00 MiB)

julia> @btime ds["datac"][512,512,3];
  15.361 μs (36 allocations: 1.66 KiB)

julia> @btime ds["datac"][:,:,4];
  417.527 μs (35 allocations: 8.00 MiB)

Feel free to make other benchmarks with other indices.
Another use case is loading data over the network via OPeNDAP or HTTP byte-range requests; you certainly do not want to transfer more data than necessary.

Feature Request: cache chunk data

It would be more efficient to cache the chunk currently in use when a for loop iterates over the data, because each read has non-negligible overhead (e.g. in HDF5.jl).
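A rough sketch of what such a chunk cache could look like as a wrapper type (CachedDiskArray is a hypothetical name, not an existing DiskArrays type, and this version never evicts cached blocks):

using DiskArrays

# Illustrative sketch of a chunk-caching wrapper around another disk array.
struct CachedDiskArray{T,N,A<:AbstractDiskArray{T,N}} <: AbstractDiskArray{T,N}
    parent::A
    cache::Dict{NTuple{N,UnitRange{Int}},Array{T,N}}
end
CachedDiskArray(a::AbstractDiskArray{T,N}) where {T,N} =
    CachedDiskArray{T,N,typeof(a)}(a, Dict{NTuple{N,UnitRange{Int}},Array{T,N}}())
Base.size(a::CachedDiskArray) = size(a.parent)
DiskArrays.haschunks(a::CachedDiskArray) = DiskArrays.haschunks(a.parent)
DiskArrays.eachchunk(a::CachedDiskArray) = DiskArrays.eachchunk(a.parent)

function DiskArrays.readblock!(a::CachedDiskArray, aout, i::AbstractUnitRange...)
    key = map(r -> Int(first(r)):Int(last(r)), i)
    block = get!(a.cache, key) do
        a.parent[i...]              # hits the disk only on the first request for this block
    end
    aout .= block
    return aout
end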

Backend generalisation

@meggart instead of using Flatten.jl we could have a mechanism for propagating the engine trait between DiskArray layers? It might be more robust.

So they can all just forward to the parent disk object, but e.g. BroadcastDiskArray might need to store the trait somewhere.

DiskArrays.jl vs BlockArrays.jl

Hey,

this is probably old news for you, but I just discovered BlockArrays.jl. If I understand correctly, BlockArrays.jl implements a lot of, if not all, the features we need for DiskArrays.jl.

Does this make DiskArrays.jl obsolete?

Sorry if I'm going through the same thought process you already covered long ago, but maybe this can help.

Fix method ambiguities

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [4c88cf16] Aqua v0.5.5
  [3c3547ce] DiskArrays v0.3.6

julia> using DiskArrays, Aqua

julia> Aqua.test_all(DiskArrays)
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
13 ambiguities found
Ambiguity #1
copyto!(dest::AbstractMatrix, Rdest::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}, src::SparseArrays.AbstractSparseMatrixCSC{T}, Rsrc::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}) where T in SparseArrays at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SparseArrays/src/sparsematrix.jl:481
copyto!(dest::DiskArrays.AbstractDiskArray, Rdest::CartesianIndices, src::AbstractArray, Rsrc::CartesianIndices) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:8

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T, 2} where T, ::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}, ::SparseArrays.AbstractSparseMatrixCSC{T}, ::CartesianIndices{2, R} where R<:Tuple{OrdinalRange{Int64, Int64}, OrdinalRange{Int64, Int64}}) where T

Ambiguity #2
reverse(a::DiskArrays.AbstractDiskArray, dims) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:23
reverse(A::AbstractVector, start::Integer) in Base at array.jl:1803

Possible fix, define
  reverse(::DiskArrays.AbstractDiskArray{T, 1} where T, ::Integer)

Ambiguity #3
copyto!(dest::AbstractArray, D::SuiteSparse.CHOLMOD.Dense) in SuiteSparse.CHOLMOD at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SuiteSparse/src/cholmod.jl:830
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray, ::SuiteSparse.CHOLMOD.Dense)

Ambiguity #4
copyto!(A::AbstractMatrix, B::SparseArrays.AbstractSparseMatrixCSC) in SparseArrays at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SparseArrays/src/sparsematrix.jl:458
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T, 2} where T, ::SparseArrays.AbstractSparseMatrixCSC)

Ambiguity #5
copyto!(dest::AbstractMatrix, src::LinearAlgebra.AbstractQ) in LinearAlgebra at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/LinearAlgebra/src/qr.jl:599
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T, 2} where T, ::LinearAlgebra.AbstractQ)

Ambiguity #6
commonlength(::Tuple{}, b) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:154
commonlength(a, ::Tuple{}) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:155

Possible fix, define
  commonlength(::Tuple{}, ::Tuple{})

Ambiguity #7
_commonlength(a1, b1, ::Tuple{}, b) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:157
_commonlength(a1, b1, a, ::Tuple{}) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/diskarray.jl:158

Possible fix, define
  _commonlength(::Any, ::Any, ::Tuple{}, ::Tuple{})

Ambiguity #8
copyto!(dest::AbstractArray, source::DiskArrays.AbstractDiskArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:6
copyto!(dest::PermutedDimsArray, src::AbstractArray) in Base.PermutedDimsArrays at permuteddimsarray.jl:215

Possible fix, define
  copyto!(::PermutedDimsArray, ::DiskArrays.AbstractDiskArray)

Ambiguity #9
copyto!(dest::AbstractArray{T}, D::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64} in SuiteSparse.CHOLMOD at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SuiteSparse/src/cholmod.jl:828
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T}, ::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64}

Ambiguity #10
copyto!(A::SparseArrays.SparseVector, B::AbstractVector) in SparseArrays at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SparseArrays/src/sparsevector.jl:501
copyto!(dest::AbstractArray, source::DiskArrays.AbstractDiskArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:6

Possible fix, define
  copyto!(::SparseArrays.SparseVector, ::DiskArrays.AbstractDiskArray{T, 1} where T)

Ambiguity #11
copyto!(dest::AbstractArray, source::DiskArrays.AbstractDiskArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:6
copyto!(dest::PermutedDimsArray{T, N}, src::AbstractArray{T, N}) where {T, N} in Base.PermutedDimsArrays at permuteddimsarray.jl:211

Possible fix, define
  copyto!(::PermutedDimsArray{T, N}, ::DiskArrays.AbstractDiskArray{T, N}) where {T, N}

Ambiguity #12
_reshape(parent::DiskArrays.AbstractDiskArray, dims::Tuple{Vararg{Int64, var"#164#N"}}) where var"#164#N" in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/reshape.jl:79
_reshape(v::AbstractVector, dims::Tuple{Int64}) in Base at reshapedarray.jl:168

Possible fix, define
  _reshape(::DiskArrays.AbstractDiskArray{T, 1} where T, ::Tuple{Int64})

Ambiguity #13
copyto!(dest::AbstractMatrix{T}, D::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64} in SuiteSparse.CHOLMOD at /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/SuiteSparse/src/cholmod.jl:829
copyto!(dest::DiskArrays.AbstractDiskArray, source::AbstractArray) in DiskArrays at /Users/yeesian/.julia/packages/DiskArrays/bcti2/src/array.jl:5

Possible fix, define
  copyto!(::DiskArrays.AbstractDiskArray{T, 2}, ::SuiteSparse.CHOLMOD.Dense{T}) where T<:Union{Float64, ComplexF64}

Method ambiguity: Test Failed at /Users/yeesian/.julia/packages/Aqua/HWLbM/src/ambiguities.jl:117
  Expression: success(pipeline(cmd; stdout = stdout, stderr = stderr))
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] _test_ambiguities(packages::Vector{Base.PkgId}; color::Nothing, exclude::Vector{Any}, detect_ambiguities_options::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Aqua ~/.julia/packages/Aqua/HWLbM/src/ambiguities.jl:117
Test Summary:    | Fail  Total  Time
Method ambiguity |    1      1  1.7s
ERROR: Some tests did not pass: 0 passed, 1 failed, 0 errored, 0 broken.
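
For most of these, the listed "possible fix" is just a disambiguating method that forwards to the existing generic path. A hedged sketch for ambiguity #8 (illustrative only, not necessarily how the package should resolve it):

Base.copyto!(dest::PermutedDimsArray, src::DiskArrays.AbstractDiskArray) =
    invoke(copyto!, Tuple{AbstractArray,DiskArrays.AbstractDiskArray}, dest, src)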

Implement reverse indexing

zarr[end:-1:1, :, :] not working

There is probably some lastindex or reverse-range handling missing; it should be added as a new indexing method.
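
A minimal sketch of one way to support descending ranges, assuming only the documented readblock! interface: read the covering ascending range into memory and pick out the requested elements in reverse. All names are illustrative and a is a one-dimensional AbstractDiskArray.

function getindex_descending(a, r::StepRange)
    step(r) < 0 || return a[r]                  # only descending ranges handled here
    fwd = last(r):first(r)                      # covering ascending unit range
    tmp = Array{eltype(a)}(undef, length(fwd))
    DiskArrays.readblock!(a, tmp, fwd)
    tmp[r .- (last(r) - 1)]                     # reorder into the requested (descending) order
end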

Fix zip and other cases of simultaneous iteration

zip is used in array equality, and breaks chunking order when arrays with different chunks are used.

We should at minimum fix it so that the chunks of the first disk array found are used for all others (and for regular arrays). This can be achieved with a RechunkDiskArray wrapper that simply forces a different chunk pattern. It may be slow in some cases but will at least be correct.

Later we could optimise the combined iteration pattern of all diskarrays passed to zip.
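
A rough sketch of such a RechunkDiskArray wrapper, assuming the documented chunk traits (haschunks, eachchunk) and readblock!; illustrative only, not the package's implementation:

struct RechunkDiskArray{T,N,A<:AbstractArray{T,N}} <: DiskArrays.AbstractDiskArray{T,N}
    parent::A
    chunks::DiskArrays.GridChunks{N}   # the chunk pattern to impose, e.g. taken from another array
end

Base.size(a::RechunkDiskArray) = size(a.parent)
DiskArrays.haschunks(::RechunkDiskArray) = DiskArrays.Chunked()
DiskArrays.eachchunk(a::RechunkDiskArray) = a.chunks
function DiskArrays.readblock!(a::RechunkDiskArray, aout, r::AbstractUnitRange...)
    aout .= a.parent[r...]
end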

dimensions get moved around in unintuitive ways when accessing DiskArray using CartesianIndex

I noticed that the dimensions get moved around in unintuitive ways:

dc = Zarr.zopen(path2zarr)

size(dc["v"])
(834, 834, 84396)

r = [1 4 6 1]
c = [7 8 9 3]

cartind = CartesianIndex.(r,c)

size(dc["v"][cartind, :])
(1, 84396, 4)

cartind = CartesianIndex.(r',c')

size(dc["v"][cartind, :])
(4, 84396, 1)

This is different behavior than for non-DiskArrays:

z = rand(100,100,1000)

cartind = CartesianIndex.(r,c)

size(z[cartind,:])
(1, 4, 1000)

cartind = CartesianIndex.(r',c')
size(z[cartind,:])
(4, 1, 1000)

Implement PermutedDimsArray for DiskArrays

Hi, I noticed bad performance when using PermutedDimsArray with ArchGDAL, and the issue is probably located here. See yeesian/ArchGDAL.jl#225, which explains the problem. However, I don't see an easy fix like the one proposed over there, as it runs into a stack overflow for AbstractDiskArrays.
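
A hedged sketch of what a lazy permutation wrapper could look like: it forwards block reads to the parent with the ranges mapped back through the inverse permutation, so a single readblock! call is issued instead of falling through to slow element-wise getindex. Names, field layout and chunk handling are assumptions, not the package's code.

struct PermutedDiskArray{T,N,A<:AbstractArray{T,N}} <: DiskArrays.AbstractDiskArray{T,N}
    parent::A
    perm::NTuple{N,Int}
    iperm::NTuple{N,Int}
end
PermutedDiskArray(p::AbstractArray{T,N}, perm::NTuple{N,Int}) where {T,N} =
    PermutedDiskArray{T,N,typeof(p)}(p, perm, Tuple(invperm(collect(perm))))

Base.size(a::PermutedDiskArray) = map(d -> size(a.parent, d), a.perm)

function DiskArrays.readblock!(a::PermutedDiskArray{T,N}, aout, r::AbstractUnitRange...) where {T,N}
    pr = ntuple(d -> r[a.iperm[d]], N)          # ranges in the parent's dimension order
    tmp = Array{T}(undef, length.(pr))
    DiskArrays.readblock!(a.parent, tmp, pr...) # one block read from the parent
    permutedims!(aout, tmp, a.perm)             # permute in memory
end
# (eachchunk/haschunks would similarly translate the parent's chunk pattern.)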

julia 1.9/nightly error: type Stateful has no field taken

ArchGDAL is currently failing tests on 1.9/nightly with "type Stateful has no field taken".
That seems to be due to the .taken here:

cistateold = chunkiter.taken
biter = iterate(bi, bstate)
if biter === nothing
    return nothing
end
innernow, bstatenew = biter
(chunkiter, innerinds) = bstatenew
if chunkiter.taken !== cistateold

This field was removed in JuliaLang/julia#45924.

This is the full error message from ArchGDAL.

Complex IO: Error During Test at /home/runner/work/ArchGDAL.jl/ArchGDAL.jl/test/test_rasterio.jl:294
  Test threw exception
  Expression: AG.getband(ds_copy, 1) == a
  type Stateful has no field taken
  Stacktrace:
   [1] getproperty
     @ ./Base.jl:37 [inlined]
   [2] iterate(a::ArchGDAL.IRasterBand{ComplexF32}, i::Tuple{OffsetArrays.OffsetMatrix{ComplexF32, Matrix{ComplexF32}}, DiskArrays.BlockedIndices{DiskArrays.GridChunks{2}}, Tuple{Base.Iterators.Stateful{DiskArrays.GridChunks{2}, Union{Nothing, Tuple{Any, Tuple{Any, Vararg{Any}}}}, Int64}, Base.Iterators.Stateful{CartesianIndices{2, Tuple{UnitRange{Int64}, UnitRange{Int64}}}, Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}, Int64}}})
     @ DiskArrays ~/.julia/packages/DiskArrays/VGEpq/src/iterator.jl:53
   [3] _zip_iterate_some
     @ ./iterators.jl:424 [inlined]
   [4] _zip_iterate_all
     @ ./iterators.jl:416 [inlined]
   [5] iterate
     @ ./iterators.jl:407 [inlined]
   [6] ==(A::ArchGDAL.IRasterBand{ComplexF32}, B::Matrix{ComplexF32})
     @ Base ./abstractarray.jl:2919
   [7] eval_test(evaluated::Expr, quoted::Expr, source::LineNumberNode, negate::Bool)
     @ Test /opt/hostedtoolcache/julia/nightly/x64/share/julia/stdlib/v1.10/Test/src/Test.jl:320
   [8] macro expansion
     @ /opt/hostedtoolcache/julia/nightly/x64/share/julia/stdlib/v1.10/Test/src/Test.jl:477 [inlined]
   [9] (::var"#391#393"{Matrix{ComplexF32}})(ds_copy::ArchGDAL.Dataset)
     @ Main ~/work/ArchGDAL.jl/ArchGDAL.jl/test/test_rasterio.jl:294

It's not directly clear to me what the right fix is. Is that calling length() on the iterator?
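
One possible, hedged fix would be to stop relying on the internal taken field (removed in Julia 1.9) and carry an explicit counter alongside the Stateful iterator, for example via a small wrapper like the one below. Names are illustrative, not the package's code.

mutable struct Counted{S}
    it::S          # e.g. a Base.Iterators.Stateful
    taken::Int     # how many elements have been consumed so far
end
Counted(it) = Counted(it, 0)

function Base.iterate(c::Counted, state=nothing)
    r = iterate(c.it)          # Stateful ignores any explicit state
    r === nothing && return nothing
    c.taken += 1
    (first(r), nothing)
end

The comparison chunkiter.taken !== cistateold then works against the wrapper's own counter on all Julia versions.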

Use TiledIteration.jl?

For GridChunks, we could use TiledIteration.jl instead of implementing everything here. In general, I wonder if TiledIteration.jl couldn't become the interface module for all tiled array iteration (if it isn't already).

I have an implementation of iteration using TiledIteration.jl in HDF5Arrays.jl, so I could make a PR, if there's interest.

If we limit ourselves to tiles (GridChunks in DiskArrays.jl), we can also use tiled indexing to make a lot of the code implementing broadcasting, reduce operations, etc. cleaner. Further, the code could be moved to TiledIteration.jl, so non-DiskArrays can benefit from the tools here, too. See this proposal for tiled indexing
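
For reference, TileIterator from TiledIteration.jl already produces the same kind of tuples of ranges that GridChunks yields; a small illustration (the exact iteration order shown in the comment is an assumption):

using TiledIteration

# Cover a 10×9 index domain with 5×3 tiles; boundary tiles are truncated,
# just like the last chunks of a GridChunks.
for tile in TileIterator((1:10, 1:9), (5, 3))
    println(tile)     # e.g. (1:5, 1:3), (6:10, 1:3), (1:5, 4:6), ...
end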

Example in README.md

Hi Fabian, I ran into some issues with the integration of NCDatasets/DiskArrays and would like to understand a bit better how DiskArrays works independently of NCDatasets. I noticed that RangeArray, used in the example in your README.md, is not defined.

julia> a = RangeArray(4,5,1)
ERROR: UndefVarError: RangeArray not defined

Maybe it should be something like a = PseudoDiskArray(randn(4,5,1)) if I understand the example.
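
For anyone else hitting this, here is a minimal in-memory stand-in that should work with the README example; it only assumes the documented interface (size, readblock!, writeblock!), and the name PseudoDiskArray is just the one suggested above:

using DiskArrays

struct PseudoDiskArray{T,N,A<:AbstractArray{T,N}} <: DiskArrays.AbstractDiskArray{T,N}
    parent::A
end
PseudoDiskArray(dims::Integer...) = PseudoDiskArray(randn(dims...))

Base.size(a::PseudoDiskArray) = size(a.parent)
function DiskArrays.readblock!(a::PseudoDiskArray, aout, r::AbstractUnitRange...)
    println("Reading block ", r)
    aout .= view(a.parent, r...)
end
function DiskArrays.writeblock!(a::PseudoDiskArray, v, r::AbstractUnitRange...)
    println("Writing block ", r)
    view(a.parent, r...) .= v
end

a = PseudoDiskArray(4, 5, 1)
a[:, 2:3, 1]   # issues a single readblock! call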

ConstructionBase.jl dep

Would it be possible to add a ConstructionBase.jl dependency here? (It's tiny.)

The reason is DiskArrays.SubDiskArray and probably other objects here can't be reconstructed from their fields alone - they need type information as well. ConstructionBase.jl defines constructorof to handle this. Then @set from Setfield.jl and reconstruct from Flatten.jl (and BangBang.jl, Accessors.jl etc) can update their fields with new objects.

My use case is opening GeoData.FileArray with write access through nested DiskArrays.SubDiskArray and whatever else is wrapping it.

As you don't know the number of wrappers or what they are it's hard to do this manually. Flatten.modify can just do it automatically and rewrap the whole stack however deep it is. But only if ConstructionBase.constructorof works for all the wrappers.

The alternative is to put it in ConstructionBaseExtras.jl. But using Requires.jl and a wrapper package is a half-baked measure without dependency bounds, so I'd like to ask for a direct dep first.
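
For context, the requested definition would be something along these lines; the exact field layout of SubDiskArray is an assumption here, so treat this purely as a sketch of the idea:

using ConstructionBase, DiskArrays

# Assuming SubDiskArray{T,N} has a single field holding the wrapped view,
# rebuild the type parameters from that field instead of requiring them up front.
ConstructionBase.constructorof(::Type{<:DiskArrays.SubDiskArray}) =
    v -> DiskArrays.SubDiskArray{eltype(v),ndims(v)}(v)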

Handle concatenation with `cat`

Currently cat is very slow, and may even give the wrong answer?

See: rafaqz/Rasters.jl#298

It could be more efficient to allocate the final array and then broadcast into views of it, instead of copying each disk array to memory separately.
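
A minimal sketch of that idea for two arrays concatenated along the third dimension, assuming a and b are AbstractDiskArrays (illustrative, not the package's implementation):

function eager_cat3(a, b)
    size(a)[1:2] == size(b)[1:2] || throw(DimensionMismatch("sizes do not match"))
    T = promote_type(eltype(a), eltype(b))
    out = Array{T}(undef, size(a, 1), size(a, 2), size(a, 3) + size(b, 3))
    out[:, :, 1:size(a, 3)] .= a                      # chunk-aware broadcast copy
    out[:, :, size(a, 3)+1:end] .= b
    out
end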

Iteration is broken with chunking

Iteration seems to be broken with chunking, so when the fallbacks are used by Base methods you just get the wrong answer, without an error. For example, Array(diskarray) currently uses iteration and can give the wrong result:

julia>   a = collect(reshape(1:90,10,9))
10×9 Matrix{Int64}:
  1  11  21  31  41  51  61  71  81
  2  12  22  32  42  52  62  72  82
  3  13  23  33  43  53  63  73  83
  4  14  24  34  44  54  64  74  84
  5  15  25  35  45  55  65  75  85
  6  16  26  36  46  56  66  76  86
  7  17  27  37  47  57  67  77  87
  8  18  28  38  48  58  68  78  88
  9  19  29  39  49  59  69  79  89
 10  20  30  40  50  60  70  80  90

julia>   a_disk = _DiskArray(a, chunksize=(5,3))
Disk Array with size 10 x 9

julia>   Array(a_disk)
10×9 Matrix{Int64}:
  1  21  16  31  51  46  61  81  76
  2  22  17  32  52  47  62  82  77
  3  23  18  33  53  48  63  83  78
  4  24  19  34  54  49  64  84  79
  5  25  20  35  55  50  65  85  80
 11   6  26  41  36  56  71  66  86
 12   7  27  42  37  57  72  67  87
 13   8  28  43  38  58  73  68  88
 14   9  29  44  39  59  74  69  89
 15  10  30  45  40  60  75  70  90

#37 may be a partial solution to this. But it would also be good to use other methods as much as possible so the iteration fallback isn't hit.
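
Until the iterator is fixed, a chunk-by-chunk copy gives the correct result; a minimal sketch assuming only eachchunk and getindex:

function materialize_chunked(a)
    out = Array{eltype(a)}(undef, size(a))
    for c in DiskArrays.eachchunk(a)   # c is a tuple of index ranges
        out[c...] = a[c...]
    end
    out
end

materialize_chunked(a_disk) == a   # true, unlike the iteration-based Array(a_disk) above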

lazy reverse(a; dims=n) with keyword dimension

It would be good to dispatch on the DiskArray implementation for the lazy reverse when the dimension is specified by a keyword argument.
It dispatches to the lazy method if the dimension is given as the second positional argument, but the keyword form falls back to the default Base implementation.

a::DiskArray

julia> reverse(a, dims=1) # Non lazy
10000×10000 Matrix{UInt8}:
 0x00  0x00  …  0x00  0x00
 0x00  0x00     0x00  0x00
    ⋮        ⋱        ⋮
 0x00  0x00     0x00  0x00
 0x00  0x00  …  0x00  0x00

julia> reverse(a, 1) # Lazy 
Disk Array with size 10000 x 10000
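
A hedged sketch of one way to fix this, forwarding the keyword form to the positional method that already returns the lazy wrapper (exact keyword handling, e.g. dims=: or tuples of dims, would need more care):

Base.reverse(a::DiskArrays.AbstractDiskArray; dims) = reverse(a, dims)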

Implementing in other packages

I think the general indexing functionality for this package is there now, so that one could use this package as a dependency for NetCDF, HDF5 or Zarr without losing functionality. Here we can collect a list of the packages that might profit from DiskArrays and link to the respective PRs:

@visr @rafaqz @Alexander-Barth any suggestions and feedback would be highly appreciated

Lazy mean, sum etc?

It seems possible to add a reducing-function wrapper that takes e.g. the mean or sum along an axis lazily, similar to how lazy broadcasting works.
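
A rough sketch of the idea (not an existing DiskArrays API): record the reduction lazily and only walk the chunks when the result is materialized; mean would additionally divide by size(parent, dims) at the end.

struct LazyReducedDim{F,A}
    f::F        # e.g. sum
    parent::A   # an AbstractDiskArray
    dims::Int
end

lazysum(a; dims) = LazyReducedDim(sum, a, dims)

function materialize(r::LazyReducedDim)
    n = ndims(r.parent)
    sz = ntuple(i -> i == r.dims ? 1 : size(r.parent, i), n)
    out = zeros(eltype(r.parent), sz)   # a real implementation would widen the accumulator type
    for c in DiskArrays.eachchunk(r.parent)               # one chunk in memory at a time
        oc = ntuple(i -> i == r.dims ? (1:1) : c[i], n)
        out[oc...] .+= sum(r.parent[c...]; dims=r.dims)
    end
    out
end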
