lorenzoh / dataloaders.jl

A parallel iterator for large machine learning datasets that don't fit into memory, inspired by PyTorch's `DataLoader` class.

Home Page: https://lorenzoh.github.io/DataLoaders.jl/docs/dev/interactive

License: MIT License

Julia 100.00%

dataloaders.jl's Introduction

Hi, I'm Lorenz 👋

I am passionate about computer vision, deep learning and software design.

In no particular order, things I am interested in: computer vision, software, martial arts, API design, Neapolitan pizza, procedural generation, pen & paper, developer experience, technical writing, reading, pose estimation, and more that I might find the time to add later 😅.

If you want to reach out, feel free to message me on my Twitter, @holylorenzo!

Here on GitHub, I mainly work on better tooling for deep learning in Julia. To do this, I've authored several Julia libraries, including:

  • FastAI.jl, a high-level interface for complete deep learning projects (approved by the fastai creators; see the announcement post)
  • FluxTraining.jl, an extensible training loop with best practices
  • DataLoaders.jl, an efficient, parallel data loader for larger-than-memory datasets
  • DataAugmentation.jl, high-performance, composable data augmentations for 2D and 3D spatial data like images, segmentation masks, and keypoints
  • Pollen.jl, a documentation system for building beautiful documentation pages for your libraries (see this example page)

I'm also a member of the FluxML organization. If you're interested in contributing, come say hi on the Zulip channel or on the biweekly ML ecosystem call (find the link here).

dataloaders.jl's People

Contributors

lorenzoh, nikopj, visr


dataloaders.jl's Issues

MWE - Error on single dim targets

Here is the MWE for the error that I encountered in #14:

x = rand(10, 10000)
y = rand(10000)
dataloader = DataLoader((x, y), BATCH)

In my dataset, the targets (y's) are just integers.
However, it seems the data loader can't deal with that.

How would I resolve this?

Thank you!
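A common workaround (a sketch, not an official fix) is to give the integer targets an explicit observation dimension by reshaping the vector into a 1×N matrix before constructing the loader:

```julia
# Hypothetical workaround: reshape the bare target vector into a 1×N
# matrix so each observation is a (one-element) array column rather
# than a scalar. The loader can then batch along the last dimension.
x = rand(10, 10000)
y = rand(1:5, 10000)        # integer targets as a plain vector
y_mat = reshape(y, 1, :)    # 1×10000; each column is one observation
size(y_mat)                 # (1, 10000)
```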

Really hard to understand error...

Something is happening if I try to use a DataLoader on my custom DataContainer...
I can't tell whether this is my fault or a problem in DataLoaders. It also hangs and does not terminate after this:

Base.StackTraces.StackFrame[(::DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}})(::Int64) at workerpool.jl:55, macro expansion at workerpool.jl:69 [inlined], #8 at macros.jl:19 [inlined], #63 at qpool.jl:195 [inlined], (::ThreadPools.var"#59#60"{ThreadPools.var"#63#65"{DataLoaders.var"#8#14"{DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}}}},Tuple{Int64,Int64}})() at qpool.jl:86]
(the same stack frame is printed several more times)
┌ Error: Exception while executing task on worker 5. Shutting down WorkerPool.
│   e =
│    MethodError: no method matching copyrec!(::SubArray{Int64,0,Array{Int64,1},Tuple{Int64},true}, ::Int64)
│    Closest candidates are:
│      copyrec!(::AbstractArray, ::AbstractArray) at /home/andriy/.julia/packages/DataLoaders/uGlPg/src/batchview.jl:112
│   stacktrace =
│    6-element Array{Base.StackTraces.StackFrame,1}:
│     macro expansion at logging.jl:332 [inlined]
│     (::DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}})(::Int64) at workerpool.jl:56
│     macro expansion at workerpool.jl:69 [inlined]
│     #8 at macros.jl:19 [inlined]
│     #63 at qpool.jl:195 [inlined]
│     (::ThreadPools.var"#59#60"{ThreadPools.var"#63#65"{DataLoaders.var"#8#14"{DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}}}},Tuple{Int64,Int64}})() at qpool.jl:86
│   args = 1
└ @ DataLoaders ~/.julia/packages/DataLoaders/uGlPg/src/workerpool.jl:56

(Analogous errors follow for workers 3, 4, 2, 8, 6, and 7; only the worker number and `args` value differ.)

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

MWE for the crash

Ok, finally figured out what causes #14 to crash:

using DataLoaders

features = rand(2)
dataloader = DataLoader((features, features), 100, useprimary = true)

for (xs, ys) in dataloader
    print(size(xs))
    print(size(ys))
    break
end

You will see that it crashes during the loop and then hangs inside it.

WebDataset.jl and comments

PyTorch is currently rearchitecting their I/O pipelines because the indexed datasets don't scale well to large learning problems. Many pipelines will likely be based on IterableDataset.

The changes to PyTorch are related to our WebDataset library (github.com/tmbdev/webdataset), which demonstrably provides linearly scalable I/O for large scale deep learning.

I have recently written a first implementation of WebDataset.jl that can read the same format; it provides multithreaded I/O and decoding, as well as hooks for sharding and shuffling. It's at github.com/tmbdev/WebDataset.jl

As an aside, the use of tuples to represent training data in PyTorch (rather than structs/records/dicts) has been a perennial problem for reusing and debugging I/O pipelines, and probably not a design that should be carried over.

Anyway, my suggestion would be to have a look at WebDataset.jl and see whether it or parts of it are useful in the architecture of future dataloaders.
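The tuple-versus-record point can be illustrated with a minimal Julia sketch (field names here are hypothetical): with a NamedTuple, pipeline stages address fields by name, so reordering or adding fields does not silently break downstream code the way positional tuples can.

```julia
# Positional tuple: the meaning of each slot depends on order alone.
sample_tuple = (rand(Float32, 28, 28), 7)
img, label = sample_tuple        # breaks silently if the order changes

# Named record: fields are addressed by name and self-documenting.
sample = (image = rand(Float32, 28, 28), label = 7)
sample.label                     # 7, regardless of field order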

Reproducibility problem with multi-threading

When I used DataLoaders.jl in my project, especially for deep learning, I encountered a reproducibility problem when multi-threading is enabled. Below is an MWE that describes the issue. Here, MyDataset simply returns idx, the second argument of the getobs method.

# example.jl
module My

import DataLoaders.LearnBase: getobs, nobs
using Random

struct MyDataset
    ndata::Int
end

Base.getindex(dset::MyDataset, idx) = idx
getobs(dset::MyDataset, idx) = dset[idx]
nobs(dset::MyDataset) = dset.ndata

end # My

using DataLoaders
using Random

using .My

MyDataset = My.MyDataset

ntrial = 3

for t in 1:ntrial
    dset = MyDataset(10000) # create an instance of MyDataset
    loader = DataLoader(dset, 100) # setup loader
    for batch in loader
        @show batch # <------
        println()
        break
    end
end

From my understanding, for each t in 1:ntrial, @show batch should display the array from 1 to 100, namely:

batch = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]

On the other hand, the actual output of the example.jl script above is something like:

$ julia --threads=12 example.jl # num thread = 12
batch = [301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400]

batch = [401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500]

batch = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]

This phenomenon occurs whenever we specify more than one thread.
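As a point of comparison (a sketch using the MyDataset-style container from the MWE, not a fix in DataLoaders itself), a batch built directly from the container is always deterministic, since the nondeterminism comes from thread scheduling in the parallel iterator, not from the container:

```julia
# Building a batch directly from the container is deterministic:
# the scheduling-dependent part is the parallel iterator, not `getobs`.
struct MyDataset
    ndata::Int
end
getobs(dset::MyDataset, idx) = idx
nobs(dset::MyDataset) = dset.ndata

dset = MyDataset(10000)
batch = [getobs(dset, i) for i in 1:100]   # always [1, 2, ..., 100]
```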

Non-parallel access on multi-thread

I'm looking for an option to use a DataLoader with guaranteed non-shuffled order in a multi-threaded Julia session.
I noticed that the parallel keyword argument has been deprecated, though I'm not sure whether that option was enforced in a multi-threaded session anyway.

The issue is that when performing inference on large data, it is desirable to guarantee the iteration order so that the predictions come out in the same order as the original data. However, this apparently cannot be achieved if num_threads > 1.

MWE:

using DataLoaders
import LearnBase: nobs, getobs

struct MyContainer{S <: AbstractArray}
    x::S
    length::Int
end

nobs(data::MyContainer) = ceil(Int, size(data.x, 1) / data.length)

function getobs(data::MyContainer, idx::Int)
    println("get obs MyContainer - idx: ", idx)
    x = if idx < nobs(data)
        data.x[((idx - 1) * data.length + 1):(idx * data.length), :]
    else
        data.x[((idx - 1) * data.length + 1):end, :]
    end
    return x
end

x = rand(10, 2)
data = MyContainer(x, 4)
dloader = DataLoaders.DataLoader(data, nothing)

Then, randomness can be observed in the batch order:

julia> for x in dloader
           println("size(x): ", size(x))
       end
get obs MyContainer - idx: 3
get obs MyContainer - idx: 2
size(x): (2, 2)
get obs MyContainer - idx: 1
size(x): (4, 2)
size(x): (4, 2)

julia> for x in dloader
           println("size(x): ", size(x))
       end
get obs MyContainer - idx: 1
get obs MyContainer - idx: 3
get obs MyContainer - idx: 2
size(x): (4, 2)
size(x): (2, 2)
size(x): (4, 2)

Is it possible to enforce the returned idx to always be 1, 2, 3? Having an option to disable the multi-threaded fetch would do it. Alternatively, would it be feasible to keep the multi-threading in place but wait to return each result until the previous index has completed?
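Until such an option exists, one workaround (a single-threaded sketch against the getobs/nobs access pattern, not a DataLoaders feature) is to bypass the parallel loader for inference and iterate the container in index order by hand:

```julia
# Order-preserving, single-threaded iteration over a data container.
# Toy container accessors; a real container would define these methods.
nobs(v::AbstractVector) = length(v)
getobs(v::AbstractVector, i::Int) = v[i]

each_in_order(data) = (getobs(data, i) for i in 1:nobs(data))

collect(each_in_order([10, 20, 30]))   # [10, 20, 30], order guaranteed
```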

Iterating over a simple `DataLoader` crashes at end of iteration

New to DataLoaders, so maybe missing something here.

(jl_C01aMc) pkg> st
      Status `/private/var/folders/4n/gvbmlhdc8xj973001s6vdyw00000gq/T/jl_C01aMc/Project.toml`
  [2e981812] DataLoaders v0.1.2

julia> x = rand(3, 20);

julia> batches = DataLoader(x, 2, collate=false, partial=false);

julia> for b in batches
           println(size(b))
       end
(3, 2)
(3, 2)
(3, 2)
(3, 2)
(3, 2)
(3, 2)
(3, 2)
(3, 2)
(3, 2)
(3, 2)
Base.StackTraces.StackFrame[(::DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}})(args::Int64) at workerpool.jl:55, macro expansion at workerpool.jl:69 [inlined], #8 at macros.jl:19 [inlined], #62 at qpool.jl:195 [inlined], (::ThreadPools.var"#58#59"{ThreadPools.var"#62#64"{DataLoaders.var"#8#14"{DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}}}}, Tuple{Int64, Int64}})() at qpool.jl:86]
(the same stack frame is printed several more times)
┌ Error: Exception while executing task on worker 5. Shutting down WorkerPool.
│   e = BoundsError: attempt to access 3×20 Matrix{Float64} at index [1:3, 23:24]
│   stacktrace =
│    6-element Vector{Base.StackTraces.StackFrame}:
│     macro expansion at logging.jl:341 [inlined]
│     (::DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}})(args::Int64) at workerpool.jl:56
│     macro expansion at workerpool.jl:69 [inlined]
│     #8 at macros.jl:19 [inlined]
│     #62 at qpool.jl:195 [inlined]
│     (::ThreadPools.var"#58#59"{ThreadPools.var"#62#64"{DataLoaders.var"#8#14"{DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}}}}, Tuple{Int64, Int64}})() at qpool.jl:86
│   args = 12
└ @ DataLoaders ~/.julia/packages/DataLoaders/uGlPg/src/workerpool.jl:56
┌ Error: Exception while executing task on worker 4. Shutting down WorkerPool.
│   e = BoundsError: attempt to access 3×20 Matrix{Float64} at index [1:3, 27:28]
│   stacktrace =
│    6-element Vector{Base.StackTraces.StackFrame}:
│     macro expansion at logging.jl:341 [inlined]
│     (::DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}})(args::Int64) at workerpool.jl:56
│     macro expansion at workerpool.jl:69 [inlined]
│     #8 at macros.jl:19 [inlined]
│     #62 at qpool.jl:195 [inlined]
│     (::ThreadPools.var"#58#59"{ThreadPools.var"#62#64"{DataLoaders.var"#8#14"{DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}}}}, Tuple{Int64, Int64}})() at qpool.jl:86
│   args = 14
└ @ DataLoaders ~/.julia/packages/DataLoaders/uGlPg/src/workerpool.jl:56
┌ Error: Exception while executing task on worker 3. Shutting down WorkerPool.
│   e = BoundsError: attempt to access 3×20 Matrix{Float64} at index [1:3, 25:26]
│   stacktrace =
│    6-element Vector{Base.StackTraces.StackFrame}:
│     macro expansion at logging.jl:341 [inlined]
│     (::DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}})(args::Int64) at workerpool.jl:56
│     macro expansion at workerpool.jl:69 [inlined]
│     #8 at macros.jl:19 [inlined]
│     #62 at qpool.jl:195 [inlined]
│     (::ThreadPools.var"#58#59"{ThreadPools.var"#62#64"{DataLoaders.var"#8#14"{DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}}}}, Tuple{Int64, Int64}})() at qpool.jl:86
│   args = 13
└ @ DataLoaders ~/.julia/packages/DataLoaders/uGlPg/src/workerpool.jl:56
┌ Error: Exception while executing task on worker 2. Shutting down WorkerPool.
│   e = BoundsError: attempt to access 3×20 Matrix{Float64} at index [1:3, 21:22]
│   stacktrace =
│    6-element Vector{Base.StackTraces.StackFrame}:
│     macro expansion at logging.jl:341 [inlined]
│     (::DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}})(args::Int64) at workerpool.jl:56
│     macro expansion at workerpool.jl:69 [inlined]
│     #8 at macros.jl:19 [inlined]
│     #62 at qpool.jl:195 [inlined]
│     (::ThreadPools.var"#58#59"{ThreadPools.var"#62#64"{DataLoaders.var"#8#14"{DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}}}}, Tuple{Int64, Int64}})() at qpool.jl:86
│   args = 11
└ @ DataLoaders ~/.julia/packages/DataLoaders/uGlPg/src/workerpool.jl:56
julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_LTS_PATH = /Applications/Julia-1.0.app/Contents/Resources/julia/bin/julia
  JULIA_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_EGLOT_PATH = /Applications/Julia-1.5.app/Contents/Resources/julia/bin/julia
  JULIA_NUM_THREADS = 5
  JULIA_NIGHTLY_PATH = /Applications/Julia-1.7.app/Contents/Resources/julia/bin/julia

slidingwindow causes DataLoader to slow down

If I use a slidingwindow from MLDataPattern, the data loader seems to take a long time to instantiate...

x = rand(128, 300, 10000)  #  10000 observations of size 128
y = rand(1, 300, 10000)

WINDOW_SIZE = 50
BATCH = 100
x = slidingwindow(x, WINDOW_SIZE)
y = slidingwindow(y, WINDOW_SIZE)
dataloader = DataLoader((x, y), BATCH)    # this takes a Looong time

Vague out of bounds error when using sparse vectors

I am running into an issue where the WorkerPool is shut down due to an out-of-bounds index error when trying to iterate over a DataLoader. The issue only comes up when I am using sparse CUDA vectors in part of what getobs returns. I can provide more information, but at the moment I am uncertain what would be helpful, since the error message is too cryptic for me. Said error is given below:

Base.StackTraces.StackFrame[(::DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}})(args::Int64) at workerpool.jl:55, macro expansion at workerpool.jl:65 [inlined], #6 at macros.jl:19 [inlined], #62 at qpool.jl:195 [inlined], (::ThreadPools.var"#58#59"{ThreadPools.var"#62#64"{DataLoaders.var"#6#12"{DataLoaders.var"#inloop#10"{DataLoaders.WorkerPool{Int64}}}}, Tuple{Int64, Int64}})() at qpool.jl:86]
┌ Error: Exception while executing task on worker 1. Shutting down WorkerPool.
│   e = ArgumentError: An index is out of bound.
│   stacktrace =
│    6-element Vector{Base.StackTraces.StackFrame}:
│     macro expansion at logging.jl:341 [inlined]
│     ⋮
│   args = 1
└ @ DataLoaders ~/.julia/packages/DataLoaders/f2y29/src/workerpool.jl:56
^CERROR: LoadError: InterruptException:
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:710
  [2] wait()
    @ Base ./task.jl:770
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:106
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:389
  [5] take!
    @ ./channels.jl:383 [inlined]
  [6] iterate
    @ ~/.julia/packages/DataLoaders/f2y29/src/loaders.jl:39 [inlined]
  [7] iterate(iterparallel::DataLoaders.GetObsParallel{DataLoaders.BatchViewCollated{TrainMNIST}})
    @ DataLoaders ~/.julia/packages/DataLoaders/f2y29/src/loaders.jl:25

Semi-related to this: collate does not behave properly with sparse CUDA arrays, since cat converts them to a dense array. I solved that with the following:

DataLoaders.collate(samples::AbstractVector{<:CUDA.CUSPARSE.CuSparseVector{T, N}}) where {T, N} = hcat(samples...)

but the issue above was present before and after.

Thanks for the help!

Doc typo

[Screenshot of a documentation code example]

The last line of code makes no reference to the previous line. I guess `shuffleobs(data)` in the last line should be something else.

README example fails

I came across your Julia package and wanted to try it on my project. Unfortunately, it does not work in my environment. Here is what I did.


julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = subl
  JULIA_NUM_THREADS = 16
  JULIA_PROJECT = @.

julia> using DataLoaders

julia> x = rand(128, 10000);  #  10000 observations of size 128

julia> y = rand(1, 10000);

julia> dataloader = DataLoader((x, y), 16);

julia> for (i, (xs, ys)) in enumerate(dataloader)
           @assert size(xs) == (128, 16) "(i=$i, size(xs)=$(size(xs)))"
           @assert size(ys) == (1, 16) "(i=$i, size(xs)=$(size(xs)))"
       end
ERROR: AssertionError: (i=625, size(xs)=(128, 0))
Stacktrace:
 [1] top-level scope at ./REPL[6]:2

It seems the last batch returns 0-size data.
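Until the underlying bug is fixed, a defensive workaround (a sketch on plain arrays; the same guard would go inside the `for (xs, ys) in dataloader` loop) is to skip any zero-sized trailing batch before it reaches the training step:

```julia
# Defensive guard: drop any batch whose observation dimension is empty
# before it reaches the training step.
batches = [rand(128, 16), rand(128, 16), rand(128, 0)]  # last one empty
kept = [b for b in batches if size(b, 2) > 0]
length(kept)   # 2
```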

Rename/PR to MLDataPattern.jl

Great work!

I've been working on a similar idea, and I was wondering if you would consider making this work a PR to MLDataPattern.jl? The key feature here is the async iterator, and I was planning on adding such an iterator to MLDataPattern.jl. Features like collation and batching can be done as modifications to the existing BatchView in MLDataPattern.jl.

Tuples don't work!

Julia versions tried: 1.5.2 and 1.5.3, on Ubuntu 20.04.

If I run the tutorial example, I get the error below (note that it works if I just pass x, but not if I pass the tuple (x, y)):

using DataLoaders

x = rand(128, 10000)  #  10000 observations of size 128
y = rand(1, 10000)

dataloader = DataLoaders.DataLoader((x, y), 16)

for (xs, ys) in dataloader
    @assert size(xs) == (128, 16)
    @assert size(ys) == (1, 16)
end

Error:

ERROR: MethodError: no method matching iterate(::MLDataPattern.DataSubset{DataLoaders.BatchViewCollated{Tuple{Array{Float64,2},Array{Float64,2}}},Int64,LearnBase.ObsDim.Undefined})
Closest candidates are:
  iterate(!Matched::LibGit2.GitRebase) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LibGit2/src/rebase.jl:48
  iterate(!Matched::LibGit2.GitRebase, !Matched::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LibGit2/src/rebase.jl:48
  iterate(!Matched::ThreadPools.ResultIterator) at /home/andriy/.julia/packages/ThreadPools/hPQNy/src/qpool.jl:172
  ...
Stacktrace:
_zip_iterate_some at ./iterators.jl:352
_zip_iterate_some at ./iterators.jl:354
_zip_iterate_all at ./iterators.jl:344
iterate at ./iterators.jl:334
iterate at ./generator.jl:44
collect(::Base.Generator{Base.Iterators.Zip{Tuple{Tuple{Array{Float64,2},Array{Float64,2}},MLDataPattern.DataSubset{DataLoaders.BatchViewCollated{Tuple{Array{Float64,2},Array{Float64,2}}},Int64,LearnBase.ObsDim.Undefined}}},Base.var"#3#4"{typeof(LearnBase.getobs!)}}) at ./array.jl:686
map at ./abstractarray.jl:2248
iterate(::MLDataPattern.BufferGetObs{Tuple{Array{Float64,2},Array{Float64,2}},MLDataPattern.ObsView{MLDataPattern.DataSubset{DataLoaders.BatchViewCollated{Tuple{Array{Float64,2},Array{Float64,2}}},Int64,LearnBase.ObsDim.Undefined},DataLoaders.BatchViewCollated{Tuple{Array{Float64,2},Array{Float64,2}}},LearnBase.ObsDim.Undefined}}) at /home/andriy/.julia/packages/MLDataPattern/KlSmO/src/dataiterator.jl:524
top-level scope at /home/andriy/Projects/Projectjl/src/02.dataset.jl:17
include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
invokelatest(::Any, ::Any, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./essentials.jl:710
invokelatest(::Any, ::Any, ::Vararg{Any,N} where N) at ./essentials.jl:709
inlineeval(::Module, ::String, ::Int64, ::Int64, ::String; softscope::Bool) at /home/andriy/.vscode/extensions/julialang.language-julia-1.0.10/scripts/packages/VSCodeServer/src/eval.jl:185
(::VSCodeServer.var"#61#65"{String,Int64,Int64,String,Module,Bool,VSCodeServer.ReplRunCodeRequestParams})() at /home/andriy/.vscode/extensions/julialang.language-julia-1.0.10/scripts/packages/VSCodeServer/src/eval.jl:144
withpath(::VSCodeServer.var"#61#65"{String,Int64,Int64,String,Module,Bool,VSCodeServer.ReplRunCodeRequestParams}, ::String) at /home/andriy/.vscode/extensions/julialang.language-julia-1.0.10/scripts/packages/VSCodeServer/src/repl.jl:124
(::VSCodeServer.var"#60#64"{String,Int64,Int64,String,Module,Bool,Bool,VSCodeServer.ReplRunCodeRequestParams})() at /home/andriy/.vscode/extensions/julialang.language-julia-1.0.10/scripts/packages/VSCodeServer/src/eval.jl:142
hideprompt(::VSCodeServer.var"#60#64"{String,Int64,Int64,String,Module,Bool,Bool,VSCodeServer.ReplRunCodeRequestParams}) at /home/andriy/.vscode/extensions/julialang.language-julia-1.0.10/scripts/packages/VSCodeServer/src/repl.jl:36
(::VSCodeServer.var"#59#63"{String,Int64,Int64,String,Module,Bool,Bool,VSCodeServer.ReplRunCodeRequestParams})() at /home/andriy/.vscode/extensions/julialang.language-julia-1.0.10/scripts/packages/VSCodeServer/src/eval.jl:110
with_logstate(::Function, ::Any) at ./logging.jl:408
with_logger at ./logging.jl:514
(::VSCodeServer.var"#58#62"{VSCodeServer.ReplRunCodeRequestParams})() at /home/andriy/.vscode/extensions/julialang.language-julia-1.0.10/scripts/packages/VSCodeServer/src/eval.jl:109
#invokelatest#1 at ./essentials.jl:710
invokelatest(::Any) at ./essentials.jl:709
macro expansion at /home/andriy/.vscode/extensions/julialang.language-julia-1.0.10/scripts/packages/VSCodeServer/src/eval.jl:27
(::VSCodeServer.var"#56#57")() at ./task.jl:356

Output of ]st:

 [6e4b80f9] BenchmarkTools v0.5.0
  [336ed68f] CSV v0.8.2
  [a93c6f00] DataFrames v0.21.8
  [2e981812] DataLoaders v0.1.1
  [743a1d0a] DataTables v0.1.0
  [aae7a2af] DiffEqFlux v1.25.0
  [587475ba] Flux v0.11.2
  [7f8f8fb0] LearnBase v0.3.0
  [9920b226] MLDataPattern v0.5.4
  [cc2ba9b6] MLDataUtils v0.5.2
  [14b8a8f1] PkgTemplates v0.7.13
  [3cdcf5f2] RecipesBase v1.1.1
  [f3b207a7] StatsPlots v0.14.17
  [09ab397b] StructArrays v0.4.4
  [b8865327] UnicodePlots v1.3.0

Error message in inloop should print stacktrace of error origin?

In case the error happens deep down in some generic function it will be easier to track down if catch_backtrace is used, for example like this:

@error "Exception while executing task on worker $(Threads.threadid()). Shutting down WorkerPool." e = e stacktrace = stacktrace(catch_backtrace()) args = args
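A minimal self-contained sketch of the idea, using a hypothetical `run_task` wrapper (not part of DataLoaders.jl): calling `stacktrace(catch_backtrace())` inside the `catch` block recovers the frames at the original throw site, so the log points at the error's origin rather than at the logging call.

```julia
# Hypothetical worker-task wrapper illustrating the suggestion above.
function run_task(f, args...)
    try
        f(args...)
    catch e
        # catch_backtrace() returns the backtrace of the most recently
        # caught exception, i.e. the frames where the error was thrown.
        @error "Exception while executing task on worker $(Threads.threadid()). Shutting down WorkerPool." e stacktrace = stacktrace(catch_backtrace()) args = args
        rethrow()
    end
end
```

After logging, `rethrow()` preserves the original exception so the pool can still shut down as before.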

Slicing for DataLoaders

I should like to see enhancements in the MLJ ecosystem that allow models to work with out-of-memory data, and wonder if DataLoaders might be a good tool here. The main issue, as far as I can tell, is around observation resampling, which is the basis of performance evaluation, and by corollary, hyper-parameter optimization - meta-algorithms that can be applied to all current MLJ models.

So, I wondered if it would be possible for this package to implement getindex(::DataLoader, indxs), for indxs isa Union{Colon,Integer,AbstractVector{<:Integer}}, returning an object with the same interface.

This could be a new SubDataLoader object, but in any case it would be important for the original eltype to be knowable (assuming eltype is implemented for the original object, or you add it as a type parameter).

Since the DataLoader type already requires the implementation of the "random access" getobs method, this looks quite doable to me.
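To make the request concrete, here is a hedged sketch of what such a view could look like. Neither `SubDataLoader` nor these `getindex` methods exist in DataLoaders.jl; the sketch only shows how a lazy index view could be layered on top of a container with random observation access, so that slicing returns an object with the same interface.

```julia
# Hypothetical SubDataLoader: a lazy view over a subset of observation indices.
struct SubDataLoader{D}
    data::D               # wrapped container with random access to observations
    indices::Vector{Int}  # observation indices selected by the slice
end

# Slicing with a vector of indices returns another SubDataLoader,
# so resampling strategies can compose views of views.
Base.getindex(d::SubDataLoader, idxs::AbstractVector{<:Integer}) =
    SubDataLoader(d.data, d.indices[idxs])

# Integer indexing resolves through the view to the underlying data.
Base.getindex(d::SubDataLoader, i::Integer) = d.data[d.indices[i]]

# Colon is a no-op: the full view.
Base.getindex(d::SubDataLoader, ::Colon) = d

Base.length(d::SubDataLoader) = length(d.indices)

# Example: a view over ten "observations", then a slice of it.
sub = SubDataLoader(collect(10:10:100), collect(1:10))
sub2 = sub[3:5]   # lazy view of observations 3, 4 and 5
```

In a real implementation the integer method would call `getobs` instead of plain indexing, and the element type would be carried as a type parameter so `eltype` remains knowable, as suggested above.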

I realize that for large datasets (the main use case for DataLoaders) resampling is often a simple holdout. However, because Holdout is implemented as a special case of more general resampling strategies (CV, etc.), it would be rather messy to add DataLoader support for just that case without the slicing feature.

Would there be any sympathy among current developers for such an enhancement? Perhaps there is an alternative solution to my issue?

BTW, I don't really see how the suggestion in the docs to apply observation resampling before the data is wrapped in a DataLoader could work effectively in the MLJ context, as the idea is that resampling should remain completely automated. (It also seems from the documentation that this requires bringing the data into memory...?) But maybe I'm missing something there.
