
Comments (6)

sangmichaelxie commented on August 27, 2024

Thanks for the question! The copy is mainly there to protect against writing into the original memmapped data array, since PyTorch doesn't support immutable tensors. If you're sure the algorithm won't do any in-place operations or write into the input tensor (as when `no_nl=True`), then it should be safe to skip the copy. But it isn't safe in general, so for now we'd like to keep the copy.
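To illustrate why the copy matters: `torch.from_numpy` shares memory with the numpy array, so an in-place op on the tensor writes through to the underlying memmap. A minimal sketch (toy file here, not the actual dataset):

```python
import numpy as np
import torch

# Toy stand-in for the dataset file; the real array is a large on-disk .npy.
np.save("toy.npy", np.zeros((4, 3), dtype=np.float32))
arr = np.load("toy.npy", mmap_mode="r")  # read-only memmap

# torch.from_numpy would share memory with `arr`, so an in-place op on the
# tensor would try to write through to the memmapped file. Copying first
# gives the tensor its own writable buffer.
x = torch.from_numpy(arr[0].copy())
x += 1.0  # safe: mutates the copy, not the on-disk array
print(x)
```
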

My other thought is that loading should scale with the number of data loader worker processes. Have you tried just increasing num_workers in the data loader?

from wilds.

dmadras commented on August 27, 2024

Yeah, totally agree it makes sense to be copying! I'm trying to increase the number of workers now - it looks like the time per minibatch stays roughly constant as num_workers increases (is this the keyword argument you meant?). Loading times become quite uneven between minibatches as this number grows, so I sometimes had to aggregate times across sets of minibatches:

num_workers = 0 -> 5-6s for 1 minibatch
num_workers = 1 -> 6-7s for 1 minibatch
num_workers = 2 -> 14-15s for 2 minibatches
num_workers = 3 -> 27-30s for 3 minibatches
num_workers = 4 -> 32-40s for 4 minibatches
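For reference, a minimal way to reproduce this kind of measurement (the dataset and per-item delay below are synthetic stand-ins, not the actual WILDS loader):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

# Synthetic stand-in for the memmapped imagery: the per-item sleep mimics
# slow disk reads so that any scaling with num_workers becomes visible.
class SlowDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, i):
        time.sleep(0.01)  # pretend per-example I/O cost
        return torch.zeros(8)

if __name__ == "__main__":
    for num_workers in (0, 2, 4):
        loader = DataLoader(SlowDataset(), batch_size=16, num_workers=num_workers)
        start = time.time()
        for _ in loader:
            pass
        print(f"num_workers={num_workers}: {time.time() - start:.2f}s per epoch")
```
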

That said, I'm just testing now; maybe once I get it up and running with a model this will become less noticeable (especially in the multi-worker setting), but it's not clear yet. I'd also be curious whether you have any recommendations for other ways of loading the data (possibly other formats you tried?).


sangmichaelxie commented on August 27, 2024

Thanks for doing the test with num_workers. Hm, I'll look into this soon myself and get back to you on faster loading methods. The numpy file is about 30 GB, so in our tests (on machines with enough RAM) we were able to fit the whole array in memory and set cache_counter to 9999999 - in that case, after the first epoch everything is cached by numpy, I believe. If you can increase your RAM, that might help. This isn't a solution for everyone, of course.


dmadras commented on August 27, 2024

Okay, thanks - I haven't tried playing with cache_counter yet; I'll take a look and see how high I can set it.


sangmichaelxie commented on August 27, 2024

Sorry, I misspoke - I meant the variable called cache_size, not cache_counter. If your RAM is large enough to fit the entire file, you may want to pick up the change from this commit in one of the open PRs: d308c0b
That way, once cache_size is set to a large number, it won't try to refresh the cache anymore.
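A rough sketch of the idea (all names hypothetical; this is not the actual WILDS code or the linked commit): count items served from the current memmap handle and only re-open it once the count exceeds cache_size, so a very large cache_size means the handle is never refreshed and the OS page cache stays warm after the first epoch.

```python
import numpy as np

class MemmapCache:
    """Hypothetical illustration of a cache_size-guarded memmap refresh."""

    def __init__(self, path, cache_size):
        self.path = path
        self.cache_size = cache_size
        self.counter = 0
        self.arr = np.load(path, mmap_mode="r")

    def get(self, i):
        self.counter += 1
        if self.counter > self.cache_size:
            # Refreshing re-opens the file and drops the warm handle; with a
            # huge cache_size this branch never runs.
            self.arr = np.load(self.path, mmap_mode="r")
            self.counter = 0
        return self.arr[i].copy()

# Tiny demo with a toy file.
np.save("toy_cache.npy", np.arange(6, dtype=np.float32).reshape(3, 2))
cache = MemmapCache("toy_cache.npy", cache_size=10**9)
print(cache.get(1))  # prints [2. 3.]
```
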


ssagawa commented on August 27, 2024

Thanks again for flagging this issue and for your patience! We have updated PovertyMap-WILDS in our v1.1 release to address this. We no longer use memmap; instead, we've losslessly converted the .npy files into individual compressed .npz files to improve disk I/O and memory usage. The underlying data is the same, so results should not be affected. Thank you!
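The conversion described above can be sketched as follows (array sizes and file names here are illustrative, not the actual dataset layout). `np.savez_compressed` is lossless, so each per-example file round-trips exactly:

```python
import numpy as np

# Stand-in for the full on-disk array (the real file is ~30 GB).
big = np.random.rand(10, 8, 8).astype(np.float32)
np.save("images.npy", big)

# Read via memmap and write one compressed .npz per example, so that
# loading example i later touches only its own small file.
arr = np.load("images.npy", mmap_mode="r")
for i in range(arr.shape[0]):
    np.savez_compressed(f"img_{i}.npz", x=np.array(arr[i]))

# Lossless round-trip check for one example.
x0 = np.load("img_0.npz")["x"]
assert np.array_equal(x0, big[0])
```
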

