
Comments (6)

sangmichaelxie commented on August 27, 2024

Thanks for the question! The copy is mainly there to protect against writing into the original memmapped data array, since PyTorch doesn't support immutable tensors. If you're sure the algorithm won't do any in-place operations or write into the input tensor (as when `no_nl=True`), then it should be safe to skip the copy. But it isn't safe in general, so for now we'd like to keep the copy.
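To illustrate why the copy matters: `torch.from_numpy` shares memory with the numpy array, so an in-place op on the tensor writes through to the underlying memmap. A minimal sketch (toy file here, not the actual dataset):

```python
import numpy as np
import torch

# Toy stand-in for the dataset file; the real array is a large on-disk .npy.
np.save("toy.npy", np.zeros((4, 3), dtype=np.float32))
arr = np.load("toy.npy", mmap_mode="r")  # read-only memmap

# torch.from_numpy would share memory with `arr`, so an in-place op on the
# tensor would try to write through to the memmapped file. Copying first
# gives the tensor its own writable buffer.
x = torch.from_numpy(arr[0].copy())
x += 1.0  # safe: mutates the copy, not the on-disk array
print(x)
```
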

My other thought is that loading should scale with the number of data loader worker processes. Have you tried just increasing num_workers in the data loader?

from wilds.

dmadras commented on August 27, 2024

Yeah, totally agree it makes sense to be copying! I'm trying to increase the number of workers now - it looks like the time per minibatch stays roughly constant as num_workers increases (is this the keyword argument you meant?). Loading times become quite uneven between minibatches as this number grows, so I sometimes had to aggregate times across sets of minibatches:

num_workers = 0 -> 5-6s for 1 minibatch
num_workers = 1 -> 6-7s for 1 minibatch
num_workers = 2 -> 14-15s for 2 minibatches
num_workers = 3 -> 27-30s for 3 minibatches
num_workers = 4 -> 32-40s for 4 minibatches
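For reference, a minimal way to reproduce this kind of measurement (the dataset and per-item delay below are synthetic stand-ins, not the actual WILDS loader):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

# Synthetic stand-in for the memmapped imagery: the per-item sleep mimics
# slow disk reads so that any scaling with num_workers becomes visible.
class SlowDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, i):
        time.sleep(0.01)  # pretend per-example I/O cost
        return torch.zeros(8)

if __name__ == "__main__":
    for num_workers in (0, 2, 4):
        loader = DataLoader(SlowDataset(), batch_size=16, num_workers=num_workers)
        start = time.time()
        for _ in loader:
            pass
        print(f"num_workers={num_workers}: {time.time() - start:.2f}s per epoch")
```
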

That said, I'm just testing now; maybe once I get it up and running with a model this will become less noticeable (especially in the multi-worker setting), but it's not clear yet. I'd also be curious whether you have any recommendations for other ways of loading the data (possibly other formats you tried?).


sangmichaelxie commented on August 27, 2024

Thanks for doing the test with num_workers. Hm, I'll look into this soon myself and get back to you on faster loading methods. The numpy file is about 30 GB, so in our tests (on machines with enough RAM) we were able to fit the whole array in memory and set cache_counter to 9999999 - in that case, after the first epoch everything is cached by numpy, I believe. If you can increase your RAM, that might help. This isn't a solution for everyone, of course.


dmadras commented on August 27, 2024

Okay, thanks - I haven't tried playing with cache_counter yet; I'll take a look and see how high I can set it.


sangmichaelxie commented on August 27, 2024

Sorry, I misspoke - I meant the variable called cache_size, not cache_counter. If your RAM is large enough to fit the entire file, you may want to pick up the change from this commit in one of the open PRs: d308c0b
That way, once cache_size is set to a large number, it won't try to refresh the cache anymore.
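A rough sketch of the idea (all names hypothetical; this is not the actual WILDS code or the linked commit): count items served from the current memmap handle and only re-open it once the count exceeds cache_size, so a very large cache_size means the handle is never refreshed and the OS page cache stays warm after the first epoch.

```python
import numpy as np

class MemmapCache:
    """Hypothetical illustration of a cache_size-guarded memmap refresh."""

    def __init__(self, path, cache_size):
        self.path = path
        self.cache_size = cache_size
        self.counter = 0
        self.arr = np.load(path, mmap_mode="r")

    def get(self, i):
        self.counter += 1
        if self.counter > self.cache_size:
            # Refreshing re-opens the file and drops the warm handle; with a
            # huge cache_size this branch never runs.
            self.arr = np.load(self.path, mmap_mode="r")
            self.counter = 0
        return self.arr[i].copy()

# Tiny demo with a toy file.
np.save("toy_cache.npy", np.arange(6, dtype=np.float32).reshape(3, 2))
cache = MemmapCache("toy_cache.npy", cache_size=10**9)
print(cache.get(1))  # prints [2. 3.]
```
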


ssagawa commented on August 27, 2024

Thanks again for flagging this issue and for your patience! We have updated PovertyMap-WILDS in our v1.1 release to address this. We no longer use memmap; instead, we've losslessly converted the .npy files into individual compressed .npz files to improve disk I/O and memory usage. The underlying data is the same, so results should not be affected. Thank you!
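The conversion described above can be sketched as follows (array sizes and file names here are illustrative, not the actual dataset layout). `np.savez_compressed` is lossless, so each per-example file round-trips exactly:

```python
import numpy as np

# Stand-in for the full on-disk array (the real file is ~30 GB).
big = np.random.rand(10, 8, 8).astype(np.float32)
np.save("images.npy", big)

# Read via memmap and write one compressed .npz per example, so that
# loading example i later touches only its own small file.
arr = np.load("images.npy", mmap_mode="r")
for i in range(arr.shape[0]):
    np.savez_compressed(f"img_{i}.npz", x=np.array(arr[i]))

# Lossless round-trip check for one example.
x0 = np.load("img_0.npz")["x"]
assert np.array_equal(x0, big[0])
```
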

