A tiny library for stochastic dataset caching in PyTorch.

Home Page: https://charl-ai.github.io/blog/dataloaders

License: MIT License



Stochastic Caching

Is your training pipeline data-bottlenecked? Are you looking for a zero-effort speedup? You've come to the right place!

Introducing stocaching, a tiny (1 file) library for stochastic dataset caching in PyTorch.

See this blog post if you want to understand the benefits, motivation, and decisions behind the library.

Feel free to simply copy-paste the module into your projects! Alternatively, you can install with pip:

pip install stocaching

Usage

Adding stochastic caching to your existing datasets is dead simple. Just follow these two steps:

  1. Create a SharedCache object in the __init__ method of your dataset. You tell SharedCache about the size of your dataset and the maximum amount of space you want the cache to take up. SharedCache then calculates the maximum number of samples that can fit, and allocates that many slots to store data in.

  2. In the __getitem__ method of your dataset, interact with the SharedCache object to either read the data from the cache (if it has already been cached), or write the data to the cache (if it has not yet been cached).

You can get and set items in the cache with x = cache[idx] and cache[idx] = x. Picture the cache as a list-like structure with a slot for each sample.

When the dataset is too large to cache completely, len(cache) < len(dataset). If you used the getter and setter directly, you would end up with lots of fiddly code checking whether idx is in bounds for the cache. We therefore provide two convenience methods, get_slot and set_slot, which let you treat the cache as if it had the same length as the dataset. Calling get_slot out of bounds of the cache simply returns None, and calling set_slot out of bounds is a no-op. These methods are designed to minimise the amount of code you need to write in the __getitem__ method of your dataset.

Advanced: Internally, the cache is simply a single PyTorch tensor, backed by shared memory. You can access the underlying tensor with the array property. We also keep an auxiliary array in shared memory, which tracks which samples have been cached, which are yet to be cached, and which are out of bounds of the cache. You can access it directly with the aux_array property.
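For illustration, here is a rough sketch of poking at these internals (the sizes are hypothetical, and the exact encoding of the auxiliary array is an implementation detail):

import torch
from stocaching import SharedCache

# Hypothetical sizes, purely for illustration.
cache = SharedCache(
    size_limit_gib=1,
    dataset_len=10_000,
    data_dims=(3, 32, 32),
    dtype=torch.uint8,
)

print(len(cache))  # number of slots actually allocated (<= dataset_len)

cache[0] = torch.zeros(3, 32, 32, dtype=torch.uint8)  # direct setter (idx must be in bounds)
x = cache[0]                                          # direct getter

print(cache.array.shape)      # underlying shared-memory tensor, one slot per cacheable sample
print(cache.aux_array.shape)  # bookkeeping: cached / not yet cached / out-of-bounds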

Example

import torch
from stocaching import SharedCache
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self):
        super().__init__()

        ... # set up dataset

        dataset_len = N   # number of samples in the full dataset
        data_dims = (C, H, W)   # data dims (not including batch)

        # initialize the cache
        self.cache = SharedCache(
            size_limit_gib=32,
            dataset_len=dataset_len,
            data_dims=data_dims,
            dtype=torch.uint8,
        )

    def __getitem__(self, idx):
        # retrieve data from cache if it's there
        x = self.cache.get_slot(idx)
        # x will be None if the cache slot was empty or OOB
        if x is None:
            x = ... # load data to uint8 tensor from disk
            self.cache.set_slot(idx, x) # try to cache x
        return x
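Because the cache lives in shared memory, it is filled and read by every dataloader worker rather than being duplicated per worker. Below is a self-contained toy sketch (all sizes are made up, and random uint8 "images" stand in for data loaded from disk):

import torch
from torch.utils.data import DataLoader, Dataset
from stocaching import SharedCache

class ToyDataset(Dataset):
    def __init__(self, n: int = 1_000):
        super().__init__()
        self.n = n
        self.cache = SharedCache(
            size_limit_gib=1,
            dataset_len=n,
            data_dims=(3, 64, 64),
            dtype=torch.uint8,
        )

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        x = self.cache.get_slot(idx)
        if x is None:
            # stands in for an expensive disk read / decode
            x = torch.randint(0, 256, (3, 64, 64), dtype=torch.uint8)
            self.cache.set_slot(idx, x)
        return x

if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), batch_size=32, num_workers=4)
    for epoch in range(2):  # epoch 0 fills the cache, epoch 1 reads from it
        for batch in loader:
            pass  # model forward/backward would go here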

Benchmarks

We run some basic benchmarks for stochastic caching under a realistic workload -- single-GPU image classification.

We train mobilenet_v3_small on a 50k sample dataset for two epochs. The reason we use such a tiny model is to ensure that we are in the dataloading-bottlenecked regime. In the first epoch, the stochastic cache is being lazily filled. In the second epoch, the cache is being read from.

We perform two sets of experiments: one with the data on a local HDD, and one with the data read from another machine over the network. All experiments are run on the same machine: an RTX 3090 GPU and a 10th-gen Intel i9 CPU.

In all epochs apart from the first, stochastic caching gives a speedup that scales linearly with the percentage of the dataset being cached. There is a very small overhead in the first epoch (due to filling the cache), but by the end of the second epoch, the speedup from caching more than compensates for this.

Figures: benchmark results for the local HDD and remote data experiments.

FAQ

How much memory should I allocate to the cache?

As much as you like! The speedup from caching scales linearly with the % of your dataset being cached.

The shared memory is stored in /dev/shm (tmpfs), so this is likely the limiting factor for you. We provide a convenience function get_shm_size to check how large it is. Alternatively, check with df -h.
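For example (a rough sketch, assuming get_shm_size reports the size in GiB):

from stocaching import get_shm_size

shm_gib = get_shm_size()  # assumed to return the size of /dev/shm in GiB
print(f"/dev/shm: {shm_gib} GiB")

# Leave some headroom for anything else using shared memory.
size_limit_gib = int(0.8 * shm_gib)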

Most Linux systems pre-set /dev/shm to 50% of your RAM. You can temporarily resize it (e.g. to 128 GiB) by running (as root): mount -o remount,size=128G /dev/shm (warning: do this at your own risk).

How does this interact with augmentations/transforms?

Generally, you don't want to apply any random augmentations before caching, because caching the augmented output freezes the randomness: every epoch would see the same 'random' version of each sample. It's also a good idea to cache data in uint8 format (instead of float32) to save space.

Splitting your transforms/augmentation pipeline into two phases is a good idea. The first phase converts your data to a (possibly resized) uint8 tensor. The output of this phase gets cached. The second phase should do random augmentations, convert to float32, and normalise. This phase happens 'on-line' and the output goes straight into your model.

For an example of how to do this properly, see the implementation in benchmark/dataset.py. You can also read the blog post for more information.
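As a quick sketch of the two-phase split (torchvision is assumed here, and these transforms are placeholders rather than the ones used in the benchmark):

import torch
from torchvision import transforms

# Phase 1: deterministic, runs only on a cache miss; output is a uint8 tensor.
pre_cache = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.PILToTensor(),  # PIL image -> uint8 tensor, (C, H, W)
])

# Phase 2: random augs, float conversion, normalisation; runs every epoch, never cached.
post_cache = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ConvertImageDtype(torch.float32),  # uint8 -> float32 in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Inside your dataset's __getitem__ (cf. the Example above):
def __getitem__(self, idx):
    x = self.cache.get_slot(idx)
    if x is None:
        img = ...  # load PIL image from disk
        x = pre_cache(img)  # uint8 tensor: this is what gets cached
        self.cache.set_slot(idx, x)
    return post_cache(x)  # random augs + float32 applied on-line, never cached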

Does this work with multi-GPU (DDP) training?

Almost. I'll push an update to support it soon.

How do I reproduce the benchmarks?

If you feel like it, please reproduce these benchmarks on your setup!

We benchmark the method with the minimal example in the benchmark/ dir. You may perform a single benchmark run like so:

python run_benchmark.py --data-dir="your-data-dir" --cache-limit-gib="your-cache-limit"

Set data-dir to a location on an HDD or network drive. The first time you run the code, a random dataset will be generated in this directory. Set cache-limit-gib to 0 to disable caching, or to an integer less than the size of /dev/shm.

By default, the benchmark script generates a dummy dataset of 50k (3, 512, 512) JPEG images. This takes around 7.5 GiB on disk; around 9 GiB of shared memory is needed to fully cache the dataset.

All our benchmarks use the default hyperparameters specified in run_benchmark.py: {batch_size: 256, num_workers: 8, pin_memory: True}.

You can run the entire benchmark sweep like so:

./run_sweep.sh "YOUR_DATA_DIR" "YOUR_OUTPUT_DIR"

You may then reproduce our plots by running:

python analyse_sweep.py --csv_dir="WHERE_YOU_SAVED_THE_OUTPUTS" --fig-save-dir="assets/"


stochastic-caching's Issues

Question about dynamic `data_dim`

Thanks for this great library!
In my application (molecular modelling), each data point is a molecule encoded as a SMILES string.

Each molecule has a different number of atoms, and a library I use extracts the 3D coordinates of each atom, which is what I want to cache.

Therefore, the data dims for each sample are dynamic: (num_atoms, 3), where num_atoms varies from molecule to molecule. Is there a way to use stocaching for this kind of data?

At first, I thought of using naive dictionary caching, but as you mention in the blog, that approach will not work with multi-worker dataloaders. Another way is to use num_workers=0 for the first epoch, as mentioned here:
https://discuss.pytorch.org/t/best-practice-to-cache-the-entire-dataset-during-first-epoch/19608/2
