datapipes

This is a proof-of-concept repository showing how torch.utils.data.datapipes can be used as the basis for torchvision.datasets.

General observations

  • pathlib.Path should be a first-class citizen for paths.
  • dp.iter.LoadFilesFromDisk should have a mode parameter. Forcing rb makes it cumbersome to read plain text files. An opener parameter that defaults to open and respects mode might be even better.
  • Files opened by get_file_binaries_from_pathnames, which dp.iter.LoadFilesFromDisk uses, are never closed.
  • dp.iter.RoutedDecoder only accepts (path, buffer) inputs, which makes it unusable for us: our datasets return a buffer together with some additional information.
  • It feels weird to call dp.iter.LoadFilesFromDisk for a single file, which is usually the case for our datasets.
  • We should be able to read specific files from an archive. I'm aware that this is not possible when streaming archives, but it should be otherwise. Some datasets store metadata in a separate file that should be available as soon as we create the dataset, rather than by luck whenever it happens to be streamed along with the other files.
  • dp.iter.Map expects an IterDataPipe, whereas the other datapipes accept a more general Iterable.
  • Instead of ReadFilesFrom(Tar|Zip), there should be a ReadFilesFromArchive that automatically detects the underlying archive type.
  • dp.iter.ReadFilesFrom(Tar|Zip) should be split into ListFilesIn(Tar|Zip) and LoadFilesFrom(Tar|Zip). Most datasets define splits, so only part of the data has to be loaded at all. It would be a good idea to drop unused files before we load them.
  • For some reason, dp.iter.ReadFilesFrom(Tar|Zip) returns the files in reverse alphabetical order. This makes it awkward to align them with corresponding text files, which are usually read from top to bottom.
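
The mode and opener suggestions, together with the unclosed-file problem, could be addressed roughly as follows. This is only a sketch in plain Python, not an actual datapipe: the name load_files and its signature are hypothetical and are not part of torch.utils.data.datapipes.

```python
import tempfile
from pathlib import Path
from typing import IO, Callable, Iterable, Iterator, Tuple, Union


def load_files(
    paths: Iterable[Union[str, Path]],
    mode: str = "rb",
    opener: Callable[..., IO] = open,
) -> Iterator[Tuple[Path, IO]]:
    """Yield (path, file) pairs; hypothetical stand-in for LoadFilesFromDisk.

    Each file is closed as soon as the consumer asks for the next item,
    or when the generator is exhausted or closed.
    """
    for path in map(Path, paths):
        with opener(path, mode) as file:
            yield path, file


# Usage: read a plain text file without being forced into "rb".
with tempfile.TemporaryDirectory() as root:
    path = Path(root) / "labels.txt"
    path.write_text("cat\ndog\n")

    handles = []
    lines = []
    for _, file in load_files([path], mode="r"):
        handles.append(file)
        lines.extend(line.strip() for line in file)
```

Accepting pathlib.Path as well as str also covers the first observation. A different opener, for example gzip.open, could be passed in without changing the loader, assuming it takes the same (path, mode) arguments.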
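
The proposed split into listing and loading, combined with automatic archive detection, might look like the sketch below. It uses only the standard library zipfile and tarfile modules and assumes the archive is a local file rather than a stream; list_files_in_archive and load_files_from_archive are hypothetical names, not existing datapipes.

```python
import io
import tarfile
import tempfile
import zipfile
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def list_files_in_archive(archive: Path) -> Iterator[str]:
    """Yield member names in alphabetical order without reading any content."""
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            yield from sorted(i.filename for i in zf.infolist() if not i.is_dir())
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            yield from sorted(m.name for m in tf.getmembers() if m.isfile())
    else:
        raise ValueError(f"Unsupported archive type: {archive}")


def load_files_from_archive(
    archive: Path, names: Iterable[str]
) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, content) only for the requested members."""
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            for name in names:
                yield name, zf.read(name)
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            for name in names:
                extracted = tf.extractfile(name)
                if extracted is None:  # not a regular file
                    continue
                with extracted:
                    yield name, extracted.read()
    else:
        raise ValueError(f"Unsupported archive type: {archive}")


# Usage: build one zip and one tar archive, then load only the "train" split.
with tempfile.TemporaryDirectory() as root:
    members = {"test/b.txt": b"b", "train/a.txt": b"a", "train/c.txt": b"c"}

    zip_path = Path(root) / "data.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)

    tar_path = Path(root) / "data.tar"
    with tarfile.open(tar_path, "w") as tf:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))

    results = {}
    for archive in (zip_path, tar_path):
        train = [n for n in list_files_in_archive(archive) if n.startswith("train/")]
        results[archive.suffix] = list(load_files_from_archive(archive, train))
```

Filtering happens on names only, so members outside the selected split are never read. The sketch opens the archive twice, once to list and once to load, which is the price of keeping the two steps separate.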

Datasets

Legend:

  • ✔️ : Fully working
  • ⭕ : Working, but with a significant performance hit
  • ❌ : Not working

For ⭕ and ❌, please check out the README.md in the corresponding folder for details.

torchvision.datasets.            Status
Caltech101                       ✔️
Caltech256                       ✔️
CelebA                           ✔️
CIFAR10 / CIFAR100               ✔️
CocoDetection / CocoCaptions     ✔️
VOCDetection / VOCSegmentation   ✔️
LSUN
ImageNet                         ✔️
HMDB51                           ✔️

Notes

  • So far, I think the best approach for datasets with related files is to have each individual datapipe yield a key for the data point along with the data.
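
As an illustration of the keyed approach, the sketch below joins two iterables of (key, data) pairs even when they arrive in different orders. It uses plain Python generators in place of datapipes; key_by_stem and join_by_key are hypothetical helpers, and using the file stem as the key is an assumption that will not fit every dataset.

```python
from itertools import zip_longest
from pathlib import PurePosixPath
from typing import Any, Dict, Hashable, Iterable, Iterator, Tuple

_MISSING = object()


def key_by_stem(items: Iterable[Tuple[str, Any]]) -> Iterator[Tuple[str, Any]]:
    """Turn (path, data) into (key, data), using the file stem as the key."""
    for path, data in items:
        yield PurePosixPath(path).stem, data


def join_by_key(
    left: Iterable[Tuple[Hashable, Any]],
    right: Iterable[Tuple[Hashable, Any]],
) -> Iterator[Tuple[Hashable, Any, Any]]:
    """Yield (key, left_data, right_data) as soon as both sides have been seen.

    Items are buffered until their partner arrives. Items that never find a
    partner stay in the buffers and are silently dropped.
    """
    left_buffer: Dict[Hashable, Any] = {}
    right_buffer: Dict[Hashable, Any] = {}
    for left_item, right_item in zip_longest(left, right, fillvalue=_MISSING):
        if left_item is not _MISSING:
            key, value = left_item
            if key in right_buffer:
                yield key, value, right_buffer.pop(key)
            else:
                left_buffer[key] = value
        if right_item is not _MISSING:
            key, value = right_item
            if key in left_buffer:
                yield key, left_buffer.pop(key), value
            else:
                right_buffer[key] = value


# Usage: images arrive in reverse alphabetical order, annotations do not.
images = [("img_2.jpg", b"two"), ("img_1.jpg", b"one")]
annotations = [("img_1.txt", "cat"), ("img_2.txt", "dog")]

samples = list(join_by_key(key_by_stem(images), key_by_stem(annotations)))
```

Because matching is done by key rather than by position, the order in which an archive yields its files no longer has to agree with the order of the accompanying text files. The cost is the buffering of unmatched items, which grows with how far the two orders diverge.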

Contributors

  • pmeier
