
Comments (40)

rizar avatar rizar commented on June 15, 2024

Guys, do you want to make HDF5 the only supported format in Fuel? I am not very happy to hear that. Can I ask what the reasons behind this decision are?


vdumoulin avatar vdumoulin commented on June 15, 2024

My understanding is that we would still support everything that's supported at the moment, but we would use HDF5 as the data format for the built-in datasets.


vdumoulin avatar vdumoulin commented on June 15, 2024

Here are some questions to stimulate the discussion:

  1. h5py or pytables?
    1. Do we commit to support both or only the package we use for built-in datasets?
    2. What package will be used for built-in datasets?
  2. What convention will we use for built-in datasets?
    1. Do we use one file per dataset split (e.g. dataset_train.hdf5, dataset_test.hdf5), or do we regroup splits in one file as primary subgroups?
    2. Do we adopt the convention that sources are named after their array name, or do we reserve a special attribute for source names?
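
For concreteness, a minimal h5py sketch of the two layout options from the second question (file names and data are placeholders, not a decided convention):

import h5py
import numpy

features = numpy.zeros((10, 784), dtype='uint8')  # dummy source array

# Option A: one file per split
with h5py.File('dataset_train.hdf5', 'w') as f:
    f.create_dataset('features', data=features)

# Option B: one file, splits as primary subgroups, sources named after their arrays
# (the alternative would be a reserved attribute listing the source names)
with h5py.File('dataset.hdf5', 'w') as f:
    f.create_group('train').create_dataset('features', data=features)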


dwf avatar dwf commented on June 15, 2024

h5py seems to be the cleaner option, and has better exposure of the low-level API, so that e.g. if we want to speed things up, we can write custom Cython code that calls into their Cython code, without incurring the speed penalty of Python function calls and object unboxing. This is not a priority right now, but it leaves the door open to performance-oriented optimizations, whereas pytables probably makes this an awful lot harder. On the other hand, pytables has some features that h5py does not: very fast compression, fast indexing and lookup in table-oriented structures, etc. I am not sure I see a use case for the latter but I could be wrong; one thing might be if it supports accessing non-contiguous batches in a faster manner or something. It may be worth retaining some optional bits that require one while standardizing on the other for most cases.

The train vs. test thing is interesting because it points to a bigger problem: how to effectively share file handles between dataset instances. I have a sneaking suspicion that one or both of the libraries may react negatively to the same file being opened twice by the same program (it might be fine if they're opened in read-only mode, though), and there may be performance penalties for doing so. So in the case where you split the canonical training set into train and valid, we have to determine whether the multiple-opening thing is an issue and address it if so. This points to using a delegated object rather than inheriting, though.

I'm unopinionated about the naming issue, though adopting the convention of naming the HDF5 entity after the source name seems like the simplest option.

One more thing to add is that there are lots of HDF5 hyperparameters that affect data access speed and so on, and my opinion is that we should be opinionated about the defaults but allow overriding them in fuel-convert.
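
A minimal sketch of what such overridable defaults could look like, assuming an h5py-based converter; the chunk shape and compression settings below are illustrative placeholders, not tuned recommendations:

import h5py

# Hypothetical helper a converter script could use: opinionated defaults for
# chunking and compression, overridable by the caller (e.g. via fuel-convert flags).
def create_source(h5file, name, shape, dtype='uint8',
                  chunks=None, compression='gzip', compression_opts=4):
    if chunks is None:
        chunks = (128,) + tuple(shape[1:])  # chunk along the example axis by default
    return h5file.create_dataset(
        name, shape=shape, dtype=dtype, chunks=chunks,
        compression=compression, compression_opts=compression_opts)

with h5py.File('example.hdf5', 'w') as h5file:  # placeholder file name
    create_source(h5file, 'features', shape=(70000, 784))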


bartvm avatar bartvm commented on June 15, 2024

@rizar The reasoning is that datasets are distributed in a wide array of formats. Some of these are problematic or very slow to read, especially when we can't load them into memory (e.g. large CSV files, large image datasets distributed as individual files). If we try to support all of them we will end up with a confused mixture of datasets, with varying levels of efficiency and support for random access, subsets, loading in-memory or out-of-core, etc.

So the proposal is then to prefer an approach where we convert datasets to HDF5 instead. This gives us a variety of advantages:

  • The user can choose whether to load the entire dataset into memory, or to read HDF5 from disk (which automatically comes with buffered reading and, at least in PyTables, with intelligent caching)
  • Ability to load subsets of the file into memory
  • One well-tested and feature-rich dataset handler instead of having one for CSV, one for NumPy files, one for images, etc. We will still support these of course, but we don't have to support every single feature.
  • HDF5 allows us to tag dimensions, which could be useful when implementing #13

@vdumoulin I prefer h5py as well, at least to begin with, because it seems cleaner. But in the long term we could maybe support both so that people can make use of some of PyTables' advanced features like Blosc compression.

I am inclined to put all sets in a single file. If we run into a dataset where this is really problematic, for some particular reason, we might still be able to solve that using the external links feature.

@dwf Opening the same file multiple times does indeed seem to be problematic in some cases (see here and here). That's a bit annoying, but I think it's relatively straightforward to solve. In the low-level H5F API there is a function, h5py.h5f.get_obj_ids(), that returns all open HDF5 file objects. So something along these lines should be enough:

import h5py

dataset_filename = 'test.hdf5'
# Re-use an already-open copy of the file if one exists; otherwise open it.
for dataset in h5py.h5f.get_obj_ids():
    if dataset.name == dataset_filename:
        break  # `dataset` is then the low-level FileID of the already-open file
else:
    dataset = h5py.File(dataset_filename)


vdumoulin avatar vdumoulin commented on June 15, 2024

@bartvm If we re-use the same file object multiple times we may run into problems when closing it.

For instance, say I have a training and a test set stored in one single file under two different subgroups. If I call close on the training set's file handle, then all operations on the test set file handle will raise errors.


bartvm avatar bartvm commented on June 15, 2024

There are two ways around this that I see:

  1. Don't close it. h5py automatically closes files when the script exits; we could just rely on that behaviour, but I guess it's not very clean.
  2. Implement a very crude reference counter; basically, just extend my previous code with
dataset.ref_count = getattr(dataset, 'ref_count', 0) + 1

and instead of doing dataset.close() we do

dataset.ref_count -= 1
if not dataset.ref_count:
    dataset.close()



rizar avatar rizar commented on June 15, 2024

@bartvm, okay, those are good reasons.

I would also add one: at the conversion stage we can use libraries that have issues with pickling, such as into.

But I was scared by the following item on @dwf's todo list:

Refactor loading logic into low-level file format utilities for loading original source data.

I understood it as getting rid of all the datasets we have and moving their logic into conversion scripts. I think that is too radical, and we should keep the datasets we have. Moreover, if we are about to support some new format via conversion, we'd better have the conversion logic available as a simple Dataset class. Even without nice features such as caching or picklability in the middle of an epoch, it can be of some use. It seems to me that supporting the IndexableDataset and IterableDataset interfaces is not that hard. The conversion scripts will then be able to use these dataset interfaces.
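
As a rough illustration of that last point, here is a hedged sketch of a conversion helper that relies only on a generic "iterable of batches" interface and appends to a resizable HDF5 node; all names, shapes, and chunk sizes are illustrative, not Fuel's actual API:

import h5py
import numpy

def convert_to_hdf5(batches, output_path, source_name='features', dim=784):
    # `batches` can be anything that yields 2D numpy-convertible arrays,
    # e.g. the result of iterating over an IterableDataset-style object.
    with h5py.File(output_path, 'w') as h5file:
        dset = h5file.create_dataset(source_name, shape=(0, dim),
                                     maxshape=(None, dim), dtype='uint8',
                                     chunks=(1024, dim))
        for batch in batches:
            batch = numpy.asarray(batch)
            old_size = dset.shape[0]
            dset.resize(old_size + batch.shape[0], axis=0)
            dset[old_size:] = batch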


bartvm avatar bartvm commented on June 15, 2024

@rizar Good point about into. Could be very useful and make conversion very straightforward.

I don't think there are plans of getting rid of all dataset classes! There's a confusion here, I think, between "datasets" referring to the publicly available, file-based ones (MNIST, CIFAR-10, etc.) and "datasets" the classes in Fuel. @dwf was talking about the former, i.e. instead of writing a CIFAR10(Dataset) class that unpickles CIFAR-10, we write a script that turns CIFAR-10 into HDF5 and load that. However, we should obviously keep Fuel datasets like IndexableDataset and IterableDataset, allowing people to train on Python containers/iterators. I think it is also a good idea to still have datasets like TextDataset, CSVDataset, ImageDataset, NumPyFileDataset, etc. However, we don't need these to be super-efficient or to support every possible feature right now. My idea is that there are really three ways of creating a Fuel dataset:

  • If you have a Python container or iterable, just use IndexableDataset or IterableDataset. Perfect for derived classes to use, for toy datasets, or for tests.
  • For quick and dirty experiments on data you have lying around, you can use e.g. CSVDataset, ImageDataset, etc. to read in data directly. However, these can be slow: you need to load everything into memory, they won't support start and stop, maybe no random access, etc.
  • Built-in datasets like MNIST and CIFAR-10 will be in HDF5, and we will provide conversion scripts (and a user can obviously easily convert their own dataset to HDF5 as well). We'll try to make sure that these classes iterate as fast as possible. The user has a range of options such as loading subsets, loading into memory or streaming from disk, random access, and eventually we could support high-speed compression, etc.

@dwf Yeah, a class attribute would work too. I just figured it might be safer to keep the reference counter on the HDF5 file object directly, just in case multiple classes access the same file for some reason.


rizar avatar rizar commented on June 15, 2024

Good that we agree regarding TextDataset, CSVDataset and so on: I also think that we should keep those.

But also, when we have the loading logic for a custom (MNIST, CIFAR) dataset, it can be wrapped into a Dataset with very little overhead. Just iteration logic is enough to make an IterableDataset; simply loading into memory makes a small dataset an instance of IndexableDataset. I am not sure that we should delete the current MNIST and CIFAR classes and move the code from the respective files into something like convert_mnist.py or convert_cifar.py.


bartvm avatar bartvm commented on June 15, 2024

What's the point of all this HDF5 stuff if we don't use it for any dataset? MNIST should become something like

class MNIST(HDF5Dataset):
    file = 'mnist.hdf5'

And people will need to run fuel-convert --mnist once in order to create this file. The file-reading logic that is now in MNIST would go to this conversion script.

For MNIST this seems overkill, but it's just for consistency. For larger datasets (SVHN) or datasets stored in crappy file formats (binarised MNIST) this is already worthwhile.

vdumoulin avatar vdumoulin commented on June 15, 2024

@bartvm The counter should probably go in the low-level FileID object, since that's what's returned by get_obj_ids().

I'm not too fond of monkeypatching FileID instances, however. It doesn't look very clean to me.


bartvm avatar bartvm commented on June 15, 2024

Yeah, in my example dataset is such an instance. We're not really monkey-patching though, since we're adding attributes, not overriding them. I think that's okay.

rizar avatar rizar commented on June 15, 2024

What's the point of all this HDF5 stuff if we don't use it for any dataset?

The point is that any dataset can be converted to HDF5, and that instead of implementing advanced things like reading parts of a file for every format, we will only have to concentrate on HDF5. But there is also the "you don't pay for what you don't use" principle, and I do not think that forcing the user to convert MNIST to HDF5 because HDF5 is the best solution for, let's say, ImageNet is a good idea. Anyway, for conversion we will have to have at least sequential reading implemented for all the formats Fuel is aware of, so why not allow direct access to the data being read, bypassing HDF5, when it is so easy?

Put differently, I find it better to have such super-common cases as MNIST and CIFAR supported out of the box, without the conversion step, especially given the low cost of doing this.


bartvm avatar bartvm commented on June 15, 2024

I disagree. For one, it's more consistent and doesn't leave us with two datasets, NumPyMNIST and HDF5MNIST, leaving users guessing "Is this dataset HDF5-only? Or does it support both? And which one should I use? Which one supports what features?" We'll also likely end up duplicating a bunch of code regarding flattening images, axes semantics (for #13), and casting/scaling (i.e. if the user wants unsigned integer values instead of floating point values).

If the user insists on not converting, it's also one step away to do IndexableDataset({'features': read_mnist_images('train-images-idx3-ubyte'), 'labels': read_mnist_labels('train-labels-idx1-ubyte')}). In that case it is clear that the user is willing to accept a dataset with fewer features. But the normal use case would simply be:

cd FUEL_DATA_PATH/mnist
fuel-convert --mnist
python -c "from fuel.datasets import MNIST; MNIST('train')"

The price you pay for not using it is two lines, and these two lines will make the general usage paradigm very clear to users. Special cases aren't special enough to break the rules, I'd say.


rizar avatar rizar commented on June 15, 2024

So the snippet below does not tell the whole story, if I understood you right! You do plan to support a lot of dataset-specific features, don't you?

class MNIST(HDF5Dataset): 
    file = 'mnist.hdf5'


bartvm avatar bartvm commented on June 15, 2024

What do you mean by dataset-specific features? An HDF5 file can contain groups (in which we can define a validation set, test set, etc.) and it allows for axis semantics (called "dimension scales"). The flattening, casting, etc. will be part of the HDF5Dataset class. So it could look something like

class HDF5Dataset(Dataset):
    def __init__(self, flatten=None, mapping=None, ...):
        dataset = h5py.File(os.path.join(config.data_path, self.file))
        ...

class MNIST(HDF5Dataset):
    file = 'mnist.hdf5'

mnist = MNIST('train')  # Loads the /train/features and /train/targets nodes in the file
mnist = MNIST('train', flatten=True)  # Flattens all loaded data to be 2D
mnist = MNIST('train', flatten=True, mapping=[Cast(dtype='float32'), Scale(1/255)])  # Applies mapping
mnist = MNIST('train', load_into_memory=True)  # Does data = numpy.asarray(data) to load into memory
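
As an aside, a minimal sketch of the "dimension scales" axis labelling mentioned above, assuming an already-converted file with a /train/features node (file, node, and label names are illustrative):

import h5py

with h5py.File('mnist.hdf5', 'a') as h5file:   # hypothetical converted file
    features = h5file['train/features']
    # Tag each axis with a semantic label using HDF5 dimension scales.
    for i, label in enumerate(('batch', 'height', 'width')):
        features.dims[i].label = label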


vdumoulin avatar vdumoulin commented on June 15, 2024

Looks like we'll have to find another place to put the reference counter than in the FileID object, as it appears to be immutable.


bartvm avatar bartvm commented on June 15, 2024

Mm, too bad. We could store it as a class attribute on HDF5Dataset as David suggested, e.g. as a dictionary of file-id: count.

vdumoulin avatar vdumoulin commented on June 15, 2024

Here's a very minimal example of what could be done:

import h5py

from fuel.datasets import Dataset  # assuming this is where the base Dataset class lives


class H5PYDataset(Dataset):
    """An HDF5 dataset.

    Parameters
    ----------
    path : str
        Path to the HDF5 file.
    which_set : str
        Subgroup containing the requested data.

    """
    ref_counts = dict()

    def __init__(self, path, which_set, **kwargs):
        self.path = path
        self.which_set = which_set
        h5file = self.open()
        self.provides_sources = h5file[self.which_set].keys()
        self.close(h5file)

        super(H5PYDataset, self).__init__(**kwargs)

    def open(self):
        # Re-use the low-level FileID of an already-open copy of this file,
        # if any, so that splits backed by the same file share one handle.
        file_ids = filter(
            lambda x: x.name == self.path, self.ref_counts.keys())
        name = file_ids[0] if file_ids else self.path
        h5file = h5py.File(name=name, mode="r")
        self.ref_counts[h5file.id] = self.ref_counts.get(h5file.id, 0) + 1
        return h5file

    def close(self, state):
        self.ref_counts[state.id] -= 1
        if not self.ref_counts[state.id]:
            del self.ref_counts[state.id]
            state.close()

    def get_data(self, state=None, request=None):
        return self.filter_sources([data_source[request] for data_source in
                                    state[self.which_set].values()])


vdumoulin avatar vdumoulin commented on June 15, 2024

The example lacks start and stop constructor arguments, but other than that it's functional.

It assumes that the HDF5 file has the following structure:

  • Subgroups of root correspond to the available splits (e.g. 'train', 'valid', 'test')
  • Children of these subgroups are datasets whose names correspond to the available data sources


bartvm avatar bartvm commented on June 15, 2024

self.ref_counts should probably be H5PYDataset.ref_counts so that different instances can share file handles.

I was thinking about the splits by the way, and whether it really makes sense to keep them in separate nodes. Perhaps I'm trying to be too general, but strictly speaking it seems strange to limit ourselves to one split and hard code that. One example I can think of where multiple splits are handy is for datasets that are a combination of multiple sources e.g. for machine translation you would want to keep track of which corpora the original sentences came from, but that doesn't necessarily overlap with the train, valid and test splits. There are also use cases I guess where you would want to load the entire dataset, which is impossible for e.g. MNIST right now, and implementing it would mean an awkward concatenation of the different sets.

For splits that are contiguous, we could easily store them as HDF5 attributes by just doing e.g. h5file.attrs['train'] = [0, 60000]; h5file.attrs['test'] = [60000, 70000]. If we ever run into a case where we want splits that are non-contiguous, we could use something like h5file.attrs['alternative_train'] = '/split/alternative_train' and store lists of indices in /split/alternative_train, /split/alternative_test, etc. The logic is then:

  • MNIST() loads the HDF5 file in its entirety
  • MNIST('train') checks the split = h5file.attrs['train'] attribute and loads
    • if it finds an array it slices h5file[source][split[0]:split[1]]
    • if it finds a string, it loads h5file[source][h5file[split]]
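
A rough sketch of that lookup logic, assuming the attribute and node conventions proposed above (all names are placeholders):

import numpy

def resolve_split(h5file, source, which_set=None):
    if which_set is None:
        return h5file[source]                      # whole dataset
    split = h5file.attrs[which_set]
    if isinstance(split, (bytes, str)):            # non-contiguous: indices stored in a node
        indices = numpy.asarray(h5file[split])
        return h5file[source][indices]             # fancy indexing (indices must be increasing)
    start, stop = split                            # contiguous: [start, stop) slice
    return h5file[source][start:stop]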


dwf avatar dwf commented on June 15, 2024

Err, self.ref_counts will resolve to the class attribute on access, will it not? (If we were replacing it in instance method code, that would be an issue.)


bartvm avatar bartvm commented on June 15, 2024

It would resolve to it when reading, but not when writing to it. In order to write to the class attribute, you need to access it explicitly on the class.

>>> class Foo(object):
...     bar = 1
...
>>> foo = Foo()
>>> foo.bar = 2
>>> foo.bar
2
>>> Foo.bar
1


dwf avatar dwf commented on June 15, 2024

Right, but wouldn't we generally be mutating that dictionary rather than overwriting the entire dictionary itself?
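
A quick illustration of this point (generic names, not the actual Fuel classes): mutating a class-level dict through an instance affects the object shared by all instances, unlike rebinding the attribute:

class Counter(object):
    ref_counts = {}

a, b = Counter(), Counter()
a.ref_counts['mnist.hdf5'] = 1   # mutates the dict shared via the class
print(b.ref_counts)              # {'mnist.hdf5': 1}
print(Counter.ref_counts)        # {'mnist.hdf5': 1}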


bartvm avatar bartvm commented on June 15, 2024

Very good point, ignore me :)


vdumoulin avatar vdumoulin commented on June 15, 2024

I opened a WIP PR (#43) for the implementation of H5PYDataset.


vdumoulin avatar vdumoulin commented on June 15, 2024

One thing to keep in mind with all-in-one-file datasets is that using the 'core' driver option will load the entire file in memory, even though we might end up using only part of it.


bartvm avatar bartvm commented on June 15, 2024

I guess that would be the user's fault anyway, for passing core explicitly. But I'm wondering whether that is actually true; the documentation says:

Memory-map the entire file; all operations are performed in memory and written back out when the file is closed.

For writing this would mean that the entire file ends up in memory if you are creating it from scratch, but for reading that's not entirely clear to me. It seems to suggest copy-on-write behavior to me, so as long as you don't perform any writes you'll just be reading memory-mapped data, no?
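
For reference, a minimal sketch of how the driver in question is selected (the file name is a placeholder); whether a read-only 'core' file really stays memory-mapped rather than fully loaded is exactly the open question here:

import h5py

# Open with the in-memory 'core' driver; passing backing_store=False would
# additionally discard any changes instead of writing them back on close.
h5file = h5py.File('mnist.hdf5', 'r', driver='core')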


vdumoulin avatar vdumoulin commented on June 15, 2024

I did a very informal test on an HDF5 version of binarized MNIST where I

  1. instantiated the training and test sets,
  2. called open() on the training set,
  3. called close() on the training set,
  4. called open() on the test set, and
  5. called close() on the test set.

Memory usage increased by the same ~50 MB after steps 2 and 4, which corresponds to the size of my HDF5 file on disk.


bartvm avatar bartvm commented on June 15, 2024

That might be the caching mechanism though. HDF5 is pretty smart about keeping as much in memory as possible. I guess you'd have to try with a file that is larger than your RAM.

Then again, do we care much about what happens in this particular driver mode? Loading into memory can be done with something like np.asarray(dset[slice]), which would only load a subset, so we don't need this driver mode, do we?
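
A minimal sketch of that alternative, assuming the single-file layout discussed above (file and node names are placeholders): open the file with the default driver and pull only the wanted rows into memory explicitly:

import h5py
import numpy as np

with h5py.File('mnist.hdf5', 'r') as h5file:
    # Only rows 0..50000 of the features are read from disk and materialised.
    features = np.asarray(h5file['train/features'][0:50000])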

vdumoulin avatar vdumoulin commented on June 15, 2024

The core driver looks like the simplest way to have your dataset in memory from an implementation perspective, but you're right that loading into memory can be done by hand.


vdumoulin avatar vdumoulin commented on June 15, 2024

Concerning splits, I thought about @bartvm's suggestion to concatenate all splits and have the split information be an HDF5 dataset attribute, and I'm starting to think it's a pretty good idea.

Here are some additional thoughts:

  • We have to be careful not to accidentally load things into memory. That means some indexing gymnastics, especially in the non-contiguous case. It's not a dealbreaker, but it's something to keep in mind.
  • We need to think carefully about the semantics of which_set, start and stop. For instance, what does which_set, start, stop = 'test', 100, 500 mean? I propose the following:
    • which_set is resolved first
    • start and stop define a slice within the context of which_set
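
A hedged sketch of those semantics, assuming contiguous splits stored as attributes as discussed earlier (names are placeholders): which_set is resolved to an absolute range first, and start/stop then slice within it:

def resolve_request(h5file, which_set, start=None, stop=None):
    set_start, set_stop = h5file.attrs[which_set]      # e.g. 'test' -> [60000, 70000]
    abs_start = set_start + (start or 0)
    abs_stop = set_stop if stop is None else set_start + stop
    return slice(abs_start, min(abs_stop, set_stop))

# which_set, start, stop = 'test', 100, 500 would thus address rows
# 60100:60500 of the concatenated data.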


vdumoulin avatar vdumoulin commented on June 15, 2024

Another question for you guys: I started working on the data conversion script, and I was wondering where to put it.

Would a bin directory à la Theano work for you?


bartvm avatar bartvm commented on June 15, 2024

Yes, but more à la Blocks, i.e. with the command parsing in a bin directory, while keeping the actual conversion logic in a converters directory that is in the Fuel namespace (so importable).
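
A hedged sketch of that layout (the module and function names here are hypothetical, not the actual Fuel code):

# fuel/converters/mnist.py -- importable conversion logic
def convert_mnist(directory, output_file):
    """Read the raw MNIST files in `directory` and write `output_file` (HDF5)."""
    ...

# bin/fuel-convert -- thin executable that only parses arguments
import argparse

from fuel.converters.mnist import convert_mnist  # hypothetical module path

parser = argparse.ArgumentParser(description='Convert built-in datasets to HDF5.')
parser.add_argument('--mnist', action='store_true')
args = parser.parse_args()
if args.mnist:
    convert_mnist('.', 'mnist.hdf5')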

dwf avatar dwf commented on June 15, 2024

There are also use cases I guess where you would want to load the entire dataset, which is impossible for e.g. MNIST right now, and implementing it would mean an awkward concatenation of the different sets.

Can you give an example of where this might be useful? I'm having a hard time coming up with one.


bartvm avatar bartvm commented on June 15, 2024

Eh... Applying a kind of preprocessing to the entire dataset? Let's say I want to train on MNIST upside down or want to limit the vocabulary of my corpus; right now I need to load and apply the rotations/token replacements to each split separately.



vdumoulin avatar vdumoulin commented on June 15, 2024

I think we're in a state where this issue has been fixed. I'll close it, but feel free to re-open it if you feel like something's still missing.

