mila-iqia / fuel Goto Github PK

View Code? Open in Web Editor NEW

870.0 65.0 272.0 1.3 MB

A data pipeline framework for machine learning

License: MIT License

Python 99.89% Shell 0.11%

fuel's Introduction

https://travis-ci.org/mila-udem/fuel.svg?branch=master

https://readthedocs.org/projects/fuel/badge/?version=latest

Fuel

Fuel provides your machine learning models with the data they need to learn.

Interfaces to common datasets such as MNIST, CIFAR-10 (image datasets), Google's One Billion Words (text), and many more
The ability to iterate over your data in a variety of ways, such as in minibatches with shuffled/sequential examples
A pipeline of preprocessors that allow you to edit your data on-the-fly, for example by adding noise, extracting n-grams from sentences, extracting patches from images, etc.
Ensure that the entire pipeline is serializable with pickle; this is a requirement for being able to checkpoint and resume long-running experiments. For this, we rely heavily on the picklable_itertools library.

Fuel is developed primarily for use by Blocks, a Theano toolkit that helps you train neural networks.

If you have questions, don't hesitate to write to the mailing list.

Citing Fuel

If you use Blocks or Fuel in your work, we'd really appreciate it if you could cite the following paper:

Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio, "Blocks and Fuel: Frameworks for deep learning," arXiv preprint arXiv:1506.00619 [cs.LG], 2015.

Documentation

Please see the documentation for more information.

fuel's People

Stargazers

Watchers

Forkers

vdumoulin wavelets memimo intermezzo-fr dmitriy-serdyuk arasmus capybaralet dwf adbrebs thrandis laurent-dinh kelvinxu mducoffe mpezeshki caglar sebastien-j bouthilx orhanf edersantana janchorowski udibr jfsantos harm-devries kastnerkyle ejls aalmah julianser cosmoharrigan spartonia johnfrye mmeloon matthiasreisser lukemetz ronekko luffyhwl lamblin ctozlm daviddjchen oiclid zxvix jesselivezey atousatorabi rizar alumae saltydizz negar-rostamzadeh shabanian rodrigob danielmckeown nagyistoce chrishokamp hantek codeaudit dribnet o1lo01ol1o pbrakel fedorajzf mjwillson dhruvparamhans commonlibs liangkai markusnagel scyfer nke001 abergeron zencoding vkarthi46 livitki ablavatski jianbotang parthasen hyeokhyen pombredanne suryanarayadev asifpy dikoufu mcanthony oplatek lixiangnlp anukat2015 critias johnarevalo nikolayvoronchikhin alexmlamb tfjgeorge sandy4321 solertis gnperdue yk pvrancx beronx86 wgapl sygi reggiemead ehosseiniasl xuchongbo guillaumebrg afrik taylerablake xinghedyc

fuel's Issues

Unpack

From mila-iqia/blocks#329

It would be nice to have two new wrappers: one that sorts examples in a batch according to a given key function and one that unpacks batches to compose a stream of examples (like itertools.chain does). These two wrappers could be used one after another to make the data stream more uniform, which can yield very significant speed up in some cases, e.g. when the examples are sequences and it is desirable to have sentences of similar lengths in a batch.

Altogether it should look like that:

data_stream = DataStream(dataset, iteration_scheme=ShuffledScheme(2000))
data_stream = Sort(data_stream, key=_get_input_length)
# this one has long segments of sorted examples: uniform batches can be formed
data_stream = Unpack(data_stream) 
# a data stream with uniform batches!
data_stream = BatchDataStream(data_stream, iteration_scheme=Constant(10))

Refactor lazy properties

Right now lazy properties are part of the InMemoryDataset, however, their principle is useful in other cases as well, see e.g. #9 (comment).

We need to rename lazy properties to something more descriptive, move it out of InMemoryDataset, and redo the documentation to highlight its more general applicability.

HDF5 file format versioning

We should introduce versioning into the HDF5-based format we save, so that we can detect what set of conventions were being followed at that point in time, and read and interpret files that were not written with the current version of the code.

My personal feeling is that format version need not be tied to (or rely upon) Fuel package versioning. This is the approach adopted by NumPy for NPY, and it has worked well.

MultiProcessing doesn't work with HDF5 files

Seems, that there is an issue with MultiProcessing transformer when HDF5 file is used.

Here is a stack trace:

HDF5ExtError: HDF5 error back trace

  File "H5Dio.c", line 182, in H5Dread
    can't read data
  File "H5Dio.c", line 550, in H5D__read
    can't read data
  File "H5Dchunk.c", line 1866, in H5D__chunk_read
    chunked read failed
  File "H5Dscatgath.c", line 542, in H5D__scatgath_read
    datatype conversion failed
  File "H5T.c", line 4809, in H5T_convert
    data type conversion failed
  File "H5Tconv.c", line 3216, in H5T__conv_vlen
    can't read VL data
  File "H5Tvlen.c", line 891, in H5T_vlen_disk_read
    Unable to read VL information
  File "H5HG.c", line 622, in H5HG_read
    unable to protect global heap
  File "H5HG.c", line 262, in H5HG_protect
    unable to protect global heap
  File "H5AC.c", line 1329, in H5AC_protect
    H5C_protect() failed.
  File "H5C.c", line 3570, in H5C_protect
    can't load entry
  File "H5C.c", line 7950, in H5C_load_entry
    unable to load entry
  File "H5HGcache.c", line 141, in H5HG_load
    bad global heap collection signature

End of HDF5 error back trace

H5PYDataset should load lazily (and probably not call load() in constructor)

Isn't that the point of do_not_pickle_attributes after all?

More flexibility to validation set selection

Currently I don't see any clean way to do traditional k-fold or leave-one-out cross-validation or any more flexible division of the training data to the validation set. There is only this (start, end) pair available and it is not flexible enough for this purpose (not possible to define two ranges for training and leave in-between samples for validation). This should be a very typical approach in machine learning in general, but with DL datasets are so big that people don't afford to do this. Still, I'd like fuel to support it. What do you think?

I would be personally happy with just adding a shuffle switch that would randomize the order of samples before division. But of course it would need to be internally randomized only once when opened to avoid train and validation from overlapping.

So this is what I traditionally do:

train = MNIST("train", end=50000) # first 50k samples
valid = MNIST("train", start=50000) # the last 10k samples
test = MNIST("test")

One work-around would be to only create one dataset from "train" and use Transformers somehow, but this data division is so important part of the process that I think fuel should support it directly. Especially because it is pretty easy to implement and support.

from Pylearn2 SequenceDataset to Fuel

I'm trying to test Blocks RNNs, and I think the markov chain example is a little bit hard to read. I can understand the RNN bricks though. I wish I could know what is the easiest/right way to translate Pylearn2 datasets to Fuel datasets? More specifically, I have a pylearn2.VectorSpaceDataset based on pylearn2.SequenceDataSpace and I wish I could use it with Blocks. Also, the dataset iterator is reading from a HDF5 file and swapping the axis to make time the first dimension. How can I just wrap this dataset with a Fuel class and throw it to blocks.DataStream?

Do you have any guidelines for doing so?

Refactor CIFAR10 to subclass H5PYDataset

See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:

Merge transformer doesn't support as_dict argument in get_epoch_iterator

Title explains it.

Ensure serializability MultiProcessing

Set up scrutinizer

Pylearn2 wrapper?

(Sorry if creating a new issue is not the best way to communicate ... is there a preferred method?)

I have been working in pylearn2 for some time now and I am experiencing frustration in its limitations regarding hdf5 datasets. I'm very interested in using fuel for managing my datasets and I was excited to see in the documentation that there is a future direction to integrate fuel and pylearn2. However, I'm a little impatient to start taking advantage of fuel :).

Do you have any recommendations or prototypes towards pylearn2 integration? Seems like communication through the pylearn2 iterator is key. I started to hole-fill the fuel H5PYDataset so that it implemented a pylearn2-capable iterator, but it's turning out to be a big job and I'm not 100% sure that it's the right direction. At this point I'd like to hear your thoughts before I start cowboy coding a fork. I looked in the branches/forks and didn't immediately find anything relevant. Thanks!

Remove blocks imports and references

Number of examples/batches for streams, iteration schemes, and datasets

It is often possible to determine from the iteration scheme and the dataset how many batches will be returned (e.g. ceil(num_examples / batch_size)). This information could be useful (e.g. for the progress bar in Blocks) but the question is which component would implement this logic (because data streams and iteration schemes are agnostic to each other, and it would be a shame to break that). It might also become quite complicated quite quickly (e.g. an iteration scheme that produces more requests than the data stream provides, many data stream wrappers don't have wrappers but simply maintain the number of examples, etc.)

Related to this: right now there is no documented interface for datasets. The question would be whether we want to make e.g. num_examples compulsory (allowing NaN and Inf as values).

Add Iris dataset to built-in datasets

You can find info about the Iris dataset here.

See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:

Factor out common logic MNIST, CIFAR-10 and similar datasets

Further to my comment at #2, things that should probably be factored out:

Finding files

All datasets that load files have similar logic: They concatenate the FUEL_DATA_PATH with the name of the folder in which the dataset is stored, and with the name of the file that we want to read. This should probably be factored out into a get_path() method which can be part of a mix-in DataFiles class or something. This can also raise a custom error with information on how to set the data path in the configuration.

Start/stop

This one might be a bit trickier, but should probably be factored out into a mix-in class as well so that every dataset that is loaded into memory can simply use this mixin class to support the start and stop arguments.

In-memory datasets

Maybe this could be combined with the current InMemoryDataset class, or maybe we could factor it out into a separate class, but the logic of raising an error when the state is not None and simply returning the indices of the data could also be factored out, because it will be needed for any dataset whose get_data logic is basically just indexing.

Flattening

The datasets should probably just return the original images, and we should have a flattening stream that does the flattening batch-wise.

Remove dependency on both PyTables and H5PY

Fuel currently depends on both PyTables and h5py, using them for different datasets. Ideally I think we should at least make PyTables an optional dependency.

H5PYDataset does not play well with DataStream for serialization

This discussion started at the mailing list.

ref_sources is a class attribute in H5PYDataset, which causes it to not be serialized. However, if you use a DataStream in your code, a handle to the HDF5 file is saved as an instance attribute, which causes it to be serialized. However, most objects in h5py can't be serialized (more info here), which causes loading a pickled DataStream to fail. I am not sure on what could be done, as saving the DataStream state is useful when the dataset is not an H5PYDataset. The handle to the HDF5 file does indeed hold a state while the file is loaded (which driver is being used, buffering settings, etc), but this information is not useful if reloading an HDF5 file from scratch.

Document how to deal with synchronous Blocks/Pylearn2/Fuel changes

When the external interface of Fuel changes e.g. by renaming/removing methods/classes that could be imported, it could break libraries that depend on Fuel (Blocks right now, eventually Pylearn2).

In this case, we will need to coordinate changes to these libraries and Fuel in the following way:

Make changes to Fuel. Travis runs Blocks tests from master by default.
If Blocks tests crash because it needs changes, make needed changes to Blocks and create a PR for that as well.
Change @master to @my_blocks_pr in Fuel's .travis.yml, running the tests from the new Blocks branch.
If both PRs pass, change @my_pr back to @master and merge both PRs simultaneously.

Sources contract

Is it assumed that sources are fully "parallel", i.e. same number of examples in each?

Two issues come to mind:

Some datasets, including recent ImageNet challenge datasets, do not have test set groundtruth. It would be nice to include test data images even if there are no labels. One could define an "unknown" label but this might get messy.
Some datasets include rather extensive metadata (I'm thinking of tree-structured prediction tasks, needing to put the encoded tree somewhere). Is this a "source"? Presumably not, but maybe?

Add documentation for axis labeling

Transformers should have batch- and example-specific methods

I was just wondering whether we should make this a kind of policy: It's okay (and expected) for transformers to only act on examples, not on batches. There are basically two arguments that I can think of:

Pro

It makes code a lot simpler. This is n-grams for batches (and it's actually still not complete):

        features, 
        for _, sentence in enumerate(self.cache[0]):
            features.append(list(
                sliding_window(self.ngram_order,
                               sentence[:-1]))[:request - len(features)])
            targets.append(
                sentence[self.ngram_order:][:request - len(targets)])
            self.cache[0][0] = self.cache[0][0][request:]
            if not self.cache[0][0]:
                self.cache[0].pop(0)
                if not self.cache[0]:
                    self._cache()
            if len(features) == request:
                break
        return tuple(numpy.asarray(data) for data in (features, targets))

and this is it for examples:

        while not self.index < len(self.sentence) - self.ngram_order:
            self.sentence, = next(self.child_epoch_iterator)
            self.index = 0
        ngram = self.sentence[self.index:self.index + self.ngram_order]
        target = self.sentence[self.index + self.ngram_order]
        self.index += 1
        return (ngram, target)

If NGramStream had to deal both with batches and with examples, the code would be very long for such a simple operation. This goes for many, many cases. Hence, I'd prefer transformers to work on examples, and expect the user to add a BatchStream at the end.

Con

Speed. Performing operations on batches can often be faster.

My take on it is that we can aim at one of two things:

We can try to make Fuel as efficient as possible. That means quite a bit of code to make sure that we handle large batches efficiently, and it might limit our ability to easily add new transformers (because they need all this logic coded up).

Alternatively, we can just say that our primary goal is the easy creation of processing pipelines. We will care more about prototyping e.g. testing dozens of different combinations of transformers to see which one works best, and making it very easy to add new ones. This means that the pipelines might not be as fast as they could be, but I think (hope) not slow enough to be prohibitive. Once you have found your optimal pre-processing pipeline and really need the speed, it should be easy to code up a single, specialized transformer that does everything you want more efficiently on batches/in Cython/using GPU/etc.

H5PYDataset lost split data if not load_in_memory

I followed the tutorial here to create a custom dataset. But when loading data using H5PYDataset, some of the data seems to be lost when there is a split array.

Here is an example.
https://gist.github.com/bobogei81123/6a2d93caa6f809526e4f
It seems like setting load_in_memory=True could solve the problem.
(Also, ShuffleScheme won't work if load_in_memory=False)

RenameStream

From mila-iqia/blocks#267

Something that @arasmus ran into: When combining multiple data streams, the names of the sources might clash. I guess a general solution for this would be to create a RenameStream which just renames the sources and passes on the data directly. That way you can put than into the yet-to-be-written data stream-chainer.

Write regression test for #44

See #44.

Document how to write a script that spawns a dataset server before training

Launching a combo dataset server/experiment script is going to be a pretty common use case. It'd be nice to include some sort of helper to make this as painless as possible.

Indexable Assumes Data is Available in Constructor

Currently Indexable requires data to be passed as a constructor argument. I do not find it nice, because some questions about the dataset can be answered even before actually loading the data. See https://groups.google.com/forum/#!topic/blocks-users/Bh9HtqmXjJk, for instance.

Add 'Adult' dataset to built-in datasets

You can find info about the Adult dataset here.

See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:

Refactor MNIST to subclass H5PYDataset

See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:

I got a crashing (Blocks) MNIST tutorial using float32 and GPU

When setting ~/.theanorc floatX=float32 and device=gpu, I couldn't run the Blocks MNIST tutorial. I thought that Fuel could check theano.config.floatX and serve downcasted when necessary.

configparser instead of pyyaml?

Just wondering: It seems your requirements for configuration-file parsing are not very high. Maybe it would be sufficient to use https://docs.python.org/2/library/configparser.html ?

Pros: it's part of the std. lib for Python 2 and 3
Cons: a) it's Windows-INI style syntax; b) arguably less powerful than YAML.

Turn 'how to create a new dataset' tutorial into unit tests

A Scheme to Iterate over Single Examples

It is currently impossible to iterate over MNIST examples, not batches of examples.

Even SequentialScheme(num_examples, 1) does not fill this gap, because the requests it provides are singleton lists.

I would solve that by adding a new scheme, let's say ExamplesScheme, whose requests would be integers, not lists.

Type checking data through axis semantics

Following offline discussion with @lamblin and @vdumoulin, we agreed that the most important kind of type checking to perform in the data processing pipeline is probably the semantics of the axes of the data. Data streams should provide information about what each axis of the input and output represents e.g.

An image: (channel, height, width)
A batch of images: `(batch, channel, height, width)
A sentence (sequence of indices): (features) (maybe a labels role?) or before going into Blocks: (time, batch, features)
A set of n-grams from a sentence: (batch, features)

The behaviour of data streams regarding these labels should be configurable, so they can either ignore, warn or raise errors if the data input is not what they expected.

Some things that need to be thought about:

Do we just use strings, or do use singletons (allowing us to create a class hierarchy)?
Do we want to add dimensionality e.g. each axis has a dimensionality (or can be variable)? This could be useful to check that e.g. an image has exactly 3 colour channels.
- Longer term, this would also allow for the kind of checking that Pylearn2 performs (e.g. make sure that the data dimension is the same as the input layer/brick).

This is closely related to mila-iqia/blocks#30

Create a google group for fuel users

This would allow people to ask questions without opening issues.

Add support for lists of indices in H5PYDataset subsets.

We should be able to create a split that is a list of disjoint indices, as well as specify the subset of an existing split as a list of disjoint indices.

Add CIFAR100 to built-in datasets

See these two tutorials (in order) for a crash-course on how to implement a new built-in dataset in Fuel:

Write a regression test verifying picklability of Unpack

See #44. mila-iqia/picklable-itertools@825eb0f solved this (I think), but we might as well add a test to make sure this doesn't happen again, just in case.

This could be eventual CCW fodder.

Automated data downloading

Copy from mila-iqia/blocks#105

We could use the https://github.com/jaberg/skdata to download popular datasets and even to load them into memory.

Integrating with skdata might not be necessary anymore now that we have a separate data framework. But we should still come up with a solution one way or the other.

Subset MNIST

Hi,

I'd like to run an experiment using a subset of the binarized MNIST 28 dataset.

I just want to run some quick experiments using say the first 12,000 images.

How do I do that?

Thanks,

Allow data server to use divide-and-conquer

@yaoli was interested in using a divide-and-conquer approach to preprocessing, as is used in in @dwf's ImageNet PR (#68). With that code, I think it should be relatively easy to update the preprocessing server to (optionally) use multiple workers as well.

Seems Like Unpack is not Picklable

My first attempt to use it ended up with the message below. I could not immediately find the reason in the code, will look later.

cPickle.PicklingError: Pickling failed.

Blocks relies on the ability to pickle the entire main loop, which includes all bricks, data streams, extensions, etc. All of these must be serializable using Pythons pickling library. This means certain things such as nested functions, lambda expressions, generators, etc. should be avoided. Please see the documentation for more detail.

Original exception:
        PicklingError: Can't pickle <type 'iterator'>: attribute lookup __builtin__.iterator failed

Original exception:
        PicklingError: Pickling failed.

Blocks relies on the ability to pickle the entire main loop, which includes all bricks, data streams, extensions, etc. All of these must be serializable using Pythons pickling library. This means certain things such as nested functions, lambda expressions, generators, etc. should be avoided. Please see the documentation for more detail

Write installation docs

Add to Readme.rst that libraries certifi and urllib3 are prerequisites, and add instructions on how to install these @vdumoulin.

Switch to lists for BatchedDataStream

See discussion at mila-iqia/blocks#336 (comment)

Wishlist: "server" process to do preprocessing in a separate thread

A bit like Bokeh, it would be great to have the option of launching a Fuel server which can do preprocessing in a separate thread.

I imagine a scenario like this: you'd ask the Fuel server to maintain a queue of 10 preprocessed batches of examples from some dataset (e.g. ZCA-whitened CIFAR10 images) according to some iteration scheme, and your client application (e.g. Blocks, Pylearn2) could "consume" this data while Fuel replenishes the queue in parallel.

Rename `DataStreamWrapper` to `Transformer`

DataStreamWrapper made sense when we were designing this, because the point was that data streams and data streams wrappers share the same interface. (Initially we actually tried making them one and the same class, but that ended up not happening, and now they just share a base class.)

The problem with DataStreamWrapper is that it's a bit obscure of a name, and I think it actually confuses people sometimes, because they're not sure what the difference between DataStream and DataStreamWrapper is. I actually like Pylearn2's Transformer better. Another alternative is Preprocessor, but that sounds too much like just doing whitening, etc. and not routine transformations like sorting, merging, etc.

So the terminology then becomes: Instantiate a dataset, and create a data stream that reads from this dataset (potentially using a particular iteration scheme). Afterwards you can apply a series of transformations on the data stream (some transformers can support the iteration_scheme argument, some don't, but the get_epoch_iterator interface is the same).

Thoughts? Here I go randomly tagging people for opinions again: @rizar @lamblin @vdumoulin @laurent-dinh @dwf @pbrakel @jbornschein

Proposed HDF5-centric dataset refactor

Following discussion on #35,

Refactor loading logic into low-level file format utilities for loading original source data.
Add HDF5DatasetFile class with load_in_memory attribute.
Add fuel-convert utility that uses the low-level utilities to load data and spit them out in HDF5.
Replace existing dataset classes with ones that use (Inherit, delegate?) to HDF5DatasetFIle.

progressbar for long-running converters?

I'm thinking it would be nice to display a progress bar when doing a super-long conversion, but I don't like the idea of littering the converter code itself with calls to progressbar.

One possibility is using the 'extra' dictionary on LogRecords and installing a custom handler in fuel-convert. That way it's just passed as usable metadata in regular logging calls and fuel-convert (or any other client) can use it as it likes. This seems fairly clean. Any thoughts?

Plan for wrapping Fuel datasets into Pylearn2

One of the goals for factoring Fuel out of Blocks was to be able to re-use it as a new dataset back-end in Pylearn2.
Here is a proposal and a starting plan for that. Ideas, comments and changes are very welcome.

Semantic information on axes

See ticket #13. Important parts would be:

Have that information stored in the dataset. For HDF5 datasets, this can be stored in the label attribute of each dimension (default is the empty string). We do not need "dimension scales" for that (they are for associating numeric information to the whole slice along that dimension). Done in #78 for H5PYDataset.
Have that information available for the inputs and outputs of a transformer, when it makes sense. Have a way for the user to specify them (for the outputs) if needed.
~~Have a convention for what they mean, and common abbreviations.~~ No abbreviations, no semantically empty names like "axis_0".

The wrapper itself

Given a Pylearn2 data_specs and a Fuel Dataset, it should be possible to create a Pylearn2 iterator that is usable by TrainingAlgorithms.

Implement FuelWrapperDataset in Pylearn2, constructed from a Fuel pipeline, with an option so that its iterator method can either:
- fail if the sources and dimension labels do not correspond to the requested data_specs
- add Transformers to the pipeline so that they correspond

Porting Pylearn2 Datasets to Fuel

When the proof of concept of using Fuel datasets from Pylearn2 works, we can start working on porting existing Pylearn2 datasets to Fuel.

Finish the tutorial on how to add a new dataset (#75)
For each dataset, add a ticket to:
- Write a script downloading the data from its original source. It should be put in fuel/downloaders/ and integrate with bin/fuel-download.
- Write a script that reads data and serialize it to HDF5 with the appropriate metadata. It should be put in fuel/converters/ and integrate with bin/fuel-convert.
- Add a function to instanciate the dataset itself from these files.
For each preprocessor, or preprocessor-like constructor option, write an equivalent Transformer

Bonus

If Fuel becomes an (optional at first) dependency for Pylearn2, it would be better if its ownership would be transferred to the MILA organization (https://github.com/mila-udem).

Document the implementation of a new DataStream

Could you possibly write some documentation on how to create a new datastream, and which methods to overload? I'm trying but having some trouble knowing what to implement.

mila-iqia / fuel Goto Github PK

fuel's Introduction

Fuel

fuel's People

Stargazers

Watchers

Forkers

fuel's Issues

Pro

Con

Semantic information on axes

The wrapper itself

Porting Pylearn2 Datasets to Fuel

Bonus

Recommend Projects

Recommend Topics

Recommend Org

Jobs