Comments (11)
I think we should benchmark the performance hit we're looking at if we choose to do example-wise preprocessing. It probably won't make much of a difference for large models (especially if we have good multithreading support to do the preprocessing in parallel), but for small models it may have a big impact.
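To make that concrete, something like this quick sketch could give a ballpark figure; the transformation and the sizes here are made up for illustration:

```python
import timeit

import numpy as np

batch = np.random.rand(128, 784).astype('float32')

def batchwise():
    # One vectorized NumPy call for the whole batch.
    return batch * 2.0

def examplewise():
    # One call per example, then re-pack into an array.
    return np.array([example * 2.0 for example in batch])

print('batch-wise:  ', timeit.timeit(batchwise, number=1000))
print('example-wise:', timeit.timeit(examplewise, number=1000))
```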
It's not clear to me yet why batch-wise and example-wise preprocessing should be mutually exclusive. I haven't looked at the code long enough to get a good high-level feel of how things fit together, so the following suggestion may not be applicable. But would it be possible to require that both batch-wise and example-wise preprocessing be supported, with a default batch-wise implementation that simply concatenates a bunch of example-wise calls?
---
They're not mutually exclusive per se, although right now there is no good way of checking whether you received a single example or a batch, besides inspecting the shape of the data. This can get quite messy: you end up with code like "if it's a list, but the first element is also a list, then I'm going to assume it's a batch", but some transformers should in principle work for lists, tuples, NumPy arrays, etc.

I'm not too keen on the idea of hard-coding an `is_batch` flag, although that would make it easier to implement transformers that deal with both. Transformers could then have a `get_example` and a `get_batch` method instead of the current `get_data`, and `get_batch` would default to `get_example(example) for example in batch` (see the sketch below).

My current proposal is simply to make most transformers example-only by default. For cases where the speed-up is significant and the demand is high, we could implement a second, batch-wise version, e.g. a `Whiten` transformer as well as a `BatchWhiten` transformer.
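A minimal sketch of that default, assuming a hypothetical stripped-down base class (the names mirror the discussion, not Fuel's actual API):

```python
class Transformer(object):
    """Hypothetical base class for the proposed get_example/get_batch split."""

    def __init__(self, data_stream):
        self.data_stream = data_stream

    def get_example(self, example):
        # Example-wise transformers override this.
        raise NotImplementedError

    def get_batch(self, batch):
        # Default batch behaviour: apply the example-wise transformation
        # to each example of the batch.
        return [self.get_example(example) for example in batch]
```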
---
I still need to read the code more carefully, but I think I understand what you're getting at. Depending on the number of useful batch transformations, we may end up with lots of `Transform`/`BatchTransform` pairs, though.
---
Mm, rather than having separate transformers, or automatically trying to deduce whether something is a batch, maybe we can introduce a flag, `batch=True`, which transformers can optionally support? Transformers that don't support it just act on examples; those that do implement two methods and use one or the other based on the value of the `batch` flag.
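For concreteness, here is a minimal sketch of how such an opt-in flag could look; `Scale` and its methods are made up for illustration:

```python
class Scale(object):
    """Hypothetical transformer that supports both modes via a batch flag."""

    def __init__(self, factor, batch=False):
        self.factor = factor
        self.batch = batch

    def get_example(self, example):
        return example * self.factor

    def get_batch(self, batch):
        return [self.get_example(example) for example in batch]

    def get_data(self, data):
        # Dispatch on the flag instead of guessing from the data's shape.
        return self.get_batch(data) if self.batch else self.get_example(data)
```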
---
That would seem reasonable to me.
---
I fully agree that processing example-wise should be the predominant way of writing transformers. That will save lots of time for people writing and using them. The idea of an optionally supported "batch mode" seems very reasonable.
---
So here's an idea in slightly more detail:

- `get_data` becomes `get_example` and `get_batch`.
- Each transformer takes a keyword argument `batch` which defaults to `False`. A transformer which only supports batches sets `batch = True` as a class attribute.
- If a transformer doesn't have a `get_batch` method, but `batch=True` was passed, no `child_epoch_iterator` will be set by the `get_epoch_iterator` method. Instead, the `DataIterator` will call `batch = next(self.data_stream.data_stream)` to retrieve the next batch and set `self.data_stream.child_epoch_iterator` to `iter_(batch)`, iterating over the examples. It will then return `[self.data_stream.get_example() for _ in range(len(batch))]` (see the sketch below).
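A rough sketch of that fallback; the names follow the proposal above and are not Fuel's actual implementation:

```python
class DataIterator(object):
    """Rough sketch of the proposed fallback for example-only transformers."""

    def __init__(self, data_stream):
        self.data_stream = data_stream  # the transformer being iterated over

    def __next__(self):
        transformer = self.data_stream
        if transformer.batch and not hasattr(transformer, 'get_batch'):
            # Example-only transformer asked for batches: pull a whole batch
            # from the wrapped stream, then transform its examples one by one.
            # As in the proposal, get_example pulls from child_epoch_iterator.
            batch = next(transformer.data_stream)
            transformer.child_epoch_iterator = iter(batch)
            return [transformer.get_example() for _ in range(len(batch))]
        if transformer.batch:
            return transformer.get_batch()
        return transformer.get_example()
```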
This has the following limitations, but they seem sensible:

- An example-transformer can't be applied to a batch if it needs an iteration scheme (because it's not clear whether each request applies to the entire batch, or whether there should be one request per example).
- NumPy ndarrays will end up being converted to lists of arrays. I'm not sure whether to special-case this (just calling `numpy.asarray` on batches that were ndarrays when coming in), or to just expect the user to add a kind of `AsNumpyArray` transformer at the end (sketched below).
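Such a transformer could be as small as this hypothetical sketch:

```python
import numpy as np

class AsNumpyArray(object):
    """Hypothetical end-of-pipeline transformer: re-packs a list of
    per-example arrays into a single ndarray."""

    def get_batch(self, batch):
        return np.asarray(batch)
```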
---
Thought about it a bit more, and I'm now wondering whether we should try to intelligently handle batches at all. I can think of quite a few issues:

- Imagine a transformer which filters examples (rejecting them based on some sort of criterion). If we feed it a batch, should the size of the batch be maintained? Or should it just filter the given batch? And if so, what do we do if it filters out every example in the batch?
- Likewise for padding; it doesn't make sense to apply the `Padding` stream example-wise.
So perhaps the simplest solution is the best: transformers can implement two methods (`get_example` and `get_batch`). Which one is used depends on the default of the `Transformer`, or, in case both are supported, on whether the `batch=True` flag is set (see the sketch below).
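As an illustration, a batch-only transformer under this scheme might look like the following hypothetical `Padding` sketch (not Fuel's actual implementation):

```python
import numpy as np

class Padding(object):
    """Batch-only: pads variable-length examples to a common length and
    returns a mask marking the real (unpadded) entries."""

    def get_batch(self, batch):
        max_length = max(len(example) for example in batch)
        padded = np.zeros((len(batch), max_length))
        mask = np.zeros((len(batch), max_length))
        for i, example in enumerate(batch):
            padded[i, :len(example)] = example
            mask[i, :len(example)] = 1
        return padded, mask

    # No get_example: padding a single example is meaningless.
```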
---
In the first case I would not support batch input at all. I think it is okay for some transformers to be example-only or batch-only, like your second example.
Your final proposal sounds good.
---
Being addressed in #40
---
Closed via #45 (rebase of #40).