Comments (11)

vdumoulin commented on July 21, 2024

I think we should benchmark what performance hit we're looking at if we choose to do examplewise preprocessing.

It probably won't make much of a difference for large models (especially if we have good multithreading support to do the preprocessing in parallel), but for small models it may have a big impact.
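
A minimal sketch of such a benchmark, with a made-up preprocessing step and data shape standing in for a real transformer:

```python
# Hypothetical micro-benchmark: compare one batchwise call against a
# loop of examplewise calls for the same (made-up) preprocessing step.
import timeit

import numpy

data = numpy.random.rand(128, 784)  # one batch of 128 examples

def transform(x):
    # Stand-in preprocessing step (global standardization)
    return (x - x.mean()) / x.std()

batchwise = timeit.timeit(lambda: transform(data), number=1000)
examplewise = timeit.timeit(
    lambda: [transform(example) for example in data], number=1000)
print('batchwise: {:.3f}s, examplewise: {:.3f}s'.format(
    batchwise, examplewise))
```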

It's not clear to me yet why batchwise and examplewise preprocessing should be mutually exclusive. I haven't looked at the code long enough to get a good high-level feel for how things fit together, so the following suggestion may not be applicable, but would it be possible to require that both batchwise and examplewise preprocessing be supported, with a default batchwise implementation that simply concatenates a bunch of examplewise calls?
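
Something along these lines, as a rough sketch (class and method names are hypothetical, not actual Fuel API):

```python
# Rough sketch of the suggestion: subclasses implement get_example, and
# the default get_batch just maps it over the examples in a batch.
class Transformer(object):
    def get_example(self, example):
        raise NotImplementedError

    def get_batch(self, batch):
        # Default batchwise implementation built from examplewise calls
        return [self.get_example(example) for example in batch]
```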

bartvm commented on July 21, 2024

They're not mutually exclusive per se, although right now there is no good way of checking whether you received a single example or a batch, besides inspecting the shape of the data. This can get quite messy: you end up with code like "if it's a list but the first element is also a list, then I'm going to assume it's a batch", but some transformers should in principle work for lists, tuples, NumPy arrays, etc.
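
To illustrate how fragile that kind of heuristic gets (purely hypothetical code, not something from Fuel):

```python
import numpy

# Shape-sniffing breaks down quickly: a single 2-D example looks like a
# batch, and a list-valued example looks like a batch of lists.
def looks_like_batch(data):
    if isinstance(data, numpy.ndarray):
        return data.ndim > 1  # wrong for an example that is itself 2-D
    if isinstance(data, (list, tuple)):
        return bool(data) and isinstance(
            data[0], (list, tuple, numpy.ndarray))
    return False
```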

I'm not too keen on the idea of hard-coding an is_batch flag, although that would make it easier to implement transformers that deal with both. Transformers could then have a get_example and a get_batch method instead of the current get_data, with get_batch defaulting to get_example(example) for example in batch.

My current proposal is simply to make most transformers example-only by default. For cases where the speed-up is significant and the demand is high, we could implement a second, batchwise version, e.g. a Whiten transformer as well as a BatchWhiten transformer.

vdumoulin commented on July 21, 2024

I still need to read the code more carefully, but I think I understand what you're getting at.

Depending on the number of useful batch transformations, we may end up having lots of Transform/BatchTransform pairs, though.

bartvm commented on July 21, 2024

Mm, rather than having separate transformers, or automatically trying to deduce whether something is a batch, maybe we can introduce a flag batch=True which transformers can optionally support? Transformers that don't support it just act on examples, and those that do support it implement two methods, and use one or the other based on the value of the batch flag.
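
As a rough sketch of what that flag could look like (hypothetical names, not an actual Fuel interface):

```python
# Hypothetical dispatch on an optional batch=True flag: transformers
# that support batches define get_batch; everything else is
# examplewise-only and rejects the flag.
class Transformer(object):
    def __init__(self, batch=False):
        if batch and not hasattr(self, 'get_batch'):
            raise ValueError('this transformer does not support batches')
        self.batch = batch

    def get_data(self, data):
        # Subclasses define get_example (and optionally get_batch)
        return self.get_batch(data) if self.batch else self.get_example(data)
```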

vdumoulin commented on July 21, 2024

That would seem reasonable to me.

rizar commented on July 21, 2024

I fully agree that processing example-wise should be the predominant way of writing transformers. That will save lots of time for people writing and using them.

The idea of an optionally supported "batch mode" that can be turned on seems very reasonable.

bartvm commented on July 21, 2024

So here's an idea in slightly more detail:

  • get_data becomes get_example and get_batch
  • Each transformer takes a keyword argument batch which defaults to False. A transformer which only supports batches sets batch = True as a class attribute.
  • If a transformer doesn't have a get_batch method but batch=True was passed, no child_epoch_iterator will be set by the get_epoch_iterator method. Instead, the DataIterator will call batch = next(self.data_stream.data_stream) to retrieve the next batch and set self.data_stream.child_epoch_iterator to iter_(batch), iterating over the examples. It will then return [self.data_stream.get_example() for _ in range(len(batch))], as sketched below.
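
A rough sketch of that fallback (simplified, with a hypothetical attribute layout, and treating the wrapped stream as an iterator for brevity):

```python
# Simplified sketch of the third bullet: when the transformer only has
# get_example but a batch was requested, pull one batch from the wrapped
# stream, iterate over its examples, and reassemble a batch.
class DataIterator(object):
    def __init__(self, data_stream):
        self.data_stream = data_stream  # an example-only transformer

    def __next__(self):
        # The stream wrapped by the transformer yields batches
        batch = next(self.data_stream.data_stream)
        self.data_stream.child_epoch_iterator = iter(batch)
        return [self.data_stream.get_example() for _ in range(len(batch))]
```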

This proposal has the following limitations, but they seem sensible:

  • An example-transformer can't be applied to a batch if it needs an iteration scheme (because it's not clear whether each request applies to the entire batch, or if there should be one per example).
  • NumPy ndarrays will end up being converted to lists of arrays. I'm not sure whether to special-case this (just calling numpy.asarray on batches that were ndarrays when coming in) or to just expect the user to add a kind of AsNumpyArray transformer at the end.

bartvm commented on July 21, 2024

I've thought about it a bit more, and I'm now wondering whether we should try to handle batches intelligently at all. I can think of quite a few issues:

  • Imagine a transformer which filters examples (rejecting them based on some sort of criterion). If we feed it a batch, should the size of the batch be maintained, or should it just filter the given batch? And in the latter case, what do we do if it filters out every example in the batch?
  • Likewise for padding; it doesn't make sense to apply the Padding stream example-wise.

So perhaps the simplest solution is the best: transformers can implement two methods (get_example and get_batch). Which one is used depends on the default of the transformer or, in case both are supported, on whether the batch=True flag is set.
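
For instance, a batch-only transformer in the spirit of Padding might look like this (hypothetical sketch, not Fuel's actual Padding):

```python
import numpy

# Hypothetical batch-only transformer: padding has no sensible
# examplewise form, so only get_batch is defined and batch defaults on.
class PadToMaxLength(object):
    batch = True  # batch-only, as a class attribute

    def get_batch(self, batch):
        # Zero-pad variable-length 1-D examples to the longest in the batch
        max_length = max(len(example) for example in batch)
        padded = numpy.zeros((len(batch), max_length))
        for i, example in enumerate(batch):
            padded[i, :len(example)] = example
        return padded
```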

rizar commented on July 21, 2024

In the first case I would not support batch input at all. I think it is okay for some transformers to be example-only or batch-only, like your second example.

Your final proposal sounds good.

bartvm commented on July 21, 2024

Being addressed in #40

bartvm commented on July 21, 2024

Closed via #45 (rebase of #40).
