distillpub / post--feature-wise-transformations

Feature-Wise Transformations

Home Page: https://distill.pub/2018/feature-wise-transformations

License: Creative Commons Attribution 4.0 International

HTML 78.83% CSS 1.19% TeX 10.21% JavaScript 9.77%

post--feature-wise-transformations's Introduction

Distill submission repository

Submission document

The submission document is located at src/index.ejs.

Developing

npm install to install dependencies. (Can take minutes!)

npm run dev to run a watching server.

npm run build to build, transpile, babel-ify and minify files.

Components are in src. The .html files are Svelte components; the .js files are compilation endpoints that are also defined in webpack.config.js. These compiled endpoints are then consumed by hand-authored .ejs files in src.

Visit localhost:8080/index.html for a hot-reloading preview of the article.

post--feature-wise-transformations's People

Contributors

aaroncourville, colah, ethanjperez, fstrub95, harm-devries, ludwigschubert, nschuc, shancarter, vdumoulin


post--feature-wise-transformations's Issues

Archy's feedback

  • (Archy) I found the maths in [the bilinear transformation] section a bit confusing, probably because I'm not very good at maths, but also because of the number of different nomenclatures flying around. Some visualizations will hopefully help (and look like they're in the pipeline).
  • We should discuss the connection to highway networks, which use sigmoidal gating (as in LSTMs) in the layers of a very deep network.
  • The LSTM and PixelCNN self-conditioning explanations are fragmented and list-like.
  • The phrase "Going backwards from the definition of each computation mechanism, we will now explain how they can be expressed in terms of generalized bilinear transformations." is a bit jarring.
  • Add a supporting example to "It could also be that an image needs to be processed in the context of a question being asked."
  • "Attention is a good example of one such principled approach: side information is used [...]" -> s/side/contextual
  • Add a supporting example to "For instance, in conditional decoder-based generative models, we would like to map a source of noise to model samples in a way that is class-aware."
  • The use of both feature-wise affine transformations and feature-wise linear modulation is confusing.
  • (Archy) I think [starting with concatenation] is very useful, i.e. "Let's think of the dumbest solution possible and examine why it doesn't work." I'd be tempted to move this further up the article, to give a simple concrete example of how you might incorporate contextual details. You can then explain the shortcomings and invoke FiLM as the solution. This will also help clear up some ambiguity for the naive reader as to whether you're considering this 'concatenate and forget' solution as an example of FiLM or not (it's not, right, because there is no modulation of existing features, you just add a bunch more?).
  • (Archy) I see we're working towards the definition of a FiLM that you gave earlier -- composed of a biasing and a scaling -- but that progress is not particularly limpid. Could you restructure to make it clear that you're talking about particular parts of the FiLM definition i.e. have subtitles like "+B(Z)" or "Y(x) . x" or something like that? It'd be helpful to more clearly embed the literature review in an exploration of the equation.
  • "Several variants of FiLM can be found in the literature." -> We need to make it clear that we've been discussing PARTS of the FiLM formulation, and now we're discussing models that use all of the bells and whistles.
  • "So far, the distinction between the FiLM generator and the FiLM-ed network has been rather clear, but it is not strictly necessary." -> To make it even more clear, we could spell it out here: "We've had one network which outputs parameters for the transformation, and these are applied to the layers of a second network."
  • "By feature-wise, we mean that scaling and shifting are applied element-wise, or in the case of convolutional networks, feature map-wise." -> (Archy says) This is a little confusing. Element-wise is a mathematical statement, but feature map-wise is a statement about what that unit of the network represents. We should explain the level of granularity in convolutional neural networks vs. fully-connected networks. In CNNs, a feature map is the same feature observed at different spatial locations.
  • Define gamma and beta before introducing them in the first equation of the article.
  • When discussing multiplicative interactions, it would be helpful to introduce the term 'conditional scaling' at this point rather than later, and to clarify the relationship to conditional biasing.
  • Add a "spoiler" sentence connecting CBN and FiLM when introducing CBN.
  • Merge the sentence "We can also use FiLM layers to condition a style transfer network on a chosen style image." into the next paragraph.
  • Should there be a sub-heading for self-conditioned models?
  • (Archy) Can you say something about how squeeze-and-excitation differs from the norm? i.e. all layers are conditioned on previous layers in a vanilla NN... I think this description is a little underspecified. Isn't SE more about allowing between-channel interactions than between-layer interactions?
  • Beef up the conclusion. What is it that our new formulation brings to the table? Why is thinking of these different things within a single family of bilinear transformations a useful exercise? Are there open questions we would like to see answered?
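Several of the bullets above lean on the FiLM definition (per-feature scaling by gamma and shifting by beta) and on the element-wise vs. feature map-wise distinction. A minimal PyTorch sketch of a FiLM layer, written for this discussion rather than taken from the article's code, makes the granularity concrete:

```python
import torch

def film(x, gamma, beta):
    # x: (batch, channels, height, width) feature maps from a convolutional layer.
    # gamma, beta: (batch, channels) coefficients predicted by the FiLM generator.
    # Each (gamma_c, beta_c) pair is broadcast over every spatial location of
    # feature map c: feature map-wise in a CNN, element-wise in a dense network.
    return gamma[:, :, None, None] * x + beta[:, :, None, None]

# Toy usage: a linear generator maps conditioning input z to one (gamma, beta)
# pair per channel. Shapes and sizes here are illustrative, not the article's.
z = torch.randn(4, 16)                   # conditioning input, e.g. a question embedding
generator = torch.nn.Linear(16, 2 * 32)  # predicts gamma and beta for 32 channels
gamma, beta = generator(z).chunk(2, dim=1)
x = torch.randn(4, 32, 8, 8)             # activations of the FiLM-ed network
y = film(x, gamma, beta)
```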

Initial FiLM diagram (Fig. 5)

This is the present version of the first diagram of a FiLM network (Fig. 5 as of 5857f08):

[image: current version of Fig. 5]

Some things to think about:

  • Should FiLM parameters really be boxes? Other boxes are neural net chunks, but it's just a vector.
  • Can this diagram be made less visually noisy?
    • The present grouping of the FiLM-ed network is pretty heavy-handed. You could probably get away without anything, just because of the visual gestalt of the column.
    • Is color really serving an important role, or is it primarily distracting?
    • Related: you probably only want to use color for one thing (e.g. encoding positive/negative values) in a diagram and to use it consistently. Overloading color tends to be confusing.
  • Consider integrating your caption into the diagram. This is one of the best explanatory tricks I know.
    • My pop-psych story about why this is so powerful: when someone is trying to understand an idea, they usually have to hold lots of things in their head. Since they only have about 7 slots of working memory, these are really in demand. When they need to do the extra work of pulling snippets of the explanation out of the caption or surrounding text, it deprives them of a slot or two.

Here's a very fast redesign. Take it with a grain of salt, but you might find it useful to consider. Notice how I've tried to reduce noise and emphasize the important parts. I've also integrated the explanation into the diagram a little bit, although it's possible you might want to go further. I've tried to preserve color for use with the positive/negative values.

[image: quick redesign mock-up]

Xavier's feedback

  • On one hand… on the other hand… implies the two themes are in opposition.
  • Maybe a quick schematic in the section comparing FiLM to attention showing spatial-attention vs. feature attention?

Feature-wise transformation section

  • Don’t only use colour to differentiate the generators, layers, and networks, or if you do, check the colours for colour-blindness!
  • It’s hard to remember, as you read down the article, which colour referred to which kind of element. At the very least, re-including the legend would be helpful. I found the “hover” that replaces the name of the box with the annotation (“FiLMed network”, etc.) confusing. And not a great interaction on mobile, either.

Concatenation section

  • “An improvement over this idea would be to concatenate …”. This is repeating things from earlier in the article, which is ok but I got confused at first and thought that I was re-reading the same section. Some nod to that previous section would be good, like “as described above, an improvement over this idea would be to concatenate z to …”.
  • “This is in fact the approach adopted by the conditional variant of DCGAN [8], a well-recognized class of network architectures for generative adversarial networks (GANs)” Is the reader expected to know what a GAN is? This part is a bit funny because it explains the acronym GAN as though we aren’t familiar, but then doesn’t mention at all what GANs are used for, and doesn’t explain the acronym DCGAN. I might say:

“Deep convolutional generative adversarial networks (DCGAN) are a popular class of generative model often used to create images. The conditional variant, where a single network can create images from multiple classes, uses exactly this concatenation idea, broadcasting the class label as a feature map and concatenating it to the input of the convolutional and transposed convolutional layers in the generator and discriminator networks”

That said, even this assumes a lot of domain knowledge about what exactly a GAN is, (i.e. what a generator and discriminator network is!) I’d be tempted to do a heavier re-work of the system and actually more concretely explain how it works.
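For what it's worth, the broadcast-and-concatenate mechanism described in the proposed rewrite can be sketched in a few lines. This is an illustration of the idea, not the DCGAN reference code, and the names and shapes are made up:

```python
import torch

def concat_condition(x, z):
    # x: (batch, channels, H, W) convolutional feature maps.
    # z: (batch, dim) conditioning vector, e.g. a one-hot class label.
    # Broadcast z over the spatial grid so it becomes `dim` constant feature
    # maps, then concatenate along the channel axis.
    b, _, h, w = x.shape
    z_maps = z[:, :, None, None].expand(b, z.shape[1], h, w)
    return torch.cat([x, z_maps], dim=1)

x = torch.randn(4, 32, 8, 8)
labels = torch.tensor([0, 1, 2, 3])
z = torch.nn.functional.one_hot(labels, num_classes=10).float()
h = concat_condition(x, z)  # shape: (4, 42, 8, 8)
```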

Biasing section

  • It might be helpful to first explain what conditional biasing is, and then explain how it’s equivalent to concatenation. Right now it happens in the opposite order: I’m told it’s equivalent to concatenation and shown how, and I have to infer from that explanation what is meant by conditional biasing. Then a couple of paragraphs later conditional biasing is explained. (A short worked sketch of the equivalence follows.)
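Here is one way the worked sketch could go: pushing the concatenated input through a single linear layer and splitting the weight matrix shows that the z-dependent term is exactly a conditional bias. A minimal numpy illustration, with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)       # input features
z = rng.normal(size=2)       # conditioning information
W = rng.normal(size=(4, 5))  # linear layer acting on the concatenation [x; z]

# Split W into the block acting on x and the block acting on z.
W_x, W_z = W[:, :3], W[:, 3:]

concat_view = W @ np.concatenate([x, z])  # conditioning via concatenation
bias_view = W_x @ x + W_z @ z             # same output: W_z @ z is a conditional bias
assert np.allclose(concat_view, bias_view)
```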

FiLM section

  • Give examples in the main text body of what a textual question about an image would be. For instance, “How many yellow blocks are there?” Similarly for the GuessWhat?! task: give an example of a natural-language question about an image.
  • Bolding for “conditional instance normalization” first-letters is odd.
  • I realize this isn’t a super actionable observation, but I found myself reading and re-reading this section, losing the thread of the narrative. It reads kind of like a “grab bag of places people have used FiLM”. Maybe a way to tighten it up would be to set up more context in the first paragraph somewhere: “FiLM layers have proved useful on many problems, from answering natural-language questions about photographs, to quickly and flexibly transferring the ‘style’ of one image to the content of another, to reinforcement learning on Atari games”.

Self-conditioning section

  • This section could really use a “toy” example of what self-conditioning is before diving into an LSTM. Not unlike the great toy example from earlier showing why concatenation is the same as biasing. (A candidate toy sketch follows this list.)
  • “Utt. sum” is a pretty inscrutable abbreviation in the figure. Just make that box wider and write it out in full :)
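A candidate toy example for the first bullet, kept deliberately simpler than an LSTM: a layer whose own features produce per-feature sigmoidal gates that then rescale those same features. This is a sketch for discussion, not code from the article:

```python
import torch

class SelfGate(torch.nn.Module):
    # Self-conditioning in miniature: the gates are computed from x itself,
    # so the layer's features are modulated feature-wise by its own features,
    # in the spirit of LSTM-style gating.
    def __init__(self, dim):
        super().__init__()
        self.to_gate = torch.nn.Linear(dim, dim)

    def forward(self, x):
        gate = torch.sigmoid(self.to_gate(x))  # one gate in (0, 1) per feature
        return gate * x                        # feature-wise scaling conditioned on x

x = torch.randn(4, 32)
y = SelfGate(32)(x)
```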

Bilinear transformations

  • These figures should label what is x and what is z for clarity.

General thoughts

  • I find the numerical-citation technique “as shown in [23] you can …” a little confusing. I kept thinking there was a grammatical issue in a sentence before realizing that it was just because the subject of the sentence was a citation. This is exacerbated by the fact that the citation links render in a lightly coloured, super-scripted font, and so don’t look like they’re part of the text.
  • I would have loved an abstract or TL;DR at the top. It’s not obvious that you’re going to be introducing a single analytical framework that subsumes FiLM, attention, etc. despite that maybe being an exciting contribution of the article. (Maybe this is coming?)

Fix bib items

  • ref 12. update title from "Film..." to "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning"
  • ref 1. Put @fstrub95 before me on author list of FiLM

Attention vs FiLM

What do you think about adding a paragraph explaining why FiLM may get better results than advanced attention mechanisms?

In a few words (I am merely repeating Harm's intuitions): attention assumes that key information must be retrieved by emphasizing specific spatial or temporal positions (pixels, states in an RNN), while modulation assumes that specific neural activations must be emphasized independently of their spatial/temporal location.
In other words, the two mechanisms work along two different dimensions.
They would be complementary if features were independent of one another. Yet this is not the case! Because of the convolution windows, nearby feature-map pixels carry overlapping information, and a convnet can already filter information spatially (especially if you provide a location mask); the same happens in an RNN as the cell state is updated. So there is redundant information from one hidden state to another, or from one pixel to another. Thus, working only along the pixel/time dimension is sub-optimal, as the network may already have performed an attention mechanism by itself.

In CLEVR, stacked attention does not work, while FiLM does.
In RNNs, self-attention was often reported to have a limited impact, while feature-wise max-pooling or layer norm has been very successful (I have a few citations for both cases).

To conclude, FiLM already enables a kind of attention, by selecting the features that natively perform this attention process. On the other hand, attention cannot perform a feature-wise selection.

Well, this late draft is a bit confused, but I think you already got the point!

Florian
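Florian's "two different dimensions" point can be made concrete with a small sketch. The weights below are random placeholders (in practice attn_logits would come from a query and gamma from a FiLM generator), but the broadcasting shows which axis each mechanism operates on:

```python
import torch

x = torch.randn(4, 32, 64)  # (batch, features, locations), e.g. flattened feature maps

# Spatial attention: one weight per location, shared across all features.
attn_logits = torch.randn(4, 64)
attn = torch.softmax(attn_logits, dim=-1)
attended = (x * attn[:, None, :]).sum(dim=-1)  # (batch, features)

# Feature-wise modulation: one coefficient per feature, shared across locations.
gamma = torch.randn(4, 32)
modulated = x * gamma[:, :, None]              # (batch, features, locations)
```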

Gating section: $a_k$ undefined?

The second equation uses a variable $a_k$ that I can't find defined anywhere. Maybe it's meant to be another variable, or was meant to be defined? Or maybe it is defined somewhere, just not easy to find nearby?

Discussion between Ethan and Vincent

  • Linear or MLP boxes might give the sense that the FiLM parameters are shared
  • "Normalization" or "whitening"?
  • Show just one LSTM cell in dynamic layer normalization
  • Is "ReLU layer" clear enough?
  • Simplify bilinear section by removing the explicit use of biasing and incorporating scaling and shifting coefficients into z directly.
  • "We conjecture (and in some sense Harm and Florian's paper suggest) that we may benefit from having those modalities interact in simple ways in more than one place along the computation of the model’s output."
  • Remove AdaIN? Or mention it briefly after squeeze-and-excitation?
  • Put squeeze-and-excitation before DLN?
  • "[...] employs a self-conditioning scheme in the form of feature-wise sigmoidal gating as a way to condition a layer's features on themselves."
  • We're not sure about the connection to DiSAN anymore.

Comments on Connecting FiLM, gating and attention

I think the connection is pretty neat :) Glad you found a way to cast attention as a generalized bilinear transformation!

Some thoughts when reading the section:

  • `to accommodate affine transformations`: not sure everyone understands the subtle difference between affine and linear transformations, so I would prefer `to accommodate bias terms`.
  • Concerning attention: we have to be explicit that x is a matrix (rather than a vector) with a very unintuitive layout. Namely, rows correspond to features and columns to spatial locations. All implementations of attention I have seen use a [batch, locations, features] layout, so emphasizing this might be useful when parsing the equations. For example, in the first equation of the paragraph (formalizing attention) we could be explicit about the matrix layout by changing $x^t_k$ into $X_{k,t}$. (A small layout sketch follows this list.)
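A tiny numpy sketch of the layout point, illustrative only: in the article's convention X[k, t] is feature k at location t, whereas implementations usually store the transpose. Attention over locations gives the same attended feature vector either way:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_locations = 32, 49

X = rng.normal(size=(n_features, n_locations))  # article layout: X[k, t]
X_impl = X.T                                    # implementation layout: (locations, features)

attn = np.exp(rng.normal(size=n_locations))
attn /= attn.sum()                              # attention weights over locations

attended_article = X @ attn                      # sum over columns (locations)
attended_impl = (attn[:, None] * X_impl).sum(0)  # sum over rows (locations)
assert np.allclose(attended_article, attended_impl)
```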

Order of explanation in conditional biasing

It might be helpful to first explain what conditional biasing is, and then explain how it’s equivalent to concatenation. Right now it happens in the opposite order, I’m told it’s equivalent to concatenation and shown how, and I have to infer from that explanation what is meant by conditional biasing. Then a couple paragraphs later conditional biasing is explained.

Xavier

Meeting points

  • sequential nature of attention
  • even more naive: one network per class
  • highlight parameter sharing
  • remove "so-called"
  • in the introduction, talk about why we want to consider the specific family of feature-wise affine transformations
  • add a visual example of conditioning
  • concat: input top-left
  • involve chris olah et al
  • class-conditional -> conditional. make it clear that this is an example, and conditioning is more than just class-conditioning.
  • replace PixelCNN figure with simpler example of conditional biasing.
  • FiLM-ed layer and FiLM layer are too close phonetically.
  • Idea for animation: class-conditional generative model on MNIST, identify which units are activated for which classes.

Contact Hochreiter

He most probably has written papers on the subject of multiplicative interactions, or feature-wise interactions, so it'd be good to have his input.

General diagram advice

Reduce visual noise

  • Get rid of anything that isn't necessary for the diagram's purpose
  • Emphasize the things you want your reader to focus on. De-emphasize things you don't want them to focus on.
  • Use color sparingly

Integrate Text

  • Look at the caption below each image and ask which parts relate to which section of the image, and whether you can place them there.
  • Consider reading through the momentum article or the attention article, paying attention to the way text is integrated into diagrams.

Specific comments on a few diagrams

[image]

Is the loss actually important in the above diagram? Or is it a distraction from the core point?


[image]

I'm not sure the present diagram helps me understand what's going on very much. I think you could probably make what is actually going on clearer by showing explicit pointwise multiplication and putting a bit of text by the AdaIN block.


[image]

Is the residual block correct in this diagram? Shouldn't the skip go over both convolutions?

I do wonder if there's a way you could present that diagram without having to do the zoom-in. It seems plausible that the right simplification could allow you to explicitly have FiLM blocks in the FiLM-ed network diagram.

How to discuss feature-wise multiplicative vs. additive interactions

In the "Attention over features" section, it's stated that "Both papers [on gated-attention] show improvement over conditioning via concatenation." Overall, we bring up feature-wise biasing and feature-wise scaling, so it's worth chatting about how we want to handle this discussion.

Should we discuss performance comparisons between concatenation/feature-wise biasing and feature-wise multiplication/gating? It seems like this isn't the point of our paper, and that these distinctions might distract readers; we don't want to reinforce that multiplicative vs. additive interactions is the main difference (which isn't what the FiLM paper found), but rather that the feature-wise aspect is important.

I'd be in favor of replacing sentences like "Both papers [on gated-attention] show improvement over conditioning via concatenation." with something more specific on how the feature-wise aspect helps learning. I.e. "\cite{Gated-Attention Architectures for Task-Oriented Language Grounding} shows that feature-wise modulation enables agents to learn to follow language instructions in reinforcement learning by modulating which features in its visual pipeline/convolutional neural network will be important for its downstream policy network for instruction-following. Only features relevant to the particular categories of object types referenced by the language instruction are significantly activated."

I think this is my fault (I must have included this information in the literature review), but I just realized it reading over the draft.
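For reference while we chat about this, a hedged sketch of the gated-attention mechanism under discussion (names and shapes are illustrative, not taken from either paper): a language embedding produces per-channel sigmoid gates that modulate which visual feature maps reach the downstream policy network.

```python
import torch

lang = torch.randn(4, 64)          # instruction embedding
visual = torch.randn(4, 32, 8, 8)  # convolutional feature maps

to_gates = torch.nn.Linear(64, 32)
gates = torch.sigmoid(to_gates(lang))     # (batch, channels) gates in (0, 1)
gated = visual * gates[:, :, None, None]  # feature-wise (per-map) gating
```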

Style transfer network (Fig. 14)

Original diagram:

[image: original style transfer diagram]

Potential redesign:

[image: potential redesign]

Or if you want to use color:

(I actually really like the effect of adding color like this. It makes the diagram feel kind of warm and friendly, and increases the division between the generator and the FiLMed network. If you can isolate the use of color for positive/negative from places where you use it for organization, and pick different color schemes, it may be a good idea to do this.)

[image: redesign with color]

Things to pay attention to:

  • We get rid of all the containers, but achieve the same goals with organization and annotation.
  • The loss doesn't seem relevant to the core message. If we get rid of it, we can simplify the layout a lot.
  • Integrate text.

First two figures don't render on Safari

Running in Python3, the Zoolander and CLEVR figures don't render. Neither does the puppy in the fourth figure, so it's a problem with your nicely vectorized images.

Actually - could you label the figures with numbers too?

Acknowledgements

The following people should be acknowledged for their feedback:

  • Pedro Oliveira Pinheiro (Element AI)
  • Alexei (Element AI)
  • Minh Dao (Element AI)
  • Masha Krol (Element AI)
  • Archy de Berker (Element AI)
  • Dzmitry Bahdanau (MILA)
  • Roland Memisevic (TwentyBN)
  • Xavier Snelgrove (Element AI)

Add a ReLU nonlinearity in the FiLM figure

Should help give the reader a sense of what we can accomplish when combining FiLM and ReLU.

We should be careful in adding that element to the figure, however: we don't want to give the impression that FiLM is always combined with ReLUs, so we need to find a way to make that somehow distinct from the rest of the figure.
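One way to see what the combination buys, as a toy calculation rather than anything from the article: after relu(gamma * x + beta), beta shifts the threshold at which a feature turns on, and a negative gamma flips which side of the threshold passes through.

```python
import torch

x = torch.linspace(-2.0, 2.0, steps=9)

# FiLM followed by ReLU: relu(gamma * x + beta).
# The feature turns on where gamma * x + beta > 0, i.e. at x = -beta / gamma.
for gamma, beta in [(1.0, 0.0), (1.0, -1.0), (-1.0, 0.0)]:
    print(gamma, beta, torch.relu(gamma * x + beta))
```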

Write analysis section

  • Are the FiLM parameters interpretable in some way?
    • Is it unique to FiLM, or is something like conditional biasing, conditional scaling, sigmoid gating also exhibiting interpretable coefficients?
  • Is FiLM being used as a gating-like mechanism, i.e., to control which feature maps get passed onto the following layers?
  • Do we see evidence that FiLM is taking full advantage of its affine transformations, for instance by using the amplification of activations to pass things through?
  • What do the gammas look like in distribution for a sigmoidal gating network? (A measurement sketch follows this list.)
  • Do the observations we're making apply to problem settings other than CLEVR, like style transfer?
  • Can we find evidence supportive of the soft numerical program interpretation?
  • Why is repeated conditioning important?
  • Is the space spanned by FiLM parameters "linear" in some way?
  • Magnitude of gamma
  • Gamma vs. beta
  • Conditioning throughout the hierarchy
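A starting point for the distributional questions above, sketched with a placeholder generator (the article's actual trained models and conditioning data would replace these stand-ins):

```python
import torch

generator = torch.nn.Linear(16, 2 * 32)  # stand-in FiLM generator
z = torch.randn(1024, 16)                # batch of conditioning inputs
gamma, beta = generator(z).chunk(2, dim=1)

print("mean |gamma|:", gamma.abs().mean().item())
print("fraction of negative gammas:", (gamma < 0).float().mean().item())
print("beta mean/std:", beta.mean().item(), beta.std().item())
```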

New paper

Hi,
I found two papers at NIPS that should be integrated into the survey:

- Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning. Idem: they train a network (generator) from external input and then predict a convnet.

@ethanjperez : the last paper is also very interesting, as the authors applied the same zero-shot approach as we used for FiLM to compute a convnet that will be used for the policy network. Yet they have to regularize the loss to enforce a linear combination, while we did not have to apply this trick.

If you agree that the papers are relevant, I can write a short summary.

Pedro's feedback

FiLM as low-rank bilinear

In the current description it is not obvious that FiLM is a rank-1 approximation to a bilinear transformation. You've added the bias term, which requires appending ones to the input, so I'm not sure if we can still explicitly cast it to a rank-1 W matrix.

What do you think?
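One possible way to write the casting down, assuming a linear FiLM generator (gamma(z) = Γz, beta(z) = Bz); this is a sketch for discussion, not a claim about the article's final notation:

```latex
y_i = (\Gamma z)_i \, x_i + (B z)_i
    = x^\top \big( e_i \Gamma_{i,:} \big) z + B_{i,:} z .

\text{Appending a constant one to } x, \;\; \tilde{x} = [x;\, 1]:
\qquad
y_i = \tilde{x}^\top W_i z,
\qquad
W_i = \begin{bmatrix} e_i \Gamma_{i,:} \\ B_{i,:} \end{bmatrix}.
```

So with the bias row included, each W_i has rank at most two (it is rank one only if the beta term is dropped), which may be worth stating explicitly in the text.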

FiLM Layer Diagram (Fig 6)

Initial version:

[image: initial version of Fig. 6]

Mock-up of a possible alternative (reduce noise, integrate exposition, etc.).

Note that I also flipped the vertical flow of the diagram, so that the process aligns with reading direction. This makes it more natural to integrate exposition.

[image: mock-up of possible alternative]

Things to add

  • Discuss the nature of problems for which FiLM would be a good fit. E.g., if your tasks are really related (as is the case for CLEVR), there is a lot of overlap in computation. (Be careful in phrasing it: we don't want to undersell how much FiLM is able to modulate the FiLMed network.) (-> Harm)
  • Examine David Krueger's HyperBayes paper. (-> Aaron)
  • Besides VQA, do we know of more domains in which bilinear methods have been effectively applied? (-> Florian)
