distillpub / post--feature-wise-transformations

Feature-Wise Transformations

Home Page: https://distill.pub/2018/feature-wise-transformations

License: Creative Commons Attribution 4.0 International

HTML 78.83% CSS 1.19% TeX 10.21% JavaScript 9.77%

post--feature-wise-transformations's Introduction

Distill submission repository

Submission document

The submission document is located at src/index.ejs.

Developing

npm install to install dependencies. (Can take minutes!)

npm run dev to run a watching server.

npm run build to build, transpile, babel-ify and minify files.

Components are in src. The .html files are Svelte components; the .js files are compilation endpoints that are also defined in webpack.config.js. These compiled endpoints are then consumed by hand-authored .ejs files in src.

Visit localhost:8080/index.html for a hot-reloading preview of the article.

post--feature-wise-transformations's People

Contributors

aaroncourville, colah, ethanjperez, fstrub95, harm-devries, ludwigschubert, nschuc, shancarter, vdumoulin


post--feature-wise-transformations's Issues

Archy's feedback

  • (Archy) I found the maths in [the bilinear transformation] section a bit confusing, probably because I'm not very good at maths, but also because of the number of different nomenclatures flying around. Some visualizations will hopefully help (and look like they're in the pipeline).
  • We should discuss the connection to highway networks, which use sigmoidal gating (as in LSTMs) in the layers of a very deep network.
  • The LSTM and PixelCNN self-conditioning explanations are fragmented and list-like.
  • The phrase "Going backwards from the definition of each computation mechanism, we will now explain how they can be expressed in terms of generalized bilinear transformations." is a bit jarring.
  • Add a supporting example to "It could also be that an image needs to be processed in the context of a question being asked."
  • "Attention is a good example of one such principled approach: side information is used [...]" -> s/side/contextual
  • Add a supporting example to "For instance, in conditional decoder-based generative models, we would like to map a source of noise to model samples in a way that is class-aware."
  • The use of both feature-wise affine transformations and feature-wise linear modulation is confusing.
  • (Archy) I think [starting with concatenation] is very useful, i.e. "Let's think of the dumbest solution possible and examine why it doesn't work." I'd be tempted to move this further up the article, to give a simple concrete example of how you might incorporate contextual details. You can then explain the shortcomings and invoke FiLM as the solution. This will also help clear up some ambiguity for the naive reader as to whether you're considering this 'concatenate and forget' solution as an example of FiLM or not (it's not, right, because there is no modulation of existing features, you just add a bunch more?).
  • (Archy) I see we're working towards the definition of a FiLM that you gave earlier -- composed of a biasing and a scaling -- but that progress is not particularly limpid. Could you restructure to make it clear that you're talking about particular parts of the FiLM definition i.e. have subtitles like "+B(Z)" or "Y(x) . x" or something like that? It'd be helpful to more clearly embed the literature review in an exploration of the equation.
  • "Several variants of FiLM can be found in the literature." -> We need to make it clear that we've been discussing PARTS of the FiLM formulation, and now we're discussing models that use all of the bells and whistles.
  • "So far, the distinction between the FiLM generator and the FiLM-ed network has been rather clear, but it is not strictly necessary." -> To make it even more clear, we could spell it out here: "We've had one network which outputs parameters for the transformation, and these are applied to the layers of a second network."
  • "By feature-wise, we mean that scaling and shifting are applied element-wise, or in the case of convolutional networks, feature map-wise." -> (Archy says) This is a little confusing. Element-wise is a mathematical statement, but feature map-wise is a statement about what that unit of the network represents. We should explain the level of granularity in convolutional neural networks vs. fully-connected networks. In CNNs, a feature map is the same feature observed at different spatial locations.
  • Define gamma and beta before introducing them in the first equation of the article.
  • When discussing multiplicative interactions, it would be helpful to introduce the term 'conditional scaling' at this point rather than later, and to clarify the relationship to conditional biasing.
  • Add a "spoiler" sentence connecting CBN and FiLM when introducing CBN.
  • Merge the sentence "We can also use FiLM layers to condition a style transfer network on a chosen style image." into the next paragraph.
  • Should there be a sub-heading for self-conditioned models?
  • (Archy) Can you say something about how squeeze-and-excitation differs from the norm? i.e. all layers are conditioned on previous layers in a vanilla NN... I think this description is a little underspecified. Isn't SE more about allowing between-channel interactions than between-layer interactions?
  • Beef up the conclusion. What is it that our new formulation brings to the table? Why is thinking of these different things within a single family of bilinear transformations a useful exercise? Are there open questions we would like to see answered?
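Several of the bullets above lean on the FiLM definition (per-feature scaling by gamma and shifting by beta) and on the element-wise vs. feature map-wise distinction. A minimal PyTorch sketch of a FiLM layer, written for this discussion rather than taken from the article's code, makes the granularity concrete:

```python
import torch

def film(x, gamma, beta):
    # x: (batch, channels, height, width) feature maps from a convolutional layer.
    # gamma, beta: (batch, channels) coefficients predicted by the FiLM generator.
    # Each (gamma_c, beta_c) pair is broadcast over every spatial location of
    # feature map c: feature map-wise in a CNN, element-wise in a dense network.
    return gamma[:, :, None, None] * x + beta[:, :, None, None]

# Toy usage: a linear generator maps conditioning input z to one (gamma, beta)
# pair per channel. Shapes and sizes here are illustrative, not the article's.
z = torch.randn(4, 16)                   # conditioning input, e.g. a question embedding
generator = torch.nn.Linear(16, 2 * 32)  # predicts gamma and beta for 32 channels
gamma, beta = generator(z).chunk(2, dim=1)
x = torch.randn(4, 32, 8, 8)             # activations of the FiLM-ed network
y = film(x, gamma, beta)
```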

Initial FiLM diagram (Fig. 5)

This is the present version of the first diagram of a FiLM network (Fig. 5 as of 5857f08):

[image: current version of Fig. 5]

Some things to think about:

  • Should FiLM parameters really be boxes? Other boxes are neural net chunks, but it's just a vector.
  • Can this diagram be made less visually noisy?
    • The present grouping of the FiLM-ed network is pretty heavy-handed. You could probably get away without anything, just because of the visual gestalt of the column.
    • Is color really serving an important role, or is it primarily distracting?
    • Related: you probably only want to use color for one thing (e.g. encoding positive/negative values) in a diagram and to use it consistently. Overloading color tends to be confusing.
  • Consider integrating your caption into the diagram. This is one of the best explanatory tricks I know.
    • My pop-psych story about why this is so powerful: when someone is trying to understand an idea, they usually have to hold lots of things in their head. Since they only have about 7 slots of working memory, these are really in demand. When they need to do the extra work of pulling snippets of the explanation out of the caption or surrounding text, it deprives them of a slot or two.

Here's a very fast redesign. Take it with a grain of salt, but you might find it useful to consider. Notice how I've tried to reduce noise and emphasize the important parts. I've also integrated the explanation into the diagram a little bit, although it's possible you might want to go further. I've tried to preserve color for use with the positive/negative values.

[image: quick redesign mock-up]

Xavier's feedback

  • On one hand… on the other hand… implies the two themes are in opposition.
  • Maybe a quick schematic in the section comparing FiLM to attention showing spatial-attention vs. feature attention?

Feature-wise transformation section

  • Don’t only use colour to differentiate the generators, layers, and networks, or if you do, check the colours for colour-blindness!
  • It’s hard to remember, as you read down the article, which colour referred to which kind of element. At the very least, re-including the legend would be helpful. I found the “hover” that replaces the name of the box with the annotation (“FiLMed network”, etc.) confusing. And not a great interaction on mobile, either.

Concatenation section

  • “An improvement over this idea would be to concatenate …”. This is repeating things from earlier in the article, which is ok but I got confused at first and thought that I was re-reading the same section. Some nod to that previous section would be good, like “as described above, an improvement over this idea would be to concatenate z to …”.
  • “This is in fact the approach adopted by the conditional variant of DCGAN [8], a well-recognized class of network architectures for generative adversarial networks (GANs)” Is the reader expected to know what a GAN is? This part is a bit funny because it explains the acronym GAN as though we aren’t familiar, but then doesn’t mention at all what GANs are used for, and doesn’t explain the acronym DCGAN. I might say:

“Deep convolutional generative adversarial networks (DCGAN) are a popular class of generative model often used to create images. The conditional variant, where a single network can create images from multiple classes, uses exactly this concatenation idea, broadcasting the class label as a feature map and concatenating it to the input of the convolutional and transposed convolutional layers in the generator and discriminator networks”

That said, even this assumes a lot of domain knowledge about what exactly a GAN is, (i.e. what a generator and discriminator network is!) I’d be tempted to do a heavier re-work of the system and actually more concretely explain how it works.
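For what it's worth, the broadcast-and-concatenate mechanism described in the proposed rewrite can be sketched in a few lines. This is an illustration of the idea, not the DCGAN reference code, and the names and shapes are made up:

```python
import torch

def concat_condition(x, z):
    # x: (batch, channels, H, W) convolutional feature maps.
    # z: (batch, dim) conditioning vector, e.g. a one-hot class label.
    # Broadcast z over the spatial grid so it becomes `dim` constant feature
    # maps, then concatenate along the channel axis.
    b, _, h, w = x.shape
    z_maps = z[:, :, None, None].expand(b, z.shape[1], h, w)
    return torch.cat([x, z_maps], dim=1)

x = torch.randn(4, 32, 8, 8)
labels = torch.tensor([0, 1, 2, 3])
z = torch.nn.functional.one_hot(labels, num_classes=10).float()
h = concat_condition(x, z)  # shape: (4, 42, 8, 8)
```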

Biasing section

  • It might be helpful to first explain what conditional biasing is, and then explain how it’s equivalent to concatenation. Right now it happens in the opposite order: I’m told it’s equivalent to concatenation and shown how, and I have to infer from that explanation what is meant by conditional biasing. Then a couple of paragraphs later conditional biasing is explained. (A short worked sketch of the equivalence follows.)
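Here is one way the worked sketch could go: pushing the concatenated input through a single linear layer and splitting the weight matrix shows that the z-dependent term is exactly a conditional bias. A minimal numpy illustration, with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)       # input features
z = rng.normal(size=2)       # conditioning information
W = rng.normal(size=(4, 5))  # linear layer acting on the concatenation [x; z]

# Split W into the block acting on x and the block acting on z.
W_x, W_z = W[:, :3], W[:, 3:]

concat_view = W @ np.concatenate([x, z])  # conditioning via concatenation
bias_view = W_x @ x + W_z @ z             # same output: W_z @ z is a conditional bias
assert np.allclose(concat_view, bias_view)
```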

FiLM section

  • Give examples in the main text body of what a textual question about an image would be. For instance, “How many yellow blocks are there?” Similarly for the GuessWhat?! task: give an example of a natural-language question about an image.
  • Bolding for “conditional instance normalization” first-letters is odd.
  • I realize this isn’t a super actionable observation, but I found myself reading and re-reading this section, losing the thread of the narrative. It reads kind of like a “grab bag of places people have used FiLM”. Maybe a way to tighten it up would be to set up more context in the first paragraph somewhere: “FiLM layers have proved useful on many problems, from answering natural-language questions about photographs, to quickly and flexibly transferring the ‘style’ of one image to the content of another, to reinforcement learning on Atari games”.

Self-conditioning section

  • This section could really use a “toy” example of what self-conditioning is before diving into an LSTM. Not unlike the great toy example from earlier showing why concatenation is the same as biasing. (A candidate toy sketch follows this list.)
  • “Utt. sum” is a pretty inscrutable abbreviation in the figure. Just make that box wider and write it out in full :)
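A candidate toy example for the first bullet, kept deliberately simpler than an LSTM: a layer whose own features produce per-feature sigmoidal gates that then rescale those same features. This is a sketch for discussion, not code from the article:

```python
import torch

class SelfGate(torch.nn.Module):
    # Self-conditioning in miniature: the gates are computed from x itself,
    # so the layer's features are modulated feature-wise by its own features,
    # in the spirit of LSTM-style gating.
    def __init__(self, dim):
        super().__init__()
        self.to_gate = torch.nn.Linear(dim, dim)

    def forward(self, x):
        gate = torch.sigmoid(self.to_gate(x))  # one gate in (0, 1) per feature
        return gate * x                        # feature-wise scaling conditioned on x

x = torch.randn(4, 32)
y = SelfGate(32)(x)
```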

Bilinear transformations

  • These figures should label what is x and what is z for clarity.

General thoughts

  • I find the numerical-citation technique “as shown in [23] you can …” a little confusing. I kept thinking there was a grammatical issue in a sentence before realizing that it was just because the subject of the sentence was a citation. This is exacerbated by the fact that the citation links render in a lightly coloured, super-scripted font, and so don’t look like they’re part of the text.
  • I would have loved an abstract or TL;DR at the top. It’s not obvious that you’re going to be introducing a single analytical framework that subsumes FiLM, attention, etc. despite that maybe being an exciting contribution of the article. (Maybe this is coming?)

Fix bib items

  • ref 12. update title from "Film..." to "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning"
  • ref 1. Put @fstrub95 before me on author list of FiLM

Attention vs FiLM

What do you think about adding a paragraph explaining why FiLM may get better results than advanced attention mechanisms?

In a few words (I am merely repeating Harm's intuitions): attention assumes that key information must be retrieved by emphasizing specific spatial or temporal positions (pixels, states in an RNN), while modulation assumes that specific neural activations must be emphasized independently of their spatial/temporal location.
In other words, the two mechanisms work along two different dimensions.
They would be complementary if features were independent of one another. Yet this is not the case! Because of the convolution windows, nearby feature-map pixels carry overlapping information, and a convnet can already filter information spatially (especially if you provide a location mask); the same happens in an RNN as the cell state is updated. So there is redundant information from one hidden state to another, or from one pixel to another. Thus, working only along the pixel/time dimension is sub-optimal, as the network may already have performed an attention mechanism by itself.

In CLEVR, stacked attention does not work, while FiLM does.
In RNNs, self-attention was often reported to have a limited impact, while feature-wise max-pooling or layer norm has been very successful (I have a few citations for both cases).

To conclude, FiLM already enables a kind of attention, by selecting the features that natively perform this attention process. On the other hand, attention cannot perform a feature-wise selection.

Well, this late draft is a bit confused, but I think you already got the point!

Florian
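Florian's "two different dimensions" point can be made concrete with a small sketch. The weights below are random placeholders (in practice attn_logits would come from a query and gamma from a FiLM generator), but the broadcasting shows which axis each mechanism operates on:

```python
import torch

x = torch.randn(4, 32, 64)  # (batch, features, locations), e.g. flattened feature maps

# Spatial attention: one weight per location, shared across all features.
attn_logits = torch.randn(4, 64)
attn = torch.softmax(attn_logits, dim=-1)
attended = (x * attn[:, None, :]).sum(dim=-1)  # (batch, features)

# Feature-wise modulation: one coefficient per feature, shared across locations.
gamma = torch.randn(4, 32)
modulated = x * gamma[:, :, None]              # (batch, features, locations)
```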

Gating section: $a_k$ undefined?

The second equation uses a variable $a_k$ that I can't find defined anywhere. Maybe it's meant to be another variable, or was meant to be defined? Or maybe it is defined somewhere, just not easy to find nearby?

Discussion between Ethan and Vincent

  • Linear or MLP boxes might give the sense that the FiLM parameters are shared
  • "Normalization" or "whitening"?
  • Show just one LSTM cell in dynamic layer normalization
  • Is "ReLU layer" clear enough?
  • Simplify bilinear section by removing the explicit use of biasing and incorporating scaling and shifting coefficients into z directly.
  • "We conjecture (and in some sense Harm and Florian's paper suggest) that we may benefit from having those modalities interact in simple ways in more than one place along the computation of the model’s output."
  • Remove AdaIN? Or mention it briefly after squeeze-and-excitation?
  • Put squeeze-and-excitation before DLN?
  • "[...] employs a self-conditioning scheme in the form of feature-wise sigmoidal gating as a way to condition a layer's features on themselves."
  • We're not sure about the connection to DiSAN anymore.

Comments on Connecting FiLM, gating and attention

I think the connection is pretty neat :) Glad you found a way to cast attention as a generalized bilinear transformation!

Some thoughts when reading the section:

  • `to accommodate affine transformations`: not sure everyone understands the subtle difference between affine and linear transformations, so I would prefer `to accommodate bias terms`.
  • Concerning attention: we have to be explicit that x is a matrix (rather than a vector) with a very unintuitive layout. Namely, rows correspond to features and columns to spatial locations. All implementations of attention I have seen use a [batch, locations, features] layout, so emphasizing this might be useful when parsing the equations. For example, in the first equation of the paragraph (formalizing attention) we could be explicit about the matrix layout by changing $x^t_k$ into $X_{k,t}$. (A small layout sketch follows this list.)
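A tiny numpy sketch of the layout point, illustrative only: in the article's convention X[k, t] is feature k at location t, whereas implementations usually store the transpose. Attention over locations gives the same attended feature vector either way:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_locations = 32, 49

X = rng.normal(size=(n_features, n_locations))  # article layout: X[k, t]
X_impl = X.T                                    # implementation layout: (locations, features)

attn = np.exp(rng.normal(size=n_locations))
attn /= attn.sum()                              # attention weights over locations

attended_article = X @ attn                      # sum over columns (locations)
attended_impl = (attn[:, None] * X_impl).sum(0)  # sum over rows (locations)
assert np.allclose(attended_article, attended_impl)
```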

Order of explanation in conditional biasing

It might be helpful to first explain what conditional biasing is, and then explain how it’s equivalent to concatenation. Right now it happens in the opposite order, I’m told it’s equivalent to concatenation and shown how, and I have to infer from that explanation what is meant by conditional biasing. Then a couple paragraphs later conditional biasing is explained.

Xavier

Meeting points

  • sequential nature of attention
  • even more naive: one network per class
  • highlight parameter sharing
  • remove "so-called"
  • in the introduction, talk about why we want to consider the specific family of feature-wise affine transformations
  • add a visual example of conditioning
  • concat: input top-left
  • involve chris olah et al
  • class-conditional -> conditional. make it clear that this is an example, and conditioning is more than just class-conditioning.
  • replace PixelCNN figure with simpler example of conditional biasing.
  • FiLM-ed layer and FiLM layer are too close phonetically.
  • Idea for animation: class-conditional generative model on MNIST, identify which units are activated for which classes.

Contact Hochreiter

He most probably has written papers on the subject of multiplicative interactions, or feature-wise interactions, so it'd be good to have his input.

General diagram advice

Reduce visual noise

  • Get rid of anything that isn't necessary for the diagram's purpose
  • Emphasize the things you want your reader to focus on. De-emphasize things you don't want them to focus on.
  • Use color sparingly

Integrate Text

  • Look at the caption below each image and ask which parts relate to which section of the image, and whether you can place them there.
  • Consider reading through the momentum article or the attention article, paying attention to the way text is integrated into diagrams.

Specific comments on a few diagrams

[image]

Is the loss actually important in the above diagram? Or is it a distraction from the core point?


[image]

I'm not sure the present diagram helps me understand what's going on very much. I think you could probably make what is actually going on clearer by showing explicit pointwise multiplication and putting a bit of text by the AdaIN block.


[image]

Is the residual block correct in this diagram? Shouldn't the skip go over both convolutions?

I do wonder if there's a way you could present that diagram without having to do the zoom-in. It seems plausible that the right simplification could allow you to explicitly have FiLM blocks in the FiLM-ed network diagram.

How to discuss feature-wise multiplicative vs. additive interactions

In the "Attention over features" section, it's stated that "Both papers [on gated-attention] show improvement over conditioning via concatenation." Overall, we bring up feature-wise biasing and feature-wise scaling, so it's worth chatting about how we want to handle this discussion.

Should we discuss performance comparisons between concatenation/feature-wise biasing and feature-wise multiplication/gating? It seems like this isn't the point of our paper, and that these distinctions might distract readers; we don't want to reinforce that multiplicative vs. additive interactions is the main difference (which isn't what the FiLM paper found), but rather that the feature-wise aspect is important.

I'd be in favor of replacing sentences like "Both papers [on gated-attention] show improvement over conditioning via concatenation." with something more specific on how the feature-wise aspect helps learning. I.e. "\cite{Gated-Attention Architectures for Task-Oriented Language Grounding} shows that feature-wise modulation enables agents to learn to follow language instructions in reinforcement learning by modulating which features in its visual pipeline/convolutional neural network will be important for its downstream policy network for instruction-following. Only features relevant to the particular categories of object types referenced by the language instruction are significantly activated."

I think this is my fault (I must have included this information in the literature review), but I just realized it reading over the draft.
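For reference while we chat about this, a hedged sketch of the gated-attention mechanism under discussion (names and shapes are illustrative, not taken from either paper): a language embedding produces per-channel sigmoid gates that modulate which visual feature maps reach the downstream policy network.

```python
import torch

lang = torch.randn(4, 64)          # instruction embedding
visual = torch.randn(4, 32, 8, 8)  # convolutional feature maps

to_gates = torch.nn.Linear(64, 32)
gates = torch.sigmoid(to_gates(lang))     # (batch, channels) gates in (0, 1)
gated = visual * gates[:, :, None, None]  # feature-wise (per-map) gating
```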

Style transfer network (Fig. 14)

Original diagram:

[image: original style transfer diagram]

Potential redesign:

[image: potential redesign]

Or if you want to use color:

(I actually really like the effect of adding color like this. It makes the diagram feel kind of warm and friendly, and increases the division between the generator and the FiLMed network. If you can isolate the use of color for positive/negative from places where you use it for organization, and pick different color schemes, it may be a good idea to do this.)

[image: redesign with color]

Things to pay attention to:

  • We get rid of all the containers, but achieve the same goals with organization and annotation.
  • The loss doesn't seem relevant to the core message. If we get rid of it, we can simplify the layout a lot.
  • Integrate text.

First two figures don't render on Safari

Running in Python3, the Zoolander and CLEVR figures don't render. Neither does the puppy in the fourth figure, so it's a problem with your nicely vectorized images.

Actually - could you label the figures with numbers too?

Acknowledgements

The following people should be acknowledged for their feedback:

  • Pedro Oliveira Pinheiro (Element AI)
  • Alexei (Element AI)
  • Minh Dao (Element AI)
  • Masha Krol (Element AI)
  • Archy de Berker (Element AI)
  • Dzmitry Bahdanau (MILA)
  • Roland Memisevic (TwentyBN)
  • Xavier Snelgrove (Element AI)

Add a ReLU nonlinearity in the FiLM figure

Should help give the reader a sense of what we can accomplish when combining FiLM and ReLU.

We should be careful in adding that element to the figure, however: we don't want to give the impression that FiLM is always combined with ReLUs, so we need to find a way to make that somehow distinct from the rest of the figure.
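One way to see what the combination buys, as a toy calculation rather than anything from the article: after relu(gamma * x + beta), beta shifts the threshold at which a feature turns on, and a negative gamma flips which side of the threshold passes through.

```python
import torch

x = torch.linspace(-2.0, 2.0, steps=9)

# FiLM followed by ReLU: relu(gamma * x + beta).
# The feature turns on where gamma * x + beta > 0, i.e. at x = -beta / gamma.
for gamma, beta in [(1.0, 0.0), (1.0, -1.0), (-1.0, 0.0)]:
    print(gamma, beta, torch.relu(gamma * x + beta))
```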

Write analysis section

  • Are the FiLM parameters interpretable in some way?
    • Is it unique to FiLM, or is something like conditional biasing, conditional scaling, sigmoid gating also exhibiting interpretable coefficients?
  • Is FiLM being used as a gating-like mechanism, i.e., to control which feature maps get passed onto the following layers?
  • Do we see evidence that FiLM is taking full advantage of its affine transformations, for instance by using the amplification of activations to pass things through?
  • What do the gammas look like in distribution for a sigmoidal gating network? (A measurement sketch follows this list.)
  • Do the observations we're making apply to problem settings other than CLEVR, like style transfer?
  • Can we find evidence supportive of the soft numerical program interpretation?
  • Why is repeated conditioning important?
  • Is the space spanned by FiLM parameters "linear" in some way?
  • Magnitude of gamma
  • Gamma vs. beta
  • Conditioning throughout the hierarchy
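A starting point for the distributional questions above, sketched with a placeholder generator (the article's actual trained models and conditioning data would replace these stand-ins):

```python
import torch

generator = torch.nn.Linear(16, 2 * 32)  # stand-in FiLM generator
z = torch.randn(1024, 16)                # batch of conditioning inputs
gamma, beta = generator(z).chunk(2, dim=1)

print("mean |gamma|:", gamma.abs().mean().item())
print("fraction of negative gammas:", (gamma < 0).float().mean().item())
print("beta mean/std:", beta.mean().item(), beta.std().item())
```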

New paper

Hi,
I found two papers at NIPS that should be integrated into the survey:

- Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning. Idem: they train a network (generator) from external input and then predict a convnet.

@ethanjperez : the last paper is also very interesting, as the authors applied the same zero-shot approach as we used for FiLM to compute a convnet that will be used for the policy network. Yet they have to regularize the loss to enforce a linear combination, while we did not have to apply this trick.

If you agree that the papers are relevant, I can write a short summary.

Pedro's feedback

FiLM as low-rank bilinear

In the current description it is not obvious that FiLM is a rank-1 approximation to a bilinear transformation. You've added the bias term, which requires appending ones to the input, so I'm not sure if we can still explicitly cast it to a rank-1 W matrix.

What do you think?
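One possible way to write the casting down, assuming a linear FiLM generator (gamma(z) = Γz, beta(z) = Bz); this is a sketch for discussion, not a claim about the article's final notation:

```latex
y_i = (\Gamma z)_i \, x_i + (B z)_i
    = x^\top \big( e_i \Gamma_{i,:} \big) z + B_{i,:} z .

\text{Appending a constant one to } x, \;\; \tilde{x} = [x;\, 1]:
\qquad
y_i = \tilde{x}^\top W_i z,
\qquad
W_i = \begin{bmatrix} e_i \Gamma_{i,:} \\ B_{i,:} \end{bmatrix}.
```

So with the bias row included, each W_i has rank at most two (it is rank one only if the beta term is dropped), which may be worth stating explicitly in the text.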

FiLM Layer Diagram (Fig 6)

Initial version:

[image: initial version of Fig. 6]

Mock-up of a possible alternative (reduce noise, integrate exposition, etc.).

Note that I also flipped the vertical flow of the diagram, so that the process aligns with reading direction. This makes it more natural to integrate exposition.

[image: mock-up of possible alternative]

Things to add

  • Discuss the nature of problems for which FiLM would be a good fit. E.g., if your tasks are really related (as is the case for CLEVR), there is a lot of overlap in computation. (Be careful in phrasing it: we don't want to undersell how much FiLM is able to modulate the FiLMed network.) (-> Harm)
  • Examine David Krueger's HyperBayes paper. (-> Aaron)
  • Besides VQA, do we know of more domains in which bilinear methods have been effectively applied? (-> Florian)
