
mlss2019_notes's Introduction

Howdy

Notes for Machine Learning Summer School 2019 in Moscow, Russia. See issues.

These notes are selective and a little sloppy; feel free to check out the official git repository for MLSS2019: slides, tutorial

Schedule

w1_schedule

w2_schedule


mlss2019_notes's Issues

Bayesian deep learning (Yarin Gal)

Bayesian deep learning

slides
When deep learning is combined with probability theory we can capture uncertainty in a principled way; this is known as Bayesian deep learning.

Definition: we make some assumptions about how the data was generated, namely that there exists some underlying process behind the observations. Because this process is not explicit, we want to infer it.

Inference

drawing

How to do inference

drawing

We want to infer W (posterior). After some information theory magic, we can find the mean and variance:

drawing

Predictions

drawing
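The slides make predictions by averaging over samples of W from the (approximate) posterior. As a minimal sketch, here is how that looks with Monte Carlo dropout, one practical approximation used in this line of work (the PyTorch model and the helper name are illustrative assumptions, not code from the lecture):

```python
import torch

def mc_dropout_predict(model, x, n_samples=50):
    """Approximate predictive mean and variance by keeping dropout
    active at test time, i.e. sampling weights from the approximate posterior."""
    model.train()  # leave dropout layers "on"
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)
```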

Reinforcement learning (Shimon Whiteson)

Reinforcement learning

slides

How can an intelligent agent learn from experience how to make decisions
that maximise its utility in the face of uncertainty?
image
Unlike in unsupervised learning we do have feedback, but it comes in the form of reward. This feedback is, however, not as strong as in supervised learning.

Definition and intuition

In RL an agent tries to solve a control problem by directly interacting with an unfamiliar environment. The agent must learn by trial and error, trying out actions to learn about their consequences.

Difference to supervised learning

  1. Agent has partial control over what data it collects in the future;
  2. No right and wrong, just rewards for actions;
  3. The agent must learn on-line: must maximise performance during learning, not afterwards.

The reward must be quantifiable -- hence the reward design problem: what counts as a "good" or "bad" action?

When the reward design is poor, the agent can have undesirable behaviour.

K-armed bandit problem

Setting: you're an octopus, sitting before a slot machine (bandit) with many arms, where each arm has an unknown stochastic payoff. The goal is to maximise cumulative payoff over some period.

image
Formalising:
image
An infinite-horizon problem is, for example, Google's search engine. \gamma tells you how much you care about the future, i.e. how much you weigh instant gratification vs. long-term reward.

Exploration and exploitation

Explore: try out the arms in order to learn about them and improve the chances of getting future reward;

Exploit: focus on the most profitable arm and get the largest reward.

How do we balance exploration and exploitation?

  • Horizon is finite: exploration should decrease as the horizon gets closer
  • Horizon is infinite but \gamma < 1: exploration should decrease as the agent's uncertainty about the expected rewards goes down
  • Horizon is infinite and \gamma = 1: infinitely delayed splurge -- you have an infinite future ahead of you, so you always explore and delay gratification indefinitely.

To address this we have ---

Action value methods

image

A few different methods:

  1. epsilon-greedy:
    image

  2. softmax exploration: concentrate exploration on most promising arms.
    image

  3. Upper confidence bound:
    Neither epsilon-greedy nor softmax considers uncertainty in the action-value estimates, while the goal of exploration is to reduce uncertainty. So focus exploration on the most uncertain actions. This uses the principle of optimism in the face of uncertainty.

Focus on arms that are promising and uncertain for exploration.

image
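A minimal numpy sketch of the three action-selection rules above (constants such as the temperature and the UCB coefficient are illustrative choices, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, eps=0.1):
    # with probability eps explore uniformly, otherwise exploit the best arm
    if rng.random() < eps:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def softmax_action(Q, tau=1.0):
    # concentrate exploration on the most promising arms
    p = np.exp((Q - Q.max()) / tau)
    p /= p.sum()
    return int(rng.choice(len(Q), p=p))

def ucb_action(Q, counts, t, c=2.0):
    # optimism in the face of uncertainty: bonus for rarely tried arms
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(Q + bonus))
```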

Contextual bandit problem

image
So it's a traditional bandit problem, but instead of conditioning Q on the action alone, it conditions on the action and the state together.

Our action doesn't just affect reward, it also affects the state of the world:
image

The credit-assignment problem: suppose an agent takes a long sequence of actions, at the end of which it receives a large reward. How does it determine to what degree each action in that sequence is responsible for the resulting reward?

Markov decision processes: formalising the RL problem

image
stationary: the rules of physics aren't changing
stochastic: outcomes are somewhat random

image

MDP example: recycling robot

image

The Markov property

image

The current state is a sufficient statistic of the agent's history; conditioning actions on the rest of the history cannot possibly help. This restricts the search to reactive policies:

image

Is it Markov?

  1. Robot in a maze, state is wall or no wall on each of the 4 sides, action is up, down, left, right: no
  2. Chess, state is board position, action is legal move: yes (minus the special rules)

Return: value function and Bellman equation

We now need to reason about long term consequences, and we can do so by maximising the expected return, which is the sum over the rewards received.
image

Value function: value functions are the primary tool for reasoning about future reward. The value of a state is the expected return under the policy.

image
Note that in the action-value function the first action can deviate from the policy (it doesn't have to be the policy's action).

Bellman equation: we are not sure what the next state will be, so we take the expectation over actions and next states, expressing a state's value in terms of its successors' (estimated) values. This is commonly known as bootstrapping:
image

The difference between the two equations is whether we bootstrap over state or state-action pair.

To gain some intuitions:
image

Look at the LHS of the tree, corresponding to the first equation:

  • At the first extension of the tree, we look at all the actions -- the first summation
  • At the second extension of the tree, we look at all the stochastic outcomes -- the second summation

Optimal value functions

image

The Bellman optimality equations express this recursively: we replace the Bellman equation's expectation over actions with a maximisation wrt action:

image

A recap -- P: transition function, R: reward function

image

Planning with MDP

MDPs give us a formal model of sequential decision making. Given the optimal value function, computing an optimal policy is straightforward. But how can we find V* or Q*?

Algorithms for MDP planning compute the optimal value function given a complete model of the MDP. Given a model, V* is usually sufficient.

Dynamic programming approach

image

We start off with an arbitrary policy (pi), then compute the true value function of that policy (V). We then use the value function to figure out an incremental improvement to the policy. Repeat the process until you reach the optimal point.

We can use the Bellman equation to exploit the relationship between states (instead of estimating each state independently). The initial value function is chosen arbitrarily.

Policy evaluation update rule

image

Each state's estimate is backed up from its successor states. Apply the update to every state in each sweep of the state space. Repeated over many sweeps, it eventually converges to the fixed point where V_k = V^pi (we get the true value of the arbitrary policy).

image

image
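As a sketch, the policy evaluation sweep can be written for a small tabular MDP. The array layout (P[s, a, s'] transition probabilities, R[s, a] expected rewards, pi[s, a] action probabilities) is an assumption made for illustration:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    """Iterate the Bellman backup over all states until V_k converges to V^pi."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = (pi * Q).sum(axis=1)   # expectation over the policy's actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```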

More intuitively:
image

We start from a random point and first perform policy evaluation, which gets us to where the value is exactly that of the policy (V = V^pi). We then perform policy improvement, at which point the value function no longer matches the new policy's true value function. Keep iterating and you converge to the optimal value.

Guaranteed convergence: how? (i.e. how do we know these two lines intersect, so that you can always get to an optimal point?) Here we are doing closed-loop updates, which means the action always has the opportunity to condition on the state.

"A counter example": say you take two actions, left and right and you have two timesteps. if these are the true values corresponding to action:

  • LL yields return of 5
  • LR yields return of 0
  • RL yields return of 0
  • RR yields return of 10

So if you start from LL, will you be stuck in a local maximum?

No, precisely because of the closed-loop updates -- we consider the states reached by every action, so even when the first step taken is left, the Bellman equation also considers the state you would reach by stepping right. Then, when we try to find the optimal action, it figures out that taking two steps to the right gives the most value.

Value iteration

We do not always have to wait for policy evaluation to converge before doing policy improvement -- for example, we can do just a few evaluation sweeps (even a single one) before each improvement step.

image

Here we take the Bellman optimality equation and turn it into an update rule.
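A matching sketch of value iteration (same illustrative P and R arrays as in the policy evaluation sketch):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Bellman optimality backup: max over actions instead of an expectation."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = np.max(R + gamma * P @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = np.argmax(R + gamma * P @ V, axis=1)  # greedy policy w.r.t. V*
    return V, policy
```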

MC methods

MC provides one way to perform reinforcement learning: finding optimal policies without an a priori model of the MDP.
MC for RL learns from complete sample returns in episodic tasks: it uses value functions but not Bellman equations.

image

image

So this is not a tree any more; we just keep performing rollouts and calculating the return (this is like a depth-first search, whereas the Bellman backup is like a breadth-first search).

image

image

We just carry out the entire policy in the real world.
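A sketch of first-visit Monte Carlo prediction from complete rollouts (the run_episode interface, returning (state, reward) pairs, is a hypothetical convention for illustration):

```python
from collections import defaultdict

def mc_prediction(run_episode, policy, gamma=1.0, n_episodes=1000):
    """Estimate V^pi by averaging complete sample returns; no Bellman equation used."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = run_episode(policy)          # [(state, reward), ...] for one rollout
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):  # accumulate the return backwards
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:              # only the first visit contributes
                returns[s].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}
```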

On-policy MC control

image
Caveat: converges to the best epsilon-soft policy rather than the best overall policy.

Off-policy MC control

To avoid the caveat, we can do off-policy MC, which allows the estimation (target) policy to differ from the behaviour policy. This is done using importance sampling.

image
image
image

The variance depends on the difference between the target policy and the behaviour policy; if the two policies differ too much, the quality of the estimate is very limited.

Temporal-difference methods

TD(0): estimation of V

image
image

Pseudo code:
image
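The TD(0) update in code (the env interface returning (next_state, reward, done) is a simplifying assumption, not a particular library's API):

```python
from collections import defaultdict

def td0(env, policy, alpha=0.1, gamma=0.99, n_episodes=500):
    """Estimate V^pi incrementally, bootstrapping from the next state's estimate."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + gamma * V[s_next] * (not done)
            V[s] += alpha * (target - V[s])   # move V(s) toward the TD target
            s = s_next
    return V
```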

Difference between TD(0), MC and DP

image

Advantages

  • TD methods require only experience, not a model
  • TD, but not MC, methods can be fully incremental (we don't have to wait until the end of the episode to do an update)
  • Learn before final outcome: less memory and peak computation
  • Learn without the final outcome: from incomplete sequences
  • Both MC and TD converge but TD tends to be faster

Sarsa: estimation of Q

image

Bootstrapping off the next state-action pair rather than just the next state.

Sarsa is not a policy evaluation algorithm but an optimisation (control) algorithm: it doesn't try to find Q^pi, it tries to find Q*:

image

Expected Sarsa

Compute the expectation over actions explicitly rather than taking just one sample, to reduce variance in the updates:
image

Q-learning: off-policy TD control

In MC, we can do on-policy and off-policy learning, and we can do both in temporal-difference learning as well.
image

When we maximise over all actions, we are no longer taking an expectation under (or a sample from) the behaviour policy, so this is off-policy.
image
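A Q-learning sketch using the same hypothetical env interface as the TD(0) sketch; the epsilon-greedy behaviour policy collects data, while the max in the target makes the update off-policy:

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_actions, alpha=0.1, gamma=0.99, eps=0.1, n_episodes=500):
    Q = defaultdict(lambda: np.zeros(n_actions))
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # behaviour policy: epsilon-greedy over current estimates
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # target policy: greedy (the max over actions is what makes this off-policy)
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```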

Summary: a unified view

image

Kernels (Arthur Gretton)

Kernels

slides

Kernels and feature space

Why

  1. Kernel methods make it possible to separate the XOR cases that cannot be dealt with by any linear classifier;
  2. Kernel methods can control smoothness and avoid overfitting/underfitting.

What

Hilbert space: an inner product space that contains the limits of its Cauchy sequences.
Kernel: a dot product between features in a Hilbert space (you can use an NN to produce the features; we don't care how they are produced).

New kernels from old

  1. If a and b are kernels and c is a non-negative constant, then ca and a+b are both kernels (proof via positive definiteness);
  2. The difference of two kernels might not be a kernel (a squared norm cannot be negative, but the difference of two kernels would permit that);
  3. Products of kernels are kernels.
    1 and 3 --> polynomial kernels! (they expand into sums and products)

Exponential quadratic kernel

As long as the algorithm only uses dot products, and the dot product can be written in closed form, the features (the things you take the inner product of) can be infinite-dimensional (though obviously if you use an NN to construct the feature space, the feature space will be finite-dimensional).
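A small numpy sketch of the exponentiated quadratic (RBF) kernel, plus a numerical sanity check that sums and products of kernels remain positive semi-definite (the data and the second kernel are arbitrary illustrative choices):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): a closed-form dot product
    # between (implicitly) infinite-dimensional features
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(50, 3))
K_rbf = rbf_kernel(X, X)
K_poly = (X @ X.T + 1.0) ** 2                   # a polynomial kernel
for K in (K_rbf + K_poly, K_rbf * K_poly):      # sum and (elementwise) product
    assert np.linalg.eigvalsh(K).min() > -1e-8  # still positive semi-definite (numerically)
```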

Positive definite functions

Kernel validity: positive definite functions! Proof via the positivity of the norm in a Hilbert space.

This condition is sufficient and necessary: every positive definite function is an inner product in a unique Hilbert space.

The reproducing kernel Hilbert space (RKHS)

You can take a linear combination of infinitely many features, but then you need infinitely many coefficients. That's super inconvenient! Remember how we could avoid the infinity by taking dot products.
Solution: instead of using infinitely many coefficients, you can approximate the function with a finite number of kernel evaluations (say m of them). Now your infinite linear combination of coefficients and features becomes a finite sum of kernels. The choice of m depends on the problem you're solving -- typically in an SVM, m would be around 10% of the training points.
So in actual fact the kernel representation is a very simple function:
image

2 defining features of RKHS

  1. The reproducing property (aka kernel trick):

drawing

2. The feature map of every point is a function (for a kernel k(x, y), you can view it as a function of x, with the function determined by y).

Understanding smoothness of RKHS

Consider a Fourier Series:

drawing

We want to use Fourier series to approximate the "top hat" function:

drawing

drawing

(Here all the sine terms in the expansion of f(x) disappear, because the top-hat function is symmetric while sine is antisymmetric.)

The more cosine terms you add, the closer the approximation. The approximation is a sum of features -- Fourier features. This can be written in kernel form as:

drawing

Comparing the dot product of Fourier series vs. in Hilbert space:

drawing

We can see that the Hilbert space one divides by the kernel's Fourier coefficients, which penalises high-frequency terms and so enforces smoothness. This is because, if we write it as the dot product of the kernel (as a function) with itself, we have:

drawing

Using this principle we can determine that the top-hat function is not smooth enough (its Fourier coefficients in the numerator decay polynomially while the kernel's in the denominator decay exponentially, so the norm blows up), so it cannot serve as a kernel here.

In short, the core property of RKHS kernel is its smoothness enforced by small norm.

Reproducing property:

drawing

drawing

A unique property of RKHS

drawing

The L2 Hilbert space does not have this property. This is equivalent to the positive definite condition when defining RKHS.

MMD and GANs

slides

Maximum mean discrepancy

drawing

drawing

Illustration

drawing
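In code, the (unbiased) squared MMD between two samples is just averages of Gram matrix entries; a minimal sketch using the rbf_kernel helper from the kernels section:

```python
import numpy as np

def mmd2_unbiased(X, Y, kernel):
    """Unbiased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)],
    with the diagonal terms dropped from the within-sample averages."""
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    m, n = len(X), len(Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()
```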

MMD as an integral probability metric

drawing

A "well behaved" (smooth) function to maximise the mean discrepency. But when two distributions are not too different, we'd have:

drawing

Which is still not too bad because P and Q are very similar.

drawing

drawing

How does MMD "maximise the mean discrepancy"?

drawing

f is optimal (maximises the discrepancy) when it points in the same direction as the mean discrepancy (the difference between the feature means).

Divergences

Question: do we look at the difference or the ratio?

drawing

drawing

Two sample testing with MMD

Statistical testing language

drawing

Example: independent P and Q

Draw n=200 i.i.d. samples from P and Q

drawing

Repeat this process 300 times, get 150 MMDs, this is the histogram you get for MMD:

drawing

Looks like a Gaussian! This can be proven: when P and Q are different, the MMD estimate is asymptotically Gaussian. (P and Q can have the same mean and different variances; it will still look like a Gaussian. The variance of the Gaussian depends on the choice of kernel.)

drawing

Example: P and Q are the same

drawing

Side note: notice from above that the MMD estimate can be negative! Because we use an unbiased estimator of a quantity that is zero when the two distributions match, the estimate is distributed around zero and can therefore sometimes be negative.

drawing

It's going to be an infinite weighted sum of chi-squared variables (weighted by eigenvalues), centred to have zero mean.

Summary

drawing

How to get test threshold c?

drawing

Now we permute the dataset:

drawing

And this is how we get the threshold (cut off at a point where the MMD is "small enough"): it's a quantile of the null distribution.
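A sketch of that permutation procedure (the significance level and number of permutations are illustrative; mmd2_unbiased is the helper sketched earlier):

```python
import numpy as np

def mmd_permutation_test(X, Y, kernel, n_perms=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, kernel)
    Z = np.concatenate([X, Y])
    null_stats = []
    for _ in range(n_perms):
        rng.shuffle(Z)                      # permuting destroys any P-vs-Q difference
        null_stats.append(mmd2_unbiased(Z[:len(X)], Z[len(X):], kernel))
    c = np.quantile(null_stats, 1 - alpha)  # threshold = (1 - alpha) quantile of the null
    return observed, c, observed > c        # reject H0 if the observed MMD exceeds c
```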

Choosing kernels

maximising test power == minimising false negatives

drawing

we want the blue area to be as small as possible.

drawing

Computing the second term is extremely hard; it depends on the kernel in a complicated way (eigenvalues, chi-squared terms, and so on).
But, luckily:

drawing

So, to maximise test power, we just need to maximise the first term. (check out code)
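A rough sketch of using that criterion to pick a kernel bandwidth. The block-based variance estimate here is a crude stand-in for the tutorial's estimator, and the candidate bandwidths are arbitrary:

```python
import numpy as np

def power_criterion(X, Y, kernel, n_blocks=10):
    # crude estimate of MMD^2 / sigma: per-block MMDs give both a mean and a spread
    blocks = [mmd2_unbiased(Xb, Yb, kernel)
              for Xb, Yb in zip(np.array_split(X, n_blocks), np.array_split(Y, n_blocks))]
    return np.mean(blocks) / (np.std(blocks) + 1e-8)

# usage sketch (X, Y: samples from P and Q; rbf_kernel from the kernels section):
# best_sigma = max([0.1, 0.5, 1.0, 2.0, 5.0],
#                  key=lambda s: power_criterion(X, Y, lambda A, B: rbf_kernel(A, B, s)))
```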

GAN: real vs. fake

Distinguishing real from fake digits:

drawing

Our kernel has a bandwidth for each pixel, so that it ignores the unimportant pixels and focuses on the important ones.

drawing

So while humans are not able to distinguish the differences, statistical tests are very confident.

Tutorial take-away

See tutorial and solution

In traditional t-testing we have to take multiple samples to say with confidence that two statistics are different, but with MMD we can tell with only the two samples.

drawing

drawing

GAN: training

Recap: previously we talked about 2 different types of distance metrics, integral probability metrics (Wasserstein, MMD) and f-divergences (KL).

F-divergence as critic

Unhelpful: the following two cases both have D_JS = log 2

drawing

drawing

In practice, we:

  1. Use variational approximation to the critic, alternate generator and critic training
  2. Add instance noise to the reference and generator observations (or a gradient penalty for the variational critic)

Wasserstein as critic

Helpful: the distance decreases as we get closer to the real distribution

MMD as critic

Wide kernels are helpful, narrow kernels are unhelpful:

drawing

drawing

drawing

drawing

To use MMD as a GAN critic, we need features that are "image specific", because MMD itself doesn't really know what an image looks like -- so we use an NN!

drawing

How to train?
Reminder: the witness function is obtained by subtracting the kernel contribution of the red points from that of the blue points

drawing

So now when we push input through NN first, look at the witness function:

drawing

There are lots of distortions!

MMD: helpful vs. unhelpful example

If the MMD gives a powerful test (can accurately separate blue and red) it will not be a good critic. To see this, let's look at a 2D example.

drawing

The gradient when using kernels k(x, y):

drawing

kernel:

drawing

This is when MMD is helpful: there are gradients everywhere so we always know which way to optimise. Now we can have a look at the gradient when using CNN features, kernels k(h(x), h(y)):

drawing

Kernels:

drawing

This is not helpful! The gradients generally don't tell the red points where to move; the test power is too strong.

What to do with MMD? Regularisation!

We can add a gradient regulariser (an approximation of it because the exact form is too computationally expensive O(n^3)):

drawing

This is now a bit more helpful:

Early stages: diffused gradient, tells you where to move your distribution

drawing

drawing

Late stages: concentrated gradient

drawing

drawing

The regulariser encourages the kernels to take the shape of the target distribution, so that they can move around and capture it.

Observations on GAN

To have a successful GAN, you don't use an off-the-shelf critic function; you need a data-specific gradient estimator. This is true for the MMD critic we just developed, and it's also true for the traditional GAN with KL divergence (it's not the KL or the MMD that is doing the work, it's the data-dependent regularisation).

DO NOT TRAIN CRITIC TO CONVERGENCE

WGAN-GP

drawing

drawing

Evaluation of GAN: inception score

Based on the classification output p(y|x) of the inception model:

drawing

High when the predictive label distribution p(y|x) has low entropy and the marginal label distribution p(y) has high entropy. But it basically can't be used if your generated data doesn't match the classifier's labels; it relies on a trained classifier and can't be used on new categories.

Evaluation of GAN: Frechet inception distance (FID)

drawing

It won't go to zero even if you calculate the FID between two sets of real images. Experiments found that you need on the order of 100000 samples for the ordering of FIDs to faithfully reflect image quality.
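The FID formula itself is short; a numpy sketch assuming you already have Inception features for the two image sets (extracting those features is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2)) between fitted Gaussians."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    C1 = np.cov(feats_real, rowvar=False)
    C2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(C1 @ C2).real  # matrix square root; discard numerical imaginary part
    return float(((mu1 - mu2) ** 2).sum() + np.trace(C1 + C2 - 2 * covmean))
```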

Evaluation of GAN: kernel inception distance (KID)

drawing

Dependence testing

slides

To measure the dependence between text and images, we can take paired image-caption data as P and randomly re-paired image-caption data as Q, calculate MMD(P, Q), and use it as a dependence measure.

drawing

Next question -- what kernel should we use? We can use a kernel k on the image feature space and a kernel l on the sentence feature space; the dependence statistic is then built from the product of the two Gram matrices, of which we take the trace. If the images and captions are dependent, the trace will be large.

drawing

drawing

BUT -- this is not very interpretable, and taking the product of the kernels seems somewhat arbitrary.
Dependence is not correlation: two variables can be dependent on each other, but when the dependence is not linear the correlation can be very low.

drawing

drawing

But you can always find smooth transformations of the data under which the correlation between the two is high.

drawing

To this end, we can define the Hilbert-Schmidt independence criterion (HSIC):
take a range of smooth functions (all orthogonal to each other), compute the correlation under each, and add the squared correlations together.

drawing
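A biased HSIC estimator in numpy, following the "product of the two Gram matrices, then take the trace" idea above (K and L would be Gram matrices on image and caption features respectively; which kernels produce them is left open):

```python
import numpy as np

def hsic(K, L):
    """Biased HSIC estimate from Gram matrices K and L on the two domains."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```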

Optimal Transport (Marco Cuturi)

Optimal transport

(see lecture notes and tutorial)

Introduction

Two examples:

  • Moving earth & soldiers (Monge problem): what is the most efficient way to bring earth from one place to another?

drawing

How do we move the sand to fill the hole most efficiently? Characterise the work involved by the product of mass and distance moved:

drawing

drawing

drawing

Exact solution: linear programming

image

Sinkhorn Algorithm for Entropy Regularized Optimal Transport

image
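A minimal Sinkhorn sketch (the regularisation strength and iteration count are illustrative; in practice one usually works in log-space for numerical stability):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=200):
    """a, b: source/target histograms; C: cost matrix. Returns the transport plan."""
    K = np.exp(-C / eps)                  # Gibbs kernel from the cost
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                   # alternately rescale to match each marginal
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]    # P = diag(u) K diag(v)
```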

Dealing with the curse of dimensionality

Sliced Wasserstein distance:

drawing

PCA projection

drawing

k-dim (robust) projection

drawing

Applications: Average measures

k-means

You can view the Wasserstein distance as giving a k-means-like algorithm, when the support of X has many more points than the support of Y (minimising the distance to a measure supported on k points is a quantisation problem).

Wasserstein Barycenter:

drawing

drawing

Brain imaging

Mapping visual stimuli to different parts of the cortex using MEG. Different subjects will have slightly different responses -- to account for this spatial variation one can use the Wasserstein average.

KL, MMD vs. Wasserstein

There is no geometry in KL/MMD; they're better suited to high-dimensional data. For KL, when you have two Gaussians very close to each other, if the variance goes to zero the divergence can blow up to infinity.

Wasserstein vs. L2 averages

drawing

Domain adaptation

drawing

Learning with Wasserstein loss:

drawing

drawing

drawing

Sorting

In 1D, calculating the Wasserstein plan is equivalent to sorting (because the nth-ranked point in X always maps to the nth-ranked point in Y).

drawing
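In code, the 1D case is literally a sort, and the sliced Wasserstein distance from earlier is just this plus random projections (a sketch assuming equal sample sizes):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    # optimal 1D plan: match the sorted samples rank by rank
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p) ** (1 / p)

def sliced_wasserstein(X, Y, n_projections=100, p=2, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)          # random direction on the unit sphere
        total += wasserstein_1d(X @ theta, Y @ theta, p) ** p
    return (total / n_projections) ** (1 / p)
```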
