
mlss2019_notes's Introduction

Howdy

Notes for Machine Learning Summer School 2019 in Moscow, Russia. See issues.

These notes are selective and a little sloppy; feel free to check out the official git repository for MLSS2019: slides, tutorial

Schedule

w1_schedule

w2_schedule


mlss2019_notes's Issues

Bayesian deep learning (Yarin Gal)

Bayesian deep learning

slides
When deep learning is combined with probability theory we can capture uncertainty in a principled way; this is known as Bayesian deep learning.

Definition: we make some assumptions about how the data was generated, namely that there exists some underlying process behind the observations. Because this process is not explicit, we want to infer it.

Inference

drawing

How to do inference

drawing

We want to infer W (posterior). After some information theory magic, we can find the mean and variance:

drawing

Predictions

drawing
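The slides make predictions by averaging over samples of W from the (approximate) posterior. As a minimal sketch, here is how that looks with Monte Carlo dropout, one practical approximation used in this line of work (the PyTorch model and the helper name are illustrative assumptions, not code from the lecture):

```python
import torch

def mc_dropout_predict(model, x, n_samples=50):
    """Approximate predictive mean and variance by keeping dropout
    active at test time, i.e. sampling weights from the approximate posterior."""
    model.train()  # leave dropout layers "on"
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)
```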

Reinforcement learning (Shimon Whiteson)

Reinforcement learning

slides

How can an intelligent agent learn from experience how to make decisions
that maximise its utility in the face of uncertainty?
image
Unlike in unsupervised learning we do have feedback, but it comes in the form of reward. This feedback is, however, not as strong as in supervised learning.

Definition and intuition

In RL an agent tries to solve a control problem by directly interacting with an unfamiliar environment. The agent must learn by trial and error, trying out actions to learn about their consequences.

Difference to supervised learning

  1. Agent has partial control over what data it collects in the future;
  2. No right and wrong, just rewards for actions;
  3. The agent must learn on-line: must maximise performance during learning, not afterwards.

The reward must be quantifiable -- hence the reward design problem: what counts as a "good" or "bad" action?

When the reward design is poor, the agent can have undesirable behaviour.

K-armed bandit problem

Setting: you're an octopus, sitting before a slot machine (bandit) with many arms, where each arm has an unknown stochastic payoff. The goal is to maximise cumulative payoff over some period.

image
Formalising:
image
An infinite-horizon problem is, for example, Google's search engine. \gamma tells you how much you care about the future, i.e. how much you weigh instant gratification vs. long-term reward.

Exploration and exploitation

Explore: try out the arms in order to learn about them and improve the chances of getting future reward;

Exploit: focus on the most profitable arm and get the largest reward.

How do we balance exploration and exploitation?

  • Horizon is finite: exploration should decrease as the horizon gets closer
  • Horizon is infinite but \gamma < 1: exploration should decrease as the agent's uncertainty about the expected rewards goes down
  • Horizon is infinite and \gamma = 1: infinitely delayed splurge -- you have an infinite future ahead of you, so you always explore and delay gratification indefinitely.

To address this we have ---

Action value methods

image

A few different methods:

  1. epsilon-greedy:
    image

  2. softmax exploration: concentrate exploration on most promising arms.
    image

  3. Upper confidence bound:
    Neither epsilon-greedy nor softmax considers uncertainty in the action-value estimates, while the goal of exploration is to reduce uncertainty. So focus exploration on the most uncertain actions. This uses the principle of optimism in the face of uncertainty.

Focus on arms that are promising and uncertain for exploration.

image
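A minimal numpy sketch of the three action-selection rules above (constants such as the temperature and the UCB coefficient are illustrative choices, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, eps=0.1):
    # with probability eps explore uniformly, otherwise exploit the best arm
    if rng.random() < eps:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def softmax_action(Q, tau=1.0):
    # concentrate exploration on the most promising arms
    p = np.exp((Q - Q.max()) / tau)
    p /= p.sum()
    return int(rng.choice(len(Q), p=p))

def ucb_action(Q, counts, t, c=2.0):
    # optimism in the face of uncertainty: bonus for rarely tried arms
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(Q + bonus))
```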

Contextual bandit problem

image
So it's a traditional bandit problem, but instead of conditioning Q on the action alone, it conditions on the action and the state together.

Our action doesn't just affect reward, it also affects the state of the world:
image

The credit-assignment problem: suppose an agent takes a long sequence of actions, at the end of which it receives a large reward. How does it determine to what degree each action in that sequence is responsible for the resulting reward?

Markov decision processes: formalising the RL problem

image
stationary: the rules of physics aren't changing
stochastic: outcomes are somewhat random

image

MDP example: recycling robot

image

The Markov property

image

The current state is a sufficient statistic of the agent's history; conditioning actions on the rest of the history cannot possibly help. This restricts the search to reactive policies:

image

Is it Markov?

  1. Robot in a maze, state is wall or no wall on each of the 4 sides, action is up, down, left, right: no
  2. Chess, state is board position, action is legal move: yes (minus the special rules)

Return: value function and Bellman equation

We now need to reason about long term consequences, and we can do so by maximising the expected return, which is the sum over the rewards received.
image

Value function: value functions are the primary tool for reasoning about future reward. The value of a state is the expected return under the policy.

image
Note that in the action-value function the first action can deviate from the policy (it doesn't have to be the policy's action).

Bellman equation: we are not sure what the next state will be, so we take the expectation over actions and next states, expressing a state's value in terms of its successors' (estimated) values. This is commonly known as bootstrapping:
image

The difference between the two equations is whether we bootstrap over state or state-action pair.

To gain some intuitions:
image

Look at the LHS of the tree, corresponding to the first equation:

  • At the first extension of the tree, we look at all the actions -- the first summation
  • At the second extension of the tree, we look at all the stochastic outcomes -- the second summation

Optimal value functions

image

The Bellman optimality equations express this recursively: we replace the Bellman equation's expectation over actions with a maximisation wrt action:

image

A recap -- P: transition function, R: reward function

image

Planning with MDP

MDPs give us a formal model of sequential decision making. Given the optimal value function, computing an optimal policy is straightforward. But how can we find V* or Q*?

Algorithms for MDP planning compute the optimal value function given a complete model of the MDP. Given a model, V* is usually sufficient.

Dynamic programming approach

image

We start off with an arbitrary policy (pi), then compute the true value function of that policy (V). We then use the value function to figure out an incremental improvement to the policy. Repeat the process until you reach the optimal point.

We can use the Bellman equation to exploit the relationship between states (instead of estimating each state independently). The initial value function is chosen arbitrarily.

Policy evaluation update rule

image

Each state's estimate is backed up from its successor states. Apply the update to every state in each sweep of the state space. Repeated over many sweeps, it eventually converges to the fixed point where V_k = V^pi (we get the true value of the arbitrary policy).

image

image
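As a sketch, the policy evaluation sweep can be written for a small tabular MDP. The array layout (P[s, a, s'] transition probabilities, R[s, a] expected rewards, pi[s, a] action probabilities) is an assumption made for illustration:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-6):
    """Iterate the Bellman backup over all states until V_k converges to V^pi."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = (pi * Q).sum(axis=1)   # expectation over the policy's actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```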

More intuitively:
image

We start from a random point and first perform policy evaluation, which gets us to where the value is exactly that of the policy (V = V^pi). We then perform policy improvement, at which point the value function no longer matches the new policy's true value function. Keep iterating and you converge to the optimal value.

Guaranteed convergence: how? (i.e. how do we know these two lines intersect, so that you can always get to an optimal point?) Here we are doing closed-loop updates, which means the action always has the opportunity to condition on the state.

"A counter example": say you take two actions, left and right and you have two timesteps. if these are the true values corresponding to action:

  • LL yields return of 5
  • LR yields return of 0
  • RL yields return of 0
  • RR yields return of 10

So if you start from LL, will you be stuck in a local maximum?

No, precisely because of the closed-loop updates -- we consider the states reached by every action, so even when the first step taken is left, the Bellman equation also considers the state you would reach by stepping right. Then, when we try to find the optimal action, it figures out that taking two steps to the right gives the most value.

Value iteration

We do not always have to wait for policy evaluation to converge before doing policy improvement -- for example, we can do just a few evaluation sweeps (even a single one) before each improvement step.

image

Here we take the Bellman optimality equation and turn it into an update rule.
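A matching sketch of value iteration (same illustrative P and R arrays as in the policy evaluation sketch):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Bellman optimality backup: max over actions instead of an expectation."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = np.max(R + gamma * P @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = np.argmax(R + gamma * P @ V, axis=1)  # greedy policy w.r.t. V*
    return V, policy
```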

MC methods

MC provides one way to perform reinforcement learning: finding optimal policies without an a priori model of the MDP.
MC for RL learns from complete sample returns in episodic tasks: it uses value functions but not Bellman equations.

image

image

So this is not a tree any more; we just keep performing rollouts and calculating the return (this is like a depth-first search, whereas the Bellman backup is like a breadth-first search).

image

image

We just carry out the entire policy in the real world.
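A sketch of first-visit Monte Carlo prediction from complete rollouts (the run_episode interface, returning (state, reward) pairs, is a hypothetical convention for illustration):

```python
from collections import defaultdict

def mc_prediction(run_episode, policy, gamma=1.0, n_episodes=1000):
    """Estimate V^pi by averaging complete sample returns; no Bellman equation used."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = run_episode(policy)          # [(state, reward), ...] for one rollout
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):  # accumulate the return backwards
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:              # only the first visit contributes
                returns[s].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}
```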

On-policy MC control

image
Caveat: converges to the best epsilon-soft policy rather than the best overall policy.

Off-policy MC control

To avoid the caveat, we can do off-policy MC, which allows the estimation (target) policy to differ from the behaviour policy. This is done using importance sampling.

image
image
image

The variance depends on the difference between the target policy and the behaviour policy; if the two policies differ too much, the quality of the estimate is very limited.

Temporal-difference methods

TD(0): estimation of V

image
image

Pseudo code:
image
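The TD(0) update in code (the env interface returning (next_state, reward, done) is a simplifying assumption, not a particular library's API):

```python
from collections import defaultdict

def td0(env, policy, alpha=0.1, gamma=0.99, n_episodes=500):
    """Estimate V^pi incrementally, bootstrapping from the next state's estimate."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + gamma * V[s_next] * (not done)
            V[s] += alpha * (target - V[s])   # move V(s) toward the TD target
            s = s_next
    return V
```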

Difference between TD(0), MC and DP

image

Advantages

  • TD methods require only experience, not a model
  • TD, but not MC, methods can be fully incremental (we don't have to wait until the end of the episode to do an update)
  • Learn before final outcome: less memory and peak computation
  • Learn without the final outcome: from incomplete sequences
  • Both MC and TD converge but TD tends to be faster

Sarsa: estimation of Q

image

Bootstrapping off the next state-action pair rather than just the next state.

Sarsa is not a policy evaluation algorithm but an optimisation (control) algorithm: it doesn't try to find Q^pi, it tries to find Q*:

image

Expected Sarsa

Compute the expectation over actions explicitly rather than taking just one sample, to reduce variance in the updates:
image

Q-learning: off-policy TD control

In MC, we can do on-policy and off-policy learning, and we can do both in temporal-difference learning as well.
image

When we maximise over all actions, we are no longer taking an expectation under (or a sample from) the behaviour policy, so this is off-policy.
image
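A Q-learning sketch using the same hypothetical env interface as the TD(0) sketch; the epsilon-greedy behaviour policy collects data, while the max in the target makes the update off-policy:

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_actions, alpha=0.1, gamma=0.99, eps=0.1, n_episodes=500):
    Q = defaultdict(lambda: np.zeros(n_actions))
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # behaviour policy: epsilon-greedy over current estimates
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # target policy: greedy (the max over actions is what makes this off-policy)
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```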

Summary: a unified view

image

Kernels (Arthur Gretton)

Kernels

slides

Kernels and feature space

Why

  1. Kernel methods make it possible to separate the XOR cases that cannot be dealt with by any linear classifier;
  2. Kernel methods can control smoothness and avoid overfitting/underfitting.

What

Hilbert space: an inner product space that contains the limits of its Cauchy sequences.
Kernel: a dot product between features in a Hilbert space (you can use an NN to produce the features; we don't care how they are produced).

New kernels from old

  1. If a and b are kernels and c is a non-negative constant, then ca and a+b are both kernels (proof via positive definiteness);
  2. The difference of two kernels might not be a kernel (a squared norm cannot be negative, but the difference of two kernels would permit that);
  3. Products of kernels are kernels.
    1 and 3 --> polynomial kernels! (they expand into sums and products)

Exponential quadratic kernel

As long as the algorithm only uses dot products, and the dot product can be written in closed form, the features (the things you take the inner product of) can be infinite-dimensional (though obviously if you use an NN to construct the feature space, the feature space will be finite-dimensional).
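A small numpy sketch of the exponentiated quadratic (RBF) kernel, plus a numerical sanity check that sums and products of kernels remain positive semi-definite (the data and the second kernel are arbitrary illustrative choices):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)): a closed-form dot product
    # between (implicitly) infinite-dimensional features
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(50, 3))
K_rbf = rbf_kernel(X, X)
K_poly = (X @ X.T + 1.0) ** 2                   # a polynomial kernel
for K in (K_rbf + K_poly, K_rbf * K_poly):      # sum and (elementwise) product
    assert np.linalg.eigvalsh(K).min() > -1e-8  # still positive semi-definite (numerically)
```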

Positive definite functions

Kernel validity: positive definite functions! Proof via the positivity of the norm in a Hilbert space.

This condition is sufficient and necessary: every positive definite function is an inner product in a unique Hilbert space.

The reproducing kernel Hilbert space (RKHS)

You can take a linear combination of infinitely many features, but then you need infinitely many coefficients. That's super inconvenient! Remember how we could avoid the infinity by taking dot products.
Solution: instead of using infinitely many coefficients, you can approximate the function with a finite number of kernel evaluations (say m of them). Now your infinite linear combination of coefficients and features becomes a finite sum of kernels. The choice of m depends on the problem you're solving -- typically in an SVM, m would be around 10% of the training points.
So in actual fact the kernel representation is a very simple function:
image

2 defining features of RKHS

  1. The reproducing property (aka kernel trick):

drawing

2. The feature map of every point is a function (for a kernel k(x, y), you can view it as a function of x, with the function determined by y).

Understanding smoothness of RKHS

Consider a Fourier Series:

drawing

We want to use Fourier series to approximate the "top hat" function:

drawing

drawing

(Here all the sine terms in the expansion of f(x) disappear, because the top-hat function is symmetric while sine is antisymmetric.)

The more cosine terms you add, the closer the approximation. The approximation is a sum of features -- Fourier features. This can be written in kernel form as:

drawing

Comparing the dot product of Fourier series vs. in Hilbert space:

drawing

We can see that the Hilbert space one divides by the kernel's Fourier coefficients, which penalises high-frequency terms and so enforces smoothness. This is because, if we write it as the dot product of the kernel (as a function) with itself, we have:

drawing

Using this principle we can determine that the top-hat function is not smooth enough (its Fourier coefficients in the numerator decay polynomially while the kernel's in the denominator decay exponentially, so the norm blows up), so it cannot serve as a kernel here.

In short, the core property of RKHS kernel is its smoothness enforced by small norm.

Reproducing property:

drawing

drawing

A unique property of RKHS

drawing

The L2 Hilbert space does not have this property. This is equivalent to the positive definite condition when defining RKHS.

MMD and GANs

slides

Maximum mean discrepancy

drawing

drawing

Illustration

drawing
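In code, the (unbiased) squared MMD between two samples is just averages of Gram matrix entries; a minimal sketch using the rbf_kernel helper from the kernels section:

```python
import numpy as np

def mmd2_unbiased(X, Y, kernel):
    """Unbiased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)],
    with the diagonal terms dropped from the within-sample averages."""
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    m, n = len(X), len(Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()
```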

MMD as an integral probability metric

drawing

A "well behaved" (smooth) function to maximise the mean discrepency. But when two distributions are not too different, we'd have:

drawing

Which is still not too bad because P and Q are very similar.

drawing

drawing

How does MMD "maximise the mean discrepancy"?

drawing

f is optimal (maximises the discrepancy) when it points in the same direction as the mean discrepancy (the difference between the feature means).

Divergences

Question: do we look at the difference or the ratio?

drawing

drawing

Two sample testing with MMD

Statistical testing language

drawing

Example: independent P and Q

Draw n=200 i.i.d. samples from P and Q

drawing

Repeat this process 300 times, get 150 MMDs, this is the histogram you get for MMD:

drawing

Looks like a Gaussian! This can be proven: when P and Q are different, the MMD estimate is asymptotically Gaussian. (P and Q can have the same mean and different variances; it will still look like a Gaussian. The variance of the Gaussian depends on the choice of kernel.)

drawing

Example: P and Q are the same

drawing

Side note: notice from above that the MMD estimate can be negative! Because we use an unbiased estimator of a quantity that is zero when the two distributions match, the estimate is distributed around zero and can therefore sometimes be negative.

drawing

It's going to be an infinite weighted sum of chi-squared variables (weighted by eigenvalues), centred to have zero mean.

Summary

drawing

How to get test threshold c?

drawing

Now we permute the dataset:

drawing

And this is how we get the threshold (cut off at a point where the MMD is "small enough"): it's a quantile of the null distribution.
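A sketch of that permutation procedure (the significance level and number of permutations are illustrative; mmd2_unbiased is the helper sketched earlier):

```python
import numpy as np

def mmd_permutation_test(X, Y, kernel, n_perms=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, kernel)
    Z = np.concatenate([X, Y])
    null_stats = []
    for _ in range(n_perms):
        rng.shuffle(Z)                      # permuting destroys any P-vs-Q difference
        null_stats.append(mmd2_unbiased(Z[:len(X)], Z[len(X):], kernel))
    c = np.quantile(null_stats, 1 - alpha)  # threshold = (1 - alpha) quantile of the null
    return observed, c, observed > c        # reject H0 if the observed MMD exceeds c
```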

Choosing kernels

maximising test power == minimising false negatives

drawing

we want the blue area to be as small as possible.

drawing

Computing the second term is extremely hard; it depends on the kernel in a complicated way (eigenvalues, chi-squared terms, and so on).
But, luckily:

drawing

So, to maximise test power, we just need to maximise the first term. (check out code)
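A rough sketch of using that criterion to pick a kernel bandwidth. The block-based variance estimate here is a crude stand-in for the tutorial's estimator, and the candidate bandwidths are arbitrary:

```python
import numpy as np

def power_criterion(X, Y, kernel, n_blocks=10):
    # crude estimate of MMD^2 / sigma: per-block MMDs give both a mean and a spread
    blocks = [mmd2_unbiased(Xb, Yb, kernel)
              for Xb, Yb in zip(np.array_split(X, n_blocks), np.array_split(Y, n_blocks))]
    return np.mean(blocks) / (np.std(blocks) + 1e-8)

# usage sketch (X, Y: samples from P and Q; rbf_kernel from the kernels section):
# best_sigma = max([0.1, 0.5, 1.0, 2.0, 5.0],
#                  key=lambda s: power_criterion(X, Y, lambda A, B: rbf_kernel(A, B, s)))
```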

GAN: real vs. fake

Distinguishing real from fake digits:

drawing

Our kernel has a bandwidth for each pixel, so that it ignores the unimportant pixels and focuses on the important ones.

drawing

So while humans are not able to distinguish the differences, statistical tests are very confident.

Tutorial take-away

See tutorial and solution

In traditional t-testing we have to take multiple samples to say with confidence that two statistics are different, but with MMD we can tell with only the two samples.

drawing

drawing

GAN: training

Recap: previously we talked about 2 different types of distance metrics, integral probability metrics (Wasserstein, MMD) and f-divergences (KL).

F-divergence as critic

Unhelpful: the following two cases both have D_JS = log 2

drawing

drawing

In practice, we:

  1. Use variational approximation to the critic, alternate generator and critic training
  2. Add instance noise to the reference and generator observations (or a gradient penalty for the variational critic)

Wasserstein as critic

Helpful: the distance decreases as we get closer to the real distribution

MMD as critic

Wide kernels are helpful, narrow kernels are unhelpful:

drawing

drawing

drawing

drawing

To use MMD as a GAN critic, we need features that are "image specific", because MMD itself doesn't really know what an image looks like -- so we use an NN!

drawing

How to train?
Reminder: the witness function is obtained by subtracting the kernel contribution of the red points from that of the blue points

drawing

So now when we push input through NN first, look at the witness function:

drawing

There are lots of distortions!

MMD: helpful vs. unhelpful example

If the MMD gives a powerful test (can accurately separate blue and red) it will not be a good critic. To see this, let's look at a 2D example.

drawing

The gradient when using kernels k(x, y):

drawing

kernel:

drawing

This is when MMD is helpful: there are gradients everywhere so we always know which way to optimise. Now we can have a look at the gradient when using CNN features, kernels k(h(x), h(y)):

drawing

Kernels:

drawing

This is not helpful! The gradients generally don't tell the red points where to move; the test power is too strong.

What to do with MMD? Regularisation!

We can add a gradient regulariser (an approximation of it because the exact form is too computationally expensive O(n^3)):

drawing

This is now a bit more helpful:

Early stages: diffused gradient, tells you where to move your distribution

drawing

drawing

Late stages: concentrated gradient

drawing

drawing

The regulariser encourages the kernels to take the shape of the target distribution, so that they can move around and capture it.

Observations on GAN

To have a successful GAN, you don't use an off-the-shelf critic function; you need a data-specific gradient estimator. This is true for the MMD critic we just developed, and it's also true for the traditional GAN with KL divergence (it's not the KL or the MMD that is doing the work, it's the data-dependent regularisation).

DO NOT TRAIN CRITIC TO CONVERGENCE

WGAN-GP

drawing

drawing

Evaluation of GAN: inception score

Based on the classification output p(y|x) of the inception model:

drawing

High when the predictive label distribution p(y|x) has low entropy and the marginal label distribution p(y) has high entropy. But it basically can't be used if your generated data doesn't match the classifier's labels; it relies on a trained classifier and can't be used on new categories.

Evaluation of GAN: Frechet inception distance (FID)

drawing

It won't go to zero even if you calculate the FID between two sets of real images. Experiments found that you need on the order of 100000 samples for the ordering of FIDs to faithfully reflect image quality.
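The FID formula itself is short; a numpy sketch assuming you already have Inception features for the two image sets (extracting those features is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2)) between fitted Gaussians."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    C1 = np.cov(feats_real, rowvar=False)
    C2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(C1 @ C2).real  # matrix square root; discard numerical imaginary part
    return float(((mu1 - mu2) ** 2).sum() + np.trace(C1 + C2 - 2 * covmean))
```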

Evaluation of GAN: kernel inception distance (KID)

drawing

Dependence testing

slides

To measure the dependence between text and images, we can take paired image-caption data as P and randomly re-paired image-caption data as Q, calculate MMD(P, Q), and use it as a dependence measure.

drawing

Next question -- what kernel should we use? We can use a kernel k on the image feature space and a kernel l on the sentence feature space; the dependence statistic is then built from the product of the two Gram matrices, of which we take the trace. If the images and captions are dependent, the trace will be large.

drawing

drawing

BUT -- this is not very interpretable, and taking the product of the kernels seems somewhat arbitrary.
Dependence is not correlation: two variables can be dependent on each other, but when the dependence is not linear the correlation can be very low.

drawing

drawing

But you can always find smooth transformations of the data under which the correlation between the two is high.

drawing

To this end, we can define the Hilbert-Schmidt independence criterion (HSIC):
take a range of smooth functions (all orthogonal to each other), compute the correlation under each, and add the squared correlations together.

drawing
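A biased HSIC estimator in numpy, following the "product of the two Gram matrices, then take the trace" idea above (K and L would be Gram matrices on image and caption features respectively; which kernels produce them is left open):

```python
import numpy as np

def hsic(K, L):
    """Biased HSIC estimate from Gram matrices K and L on the two domains."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```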

Optimal Transport (Marco Cuturi)

Optimal transport

(see lecture notes and tutorial)

Introduction

Two examples:

  • Moving earth & soldiers (Monge problem): what is the most efficient way to bring earth from one place to another?

drawing

How do we move the sand to fill the hole most efficiently? Characterise the work involved by the product of mass and distance moved:

drawing

drawing

drawing

Exact solution: linear programming

image

Sinkhorn Algorithm for Entropy Regularized Optimal Transport

image
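A minimal Sinkhorn sketch (the regularisation strength and iteration count are illustrative; in practice one usually works in log-space for numerical stability):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=200):
    """a, b: source/target histograms; C: cost matrix. Returns the transport plan."""
    K = np.exp(-C / eps)                  # Gibbs kernel from the cost
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                   # alternately rescale to match each marginal
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]    # P = diag(u) K diag(v)
```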

Dealing with the curse of dimensionality

Sliced Wasserstein distance:

drawing

PCA projection

drawing

k-dim (robust) projection

drawing

Applications: Average measures

k-means

You can view the Wasserstein distance as giving a k-means-like algorithm, when the support of X has many more points than the support of Y (minimising the distance to a measure supported on k points is a quantisation problem).

Wasserstein Barycenter:

drawing

drawing

Brain imaging

Mapping visual stimuli to different parts of the cortex using MEG. Different subjects will have slightly different responses -- to account for this spatial variation one can use the Wasserstein average.

KL, MMD vs. Wasserstein

There is no geometry in KL/MMD; they're better suited to high-dimensional data. For KL, when you have two Gaussians very close to each other, if the variance goes to zero the divergence can blow up to infinity.

Wasserstein vs. L2 averages

drawing

Domain adaptation

drawing

Learning with Wasserstein loss:

drawing

drawing

drawing

Sorting

In 1D, calculating the Wasserstein plan is equivalent to sorting (because the nth-ranked point in X always maps to the nth-ranked point in Y).

drawing
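In code, the 1D case is literally a sort, and the sliced Wasserstein distance from earlier is just this plus random projections (a sketch assuming equal sample sizes):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    # optimal 1D plan: match the sorted samples rank by rank
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p) ** (1 / p)

def sliced_wasserstein(X, Y, n_projections=100, p=2, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)          # random direction on the unit sphere
        total += wasserstein_1d(X @ theta, Y @ theta, p) ** p
    return (total / n_projections) ** (1 / p)
```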
