
gan's Introduction

Premise of GANs

A GAN takes a random sample from a latent or prior distribution as input and maps it to the data space. The task of training is to learn a deterministic function that can efficiently capture the dependencies and patterns in the data so that the mapped point resembles a sample generated from the data distribution.

Example:

I have generated 300 samples from an isotropic bivariate Gaussian distribution.

[Figure: 300 samples from an isotropic bivariate Gaussian]

When passed through a simple deterministic function (sketched below), the points form a ring. This demonstrates that a high-capacity function may be able to model the data distribution of high-dimensional data such as images. Neural networks are our best bet, as they are universal function approximators; hence, deep neural networks are used to model the data distribution of images. Unlike MLE or KDE, this is implicit density estimation.

[Figure: the same samples after the mapping, forming a ring]
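The text doesn't show the mapping function used for the figure, so the minimal numpy sketch below assumes one common illustrative choice, f(z) = z/10 + z/||z||, which pushes the Gaussian cloud onto a ring:

```python
import numpy as np

rng = np.random.default_rng(0)

# 300 samples from an isotropic bivariate Gaussian (zero mean, identity covariance)
z = rng.standard_normal((300, 2))

# Deterministic map that pushes the cloud onto a ring.
# f(z) = z/10 + z/||z|| is an assumed illustrative choice, not the author's function.
norms = np.linalg.norm(z, axis=1, keepdims=True)
x = z / 10 + z / norms

print(np.linalg.norm(x, axis=1).mean())  # ~1.1: points concentrate near a unit-radius ring
```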

Probability Review

[Figure: frequency table and joint distribution of two discrete random variables]

Conditional Distribution

In the above table, fix the value of one random variable, say X = x_1; the distribution of Y when X = x_1 is called the conditional distribution, p(y | x = x_1). The conditional expectation is the expectation of the conditional distribution.

In the above table, the conditional probability of Y = y_1 given X = x_1 is 2/17.

Marginal Distribution

Integrate (or, in the discrete case, sum) the joint distribution over one variable to get the marginal distribution of the other: p(x) = Σ_y p(x, y).

In the above table, the marginal probability of X = x_1 according to the above formula is 2/50 + 10/50 + 5/50 = 17/50.
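A minimal numpy sketch that reproduces both numbers above (the marginal 17/50 and the conditional 2/17). Only the x_1 row (2, 10, 5), the (x_2, y_3) cell, and the grand total of 50 are given by the table; the remaining x_2 cells are assumed values chosen so the counts sum to 50:

```python
import numpy as np

# Frequency table of X (rows) vs Y (columns). The text gives the x_1 row
# (2, 10, 5), the (x_2, y_3) cell (2), and the grand total (50); the other
# x_2 cells are assumed values chosen so that everything sums to 50.
freq = np.array([[2, 10, 5],     # x_1
                 [16, 15, 2]])   # x_2 (first two cells assumed)
joint = freq / freq.sum()        # joint distribution P(X, Y)

# Marginal distribution: sum the joint over Y.
p_x = joint.sum(axis=1)
print(p_x[0])                    # P(X = x_1) = 17/50 = 0.34

# Conditional distribution: renormalize one row of the joint.
p_y_given_x1 = joint[0] / p_x[0]
print(p_y_given_x1[0])           # P(Y = y_1 | X = x_1) = 2/17 ~ 0.118
```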

Joint Distribution

A joint distribution, a.k.a. the data distribution, captures the joint probabilities between random variables. In the above table, the joint probability P(X = x_2, Y = y_3) is 2/50. This is what a GAN tries to model from the sample data.

Consider images of size 28 x 28. Each pixel is a random variable that can take any value from 0 to 255. Hence, we have 784 random variables in total. A GAN tries to model the dependencies between the pixels.

Bayes' Theorem

From the above table: P(Y = y_1 | X = x_1) = P(Y = y_1 & X = x_1) / P(X = x_1) = (2/50)/(17/50) = 2/17

Entropy

Entropy measures the degree of uncertainty in the outcome of a trial conducted according to a distribution p(x).

H(p) = -Σ_x p(x) log p(x)

The entropy of an unbiased coin is higher than that of a biased coin, and the gap widens as the biased coin's probabilities become more polarized.
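A minimal sketch of this comparison. The text doesn't state the biased coin's probabilities; P(heads) = 0.95 is assumed here because it matches the 2.19 and 1.19 figures used in the sections below:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

fair = [0.5, 0.5]
biased = [0.95, 0.05]             # assumed bias; matches the 2.19/1.19 figures below

print(entropy(fair))              # 1.0 bit
print(entropy(biased))            # ~0.29 bits, and it shrinks as the coin polarizes
```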

Cross Entropy

Cross entropy measures the degree of uncertainty of a trial when you score outcomes according to an assumed distribution q(x), while in truth they are drawn from p(x).

H(p, q) = -Σ_x p(x) log q(x)

Cross entropy is higher when the trial is actually conducted with the unbiased coin but you believe it follows the biased coin's distribution.
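To make that concrete, a minimal sketch using the same assumed biased coin (P(heads) = 0.95) as above:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i): outcomes follow p but are scored under q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

fair, biased = [0.5, 0.5], [0.95, 0.05]   # same assumed biased coin as above

print(cross_entropy(fair, biased))  # ~2.19 bits: fair trial, biased-coin beliefs
print(cross_entropy(fair, fair))    # 1.0 bit: equals H(fair) when beliefs match truth
```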

KL Divergence

KL divergence is the difference between the cross entropy and the entropy of the true distribution: D(p||q) = H(p, q) - H(p). It is zero exactly when the two distributions are equal. Hence, to approximate or model one probability distribution with another, minimizing the KL divergence between them makes them similar.

D(fair || biased) = H(fair, biased) - H(fair) = 2.19 - 1 = 1.19
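A minimal sketch verifying that figure, again assuming the 0.95-biased coin:

```python
import numpy as np

def kl(p, q):
    """D(p || q) = H(p, q) - H(p) = sum_i p_i * log2(p_i / q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

fair, biased = [0.5, 0.5], [0.95, 0.05]

print(kl(fair, biased))   # ~1.19 bits, matching 2.19 - 1
print(kl(fair, fair))     # 0.0: the divergence vanishes when the distributions match
```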

JS Divergence

Because the KL divergence equation divides by the probability of an outcome, it can blow up: if q_k is zero while p_k is not, the KL divergence becomes infinite. Moreover, KL divergence is not symmetric, i.e., D(p||q) is not equal to D(q||p), which makes it unsuitable as a distance metric. To avoid both problems, JS divergence measures divergence against the average distribution m = (p + q)/2: JSD(p, q) = 0.5 D(p||m) + 0.5 D(q||m).
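A minimal sketch showing both properties: the JS divergence stays finite even when one distribution has a zero where the other doesn't, and it is symmetric:

```python
import numpy as np

def kl(p, q):
    """D(p || q) in bits; finite only when q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def js(p, q):
    """JSD(p, q) = 0.5 * D(p || m) + 0.5 * D(q || m), with m = (p + q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [1.0, 0.0], [0.0, 1.0]   # D(p || q) would be infinite: q is zero where p isn't
print(js(p, q))                 # 1.0 bit: finite, and js(p, q) == js(q, p)
```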

GANs

As aforementioned, GANs take a random sample from the latent space as input and map it to the data space. In DCGANs, the mapping function is a deep neural network, which is differentiable and parameterized by its weights; it is called the Generator (G). The Discriminator (D) is also a deep neural network: it takes a sample in the data space and maps it to a probability, namely the probability that the sample was drawn from the real data distribution.

P_z: prior / latent distribution. Typically, this space is much smaller than the data space.

P_g: distribution of the data produced by the generator

P_r: real data distribution

Let D_m = (d_1, d_2, d_3, ..., d_m) be the data sampled according to P_r.

Let G_n = (g_{m+1}, g_{m+2}, ..., g_n) be the data generated according to P_g.

Train D to minimize the empirical loss (equation 1): min_D -(1/m) Σ_{i=1}^{m} log D(d_i) - (1/(n-m)) Σ_{j=m+1}^{n} log(1 - D(g_j)). I am writing the objectives as minimizations, as most deep learning frameworks only implement minimization of a function.

Fix the D network, and train G to maximize the loss of D over G_n (equation 2): min_G (1/(n-m)) Σ_{j=m+1}^{n} log(1 - D(g_j)).

As stated in the original paper, early in training the above loss doesn't offer enough gradient to update the parameters of G: initially P_g is far from P_r, so D can easily classify generated images and log(1 - D(g_j)) saturates. Hence, we switch the labels and instead minimize -(1/(n-m)) Σ_{j=m+1}^{n} log D(g_j), the non-saturating generator loss.
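Below is a minimal, self-contained PyTorch sketch of this alternating training procedure. It is not the author's DCGAN: tiny MLPs and the ring data from the first example are assumed, but the loss structure of equations 1 and 2, with the switched-label generator loss, is the same.

```python
# Minimal GAN training-loop sketch (assumed setup, not the author's code).
# G maps 2-D latent noise to 2-D points; the "real" data is the ring
# distribution from the first example.
import torch
import torch.nn as nn

def sample_real(m):
    z = torch.randn(m, 2)
    return z / 10 + z / z.norm(dim=1, keepdim=True)   # ring-shaped "real" data

G = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
m = 64

for step in range(5000):
    # --- Train D: real -> 1, generated -> 0 (equation 1) ---
    real, fake = sample_real(m), G(torch.randn(m, 2)).detach()
    loss_d = bce(D(real), torch.ones(m, 1)) + bce(D(fake), torch.zeros(m, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Train G with switched labels: minimize -log D(G(z)) ---
    fake = G(torch.randn(m, 2))
    loss_g = bce(D(fake), torch.ones(m, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```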

Optimization and Theoretical Results

Optimal Discriminator for fixed `G`

Equation 1 is an empirical loss function. Its risk, i.e., the loss over the whole population (every possible image), can be written as (equation 3): L(D) = -E_{x~P_r}[log D(x)] - E_{x~P_g}[log(1 - D(x))] = -∫ (p_r(x) log D(x) + p_g(x) log(1 - D(x))) dx.

Minimizing the integrand pointwise, the optimal output for each x is y_hat* = D*(x) = p_r(x) / (p_r(x) + p_g(x)); when y_hat = y_hat*, the discriminator's loss is at its minimum. At the end of training, if G does a good job of approximating P_r, then P_g ≈ P_r and D*(x) = 1/2.

Substituting D*(x) = 1/2 into equation 3 gives the loss of the optimal discriminator at the end of training (equation 4): L(D*) = -log(1/2) - log(1/2) = log 4 ≈ 1.39.

This is the cost obtained when both D and G are perfectly optimized.

From the JS divergence equation, the JS divergence between P_g and P_r is JSD(P_r, P_g) = 0.5 KL(P_r || M) + 0.5 KL(P_g || M), where M = (P_r + P_g)/2.

From equation 3, substituting D*(x) = p_r(x) / (p_r(x) + p_g(x)) and rearranging gives L(D*) = log 4 - 2 · JSD(P_r, P_g).

The JS divergence is non-negative. Hence, for the above expression to equal the value calculated in equation 4 (log 4), the JS divergence must be 0, i.e., P_g = P_r. To conclude, when D is at its best, G needs to make P_g ≈ P_r to reach the global optimum.
