In Bayesian probability theory, if the posterior distributions p(θ | X) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. Here we shall look at the case of bernoulli distribution and see how we can use a beta prior in this case, with quick references to other use cases.
You will be able to:
- Understand and describe what a conjugate prior is
- Justify the use of a beta prior distribution for bernoulli experiments like a simple coin toss
- Get an overview of a number of different data distributions and their conjugate priors
So now we know that based on Bayes theorem, we can rewrite
Let's think again of our coin toss example, where we want to learn weather coin is fair or not.
The distribution associated with
$P(\theta)$ is known as conjugate prior if distribution of$P(\theta)$ has the same form as of$P(\theta|X)$
For bernoulli distribution, the conjugate prior is given by Beta distribution which takes two parameters $(\alphaand\beta)$ and the probability density function of the Beta distribution is given as:
Here
For different values of $\alphaand\beta$, we get different shapes of beta distribution each of which peaks at
To check if and\beta$ and find where
Now We can simplify the above equation.
- We can get rid of the denominator as it is only for normalization purpose.
- We take the log sum instead of product to avoid the number overflow problem.
As we saw in the MLE lesson, now we take the derivative of the above equation with respect to
Since,$ n^1 = 7andn^0=3$ and if we use
The following diagram (from John D. Cook's website) summarizes conjugate prior relationships for a number of common sampling distributions.
Arrows point from a sampling distribution to its conjugate prior distribution. The symbol near the arrow indicates which parameter the prior is unknown. These relationships depend critically on choice of parameterization, some of which are uncommon. Following quick reference list highlights PDFs and PMF functions for some major distributions (you dont have to understand all of these at this stage) along with their parameters and posterior distributions.
Let C(n, k) denote the binomial coefficient(n, k).
-
The Bernoulli distribution has probability of success
$p$ -
The beta distribution has PDF:
$f(p) = Γ(α + β) pα-1(1-p)β-1 / (Γ(α) Γ(β))$ -
The geometric distribution has only one parameter, p, and has PMF:
$f(x) = p (1-p)x$ -
The binomial distribution with parameters n and p has PMF:
$f(x) = C(n, x) px(1-p)n–x$ -
The negative binomial distribution with parameters r and p has PMF:
$f(x) = C(r + x – 1, x) pr(1-p)x$ -
The exponential distribution parameterized in terms of the rate λ has PDF:
$f(x) = λ exp(-λ x)$ -
The gamma distribution parameterized in terms of the rate has PDF:
$f(x) = βα xα-1exp(-β x) / Γ(α)$ -
The Poisson distribution has one parameter λ and PMF
$f(x) = exp(-λ) λx/ x!$ -
The normal distribution parameterized in terms of precision
$τ (τ = 1/σ2)$ has PDF:$$f(x) = (τ/2π)1/2 exp( -τ(x – μ)2/2 )$$
For each sampling distribution, assume we have data
-
If the sampling distribution for x is binomial(m, p) with m known, and the prior distribution is beta(α, β), the posterior distribution for p is
$beta(α + Σxi, β + mn – Σx_i)$ . The Bernoulli is the special case of the binomial with m = 1. -
If the sampling distribution for x is negative binomial(r, p) with r known, and the prior distribution is beta(α, β), the posterior distribution for p is
$beta(α + nr, β + Σxi)$ . The geometric is the special case of the negative binomial with r = 1. -
If the sampling distribution for x is gamma(α, β) with α known, and the prior distribution on β is gamma(α0, β0), the posterior distribution for β is
$gamma(α0 + nα, β0 + Σxi)$ . The exponential is a special case of the gamma with α = 1. -
If the sampling distribution for x is Poisson(λ), and the prior distribution on λ is gamma(α0, β0), the posterior on λ is
$gamma(α0 + Σxi, β0 + n)$ . -
If the sampling distribution for x is normal(μ, τ) with τ known, and the prior distribution on μ is normal(μ0, τ0), the posterior distribution on μ is
$normal((μ0 τ0 + τ Σxi)/(τ0 + nτ), τ0 + nτ)$ . -
If the sampling distribution for x is normal(μ, τ) with μ known, and the prior distribution on τ is gamma(α, β), the posterior distribution on τ is
$gamma(α + n/2, (n-1)S2)$ where S2 is the sample variance. -
If the sampling distribution for x is lognormal(μ, τ) with τ known, and the prior distribution on μ is normal(μ0, τ0), the posterior distribution on μ is
$normal((μ0 τ0 + τ Πxi)/(τ0 + nτ), τ0 + nτ)$ . -
If the sampling distribution for x is lognormal(μ, τ) with μ known, and the prior distribution on τ is gamma(α, β), the posterior distribution on τ is
$gamma(α + n/2, (n-1)S2)$ where S2 is the sample variance.
A compendium of Conjugate prios - A reference document for details on selecting a suitable prior for a given distribution.
In this lesson we looked at the idea of using a conjugate priors for Bayesian inferencing. These section might be getting a bit heavy on the mathematics, without showing great details on the applications of Bayes in real life scenarios, other than coin toss experiments. The understanding from this chapter will be used throughout the machine learning sections of our course. And you will have ample chance to revisit these equations for a deeper dive. At this stage , we expect you to have a conceptual understanding of the Bayesian Inference process. Let's move on to our last lab to run some simulations around Bayesian inferencing, using the good old coin toss examples.