Tuesday, May 8, 2018

Basic steps of "Bayesian Machine Learning" ①

Basic steps of "Bayesian Machine Learning" ①

Here I'll share the basic steps of "Bayesian Machine Learning". To put it simply, Bayesian machine learning consists of the following four steps, where $D$ is the observed data and $\theta$ is the unknown parameter.

  1. Create "model" of joint probability $p(D,\theta)$ as bellow.
    $$p(D,\theta) = p(D|\theta)p(\theta)$$
    This means setting how the data would be generated. $p(D|\theta)$ is called likelihood function.
  2. Apply a prior distribution to $\theta$.
    If you use a conjugate prior (*1), the calculations for the posterior distribution and the predictive distribution become effortless.
    (*1) "Conjugate distribution":
    If the posterior distribution $p(\theta|D)$ is in the same probability distribution family as the prior distribution $p(\theta)$, the prior and posterior are called conjugate distributions, and the prior is called the "conjugate prior" for the likelihood function. For example, the beta distribution is the conjugate prior of the Bernoulli likelihood, as worked out in the next section. Source: Conjugate prior
  3. Compute the posterior distribution $p(\theta|D)$.
    $$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}$$
    In machine learning, this process is called "training".
  4. Calculate the predictive distribution for a new observation $x_*$ (see the sketch after this list).
    $p(x_*|D) = \int p(x_*|\theta)p(\theta|D)\,d\theta $ (*2)
    Incidentally, you can also predict before observing any data, as follows.
    $p(x_*) = \int p(x_*|\theta)p(\theta)\,d\theta = \int p(x_*,\theta)\,d\theta$

    (*2)
    $$\begin{aligned}p(x_*|D) &= \int p(x_*,\theta|D)\,d\theta \\ &= \int \frac{p(x_*,\theta,D)}{p(D)}\,d\theta \\ &= \int \frac{p(x_*|\theta)p(D|\theta)p(\theta)}{p(D)}\,d\theta \\ &= \int p(x_*|\theta)p(\theta|D)\,d\theta\end{aligned}$$
    (The third equality uses the model's assumption that $x_*$ and $D$ are conditionally independent given $\theta$.)
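As a quick numerical sanity check of (*2), here is a minimal sketch (my own addition, not part of the original steps) for the conjugate Beta-Bernoulli case derived in the next section; the posterior parameters a_hat and b_hat are hypothetical values.

from scipy.stats import beta
from scipy.integrate import quad

# Hypothetical posterior Beta(a_hat, b_hat) over the Bernoulli parameter mu.
a_hat, b_hat = 5.0, 3.0
posterior = beta(a_hat, b_hat)

# Left-hand side of (*2): p(x* = 1 | D) = integral of p(x* = 1 | mu) p(mu | D) dmu,
# where p(x* = 1 | mu) = mu for a Bernoulli likelihood.
pred_numeric, _ = quad(lambda mu: mu * posterior.pdf(mu), 0, 1)

# For a Beta posterior this integral is just the posterior mean a_hat / (a_hat + b_hat).
pred_analytic = a_hat / (a_hat + b_hat)

print(pred_numeric, pred_analytic)  # both 0.625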

Example of training with a Bernoulli distribution

1. Mathematical analysis

In this example, I assume that the likelihood function follows a Bernoulli distribution. In that case, the parameter $\mu$ of the Bernoulli distribution must satisfy $\mu \in (0,1)$. Therefore adopting the beta distribution, whose support is $(0,1)$, as the prior distribution is a natural choice :)
Let us assume $X = (x_1, x_2, \dots, x_n)$ is observed; then the posterior distribution looks as below. $$p(\mu|X) = \frac{p(X|\mu)p(\mu)}{p(X)}$$
Since the observations $x_1, x_2, \dots, x_n$ are independent, $$p(\mu|X) = \frac{\prod^{n}_{i=1}p(x_i|\mu)\,p(\mu)}{p(X)}$$
Taking the logarithm of both sides,
$$\log p(\mu|X) = \sum^{n}_{i=1}\log p(x_i|\mu) + \log p(\mu) + \mathrm{const}$$ Applying the Bernoulli distribution and the beta distribution,
$$\begin{aligned}\log p(\mu|X) &= \sum^{n}_{i=1} \log \mu^{x_i}(1-\mu)^{1-x_i} + \log C_B(a,b)\mu^{a-1}(1-\mu)^{b-1} + \mathrm{const} \\ &= \left(a-1+\sum_{i=1}^{n}x_i\right)\log\mu + \left(b-1+\sum_{i=1}^{n}(1-x_i)\right)\log(1-\mu) + \mathrm{const}\end{aligned}$$
Then, compare with the logarithm of the beta distribution.
$$\log Beta(x|a,b) = (a-1)\log x + (b-1)\log(1-x) + \log C_B(a,b)$$
Therefore, $$p(\mu|X) = Beta(\mu|\hat{a},\hat{b}),\quad \text{where}\ \hat{a} = a + \sum_{i=1}^{n}x_i,\ \hat{b} = b+\sum_{i=1}^{n}(1-x_i)$$
From this equation, we can see how the beta prior is updated into the posterior: the number of 1s in the observed data (e.g. heads in a coin toss) is added to the parameter $a$, and the number of 0s (e.g. tails in a coin toss) is added to the parameter $b$.
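As a concrete instance of this update rule (with my own example numbers): starting from the prior $Beta(\mu|1,1)$ and observing $X = (1,1,0,1)$, i.e. three 1s and one 0,
$$\hat{a} = 1 + 3 = 4,\quad \hat{b} = 1 + 1 = 2,\quad p(\mu|X) = Beta(\mu|4,2)$$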

2. Implementation with Python 

Here I'll show how the prior distribution is updated with observed data (the likelihood function).
For the Bernoulli and beta distributions, scipy provides the useful classes "scipy.stats.bernoulli" and "scipy.stats.beta" respectively.

In [1]:
from scipy.stats import bernoulli,beta
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# Tentatively, let us assume the prior distribution is Beta(x|1,1).
x = np.linspace(0,1,100)
beta_prior = beta(1,1)
y_prior = beta_prior.pdf(x)
plt.plot(x,y_prior)
Out[2]:
[<matplotlib.lines.Line2D at 0x11ce039b0>]

In other words, we have no preliminary information about the Bernoulli event at all: under Beta(x|1,1), every value of the parameter is equally likely. How unpredictable the event is!
Now let us assume the true parameter of the event's Bernoulli distribution is 0.8. After observing data, we'll see how the prior distribution changes.

In [3]:
# Draw 5, 25, 50 and 100 samples from a Bernoulli distribution with parameter 0.8
# and plot the resulting posterior Beta(1 + #ones, 1 + #zeros) for each sample size.
num_of_observe = [5,25,50,100]
for num in num_of_observe:
    data_ber = bernoulli.rvs(0.8, size=num, random_state=4)
    head_num = data_ber.sum()                    # number of observed 1s (heads)
    tail_num = data_ber.shape[0] - head_num      # number of observed 0s (tails)
    beta_pos = beta(1 + head_num, 1 + tail_num)  # posterior Beta(a_hat, b_hat)
    y_pos = beta_pos.pdf(x)
    plt.plot(x, y_pos, label='{} observed'.format(num))
    
plt.legend()
plt.title('Transition of "prior distribution"')
Out[3]:
<matplotlib.text.Text at 0x11f390dd8>

You can see the updated prior (i.e. the posterior) distribution concentrating around the true parameter of 0.8 :)
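As a possible follow-up (my own addition, not part of the original notebook), step 4 of the recipe can be checked here as well: for the Beta-Bernoulli model the posterior predictive $p(x_* = 1|D)$ equals the posterior mean $\hat{a}/(\hat{a}+\hat{b})$, and scipy's interval method gives a 95% credible interval for $\mu$.

data_ber = bernoulli.rvs(0.8, size=100, random_state=4)
head_num = data_ber.sum()
tail_num = 100 - head_num
beta_pos = beta(1 + head_num, 1 + tail_num)

print(beta_pos.mean())          # posterior predictive p(x* = 1 | D), close to 0.8
print(beta_pos.interval(0.95))  # 95% equal-tailed credible interval for mu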
