Basic steps of "Bayesian Machine Learning" ①¶
Here I'll share the basic steps of "Bayesian Machine Learning". To put it simply, "Bayesian Machine Learning" consists of the following four steps, where $D$ is the observed data and $\theta$ is the unknown parameter.
- Create "model" of joint probability $p(D,\theta)$ as bellow.
$$p(D,\theta) = p(D|\theta)p(\theta)$$
This means setting how the data would be generated. $p(D|\theta)$ is called likelihood function. - Apply prior distribution to $\theta$.
If you apply idea of conjugate distribution(*1), calculation for posterior distribution and predictive distribution will be effortless.
(*1) "Conjugate distribution" :
If the posterior distribution $p(\theta|D)$ is in the same probability distribution family as the prior probability distribution $p(\theta)$, the prior and posterior distribution are called conjugate distribution, and prior distribution is called "conjugate prior" for the likelihood function. sorce : Conjugate prior - Compute posterior distribution $p(\theta|D)$.
$$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}$$
In machine leaning, this process is called "training". Calculate predictive distribution.
$p(x_*| D) = \int p(x_*|\theta)p(\theta|D) d\theta $ (*2)
Incidentally, you can predict without any observed value as following.
$p(x_*) = \int p(x_*|\theta)p(\theta) d\theta = \int p(x,\theta) d\theta$(*2)
$$\begin{equation}p(x_*| D) = \int p(x_*,\theta|D) d\theta \\ =\int \frac{p(x_*, \theta,D)}{p(D)} d\theta \\ = \int \frac{p(x_*|\theta)p(D|\theta)p(\theta)}{p(D)}d\theta \\ = \int p(x_*|\theta)p(\theta|D) d\theta\end{equation} $$
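Before moving to a concrete example, here is a minimal numerical sketch of all four steps (my own illustration, not part of the recipe above): it discretizes $\theta$ on a grid and applies Bayes' rule directly for a Bernoulli likelihood. The grid resolution and the toy data are arbitrary assumptions for illustration.

```python
import numpy as np

# Hypothetical illustration: Bayes' rule on a discretized parameter grid.
theta = np.linspace(0.001, 0.999, 999)   # grid over the unknown parameter
prior = np.ones_like(theta)              # flat prior p(theta)
prior /= prior.sum()

data = np.array([1, 1, 0, 1])            # assumed toy observations (1 = head)

# Likelihood p(D|theta) for i.i.d. Bernoulli observations
likelihood = theta ** data.sum() * (1 - theta) ** (len(data) - data.sum())

# Posterior p(theta|D) = p(D|theta) p(theta) / p(D), normalized on the grid
posterior = likelihood * prior
posterior /= posterior.sum()

# Predictive p(x*=1|D) = sum over the grid of p(x*=1|theta) p(theta|D)
pred_head = (theta * posterior).sum()
print(pred_head)   # ~0.667, matching the Beta(4,2) result derived below
```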
Example of training with the Bernoulli distribution
1. Mathematical analysis
In this example, I assume that the likelihood function follows a Bernoulli distribution. In this case, the parameter $\mu$ of the Bernoulli distribution must satisfy $\mu \in (0,1)$. Therefore, adopting the Beta distribution, which is defined on $(0,1)$, as the prior distribution is a natural choice :)
Let us assume $X = (x_1, x_2, \dots, x_n)$ is observed. The posterior distribution looks as below.
$$p(\mu|X) = \frac{p(X|\mu)p(\mu)}{p(X)}$$
Since $x_1, x_2, \dots, x_n$ are independent given $\mu$,
$$p(\mu|X) = \frac{\prod_{i=1}^{n} p(x_i|\mu)\, p(\mu)}{p(X)}$$
Taking the logarithm of both sides (the term $-\log p(X)$ does not depend on $\mu$ and is absorbed into the constant),
$$\log p(\mu|X) = \sum_{i=1}^{n}\log p(x_i|\mu) + \log p(\mu) + \text{const}$$
Applying the Bernoulli distribution and the Beta distribution (the normalizing constant $\log C_B(a,b)$ is also absorbed into the constant),
$$\begin{aligned} \log p(\mu|X) &= \sum_{i=1}^{n} \log \mu^{x_i}(1-\mu)^{1-x_i} + \log C_B(a,b)\mu^{a-1}(1-\mu)^{b-1} + \text{const} \\ &= \Bigl(a-1+\sum_{i=1}^{n}x_i\Bigr)\log\mu + \Bigl(b-1+\sum_{i=1}^{n}(1-x_i)\Bigr)\log(1-\mu) + \text{const} \end{aligned}$$
Then, compare with the logarithm of the Beta distribution:
$$\log Beta(x|a,b) = (a-1)\log x + (b-1)\log(1-x) + \log C_B(a,b)$$
Therefore,
$$p(\mu|X) = Beta(\mu|\hat{a},\hat{b}) \quad \text{where} \quad \hat{a} = a + \sum_{i=1}^{n}x_i,\quad \hat{b} = b + \sum_{i=1}^{n}(1-x_i)$$
From this equation, we can tell how the posterior Beta distribution is updated: the number of 1s in the observed data (e.g., heads in a coin toss) is added to parameter $a$, and the number of 0s (e.g., tails) is added to parameter $b$. For example, starting from $Beta(\mu|1,1)$ and observing three heads and one tail gives $Beta(\mu|4,2)$.
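As a side note, for this conjugate pair the predictive distribution from step 4 also has a closed form, since the mean of $Beta(\mu|\hat{a},\hat{b})$ is $\hat{a}/(\hat{a}+\hat{b})$:

$$p(x_*=1|X) = \int_0^1 \mu\, Beta(\mu|\hat{a},\hat{b})\, d\mu = \frac{\hat{a}}{\hat{a}+\hat{b}}$$

For instance, with $\hat{a}=4$ and $\hat{b}=2$, the predictive probability of a head is $4/6 \approx 0.67$.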
2. Implementation with Python
Here I show how the prior distribution is updated by observed data (the likelihood function).
For the Bernoulli and Beta distributions, there are useful SciPy objects, scipy.stats.bernoulli and scipy.stats.beta, respectively.
```python
from scipy.stats import bernoulli, beta
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Tentatively, let us assume the prior distribution is Beta(x|1,1) (uniform).
x = np.linspace(0, 1, 100)
beta_prior = beta(1, 1)
y_prior = beta_prior.pdf(x)
plt.plot(x, y_prior)
```
In other words, we don't have any prior information about the Bernoulli event at all. How unpredictable the event is!
Let us assume that the true parameter of the event's Bernoulli distribution is 0.8. After observing data, we'll see how the prior distribution changes.
```python
# Draw samples from a Bernoulli distribution with parameter 0.8
# and plot the resulting posterior for several sample sizes.
num_of_observe = [5, 25, 50, 100]
for num in num_of_observe:
    data_ber = bernoulli.rvs(0.8, size=num, random_state=4)
    head_num = data_ber.sum()                    # number of 1s (heads)
    tail_num = data_ber.shape[0] - head_num      # number of 0s (tails)
    beta_pos = beta(1 + head_num, 1 + tail_num)  # Beta(1+heads, 1+tails) posterior
    y_pos = beta_pos.pdf(x)
    plt.plot(x, y_pos, label='{} observed'.format(num))
plt.legend()
plt.title('Transition of the posterior distribution')
```
You can see the posterior distribution concentrates around the true parameter of 0.8 :)
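As a quick sanity check, here is a small follow-up sketch (assuming the same data-generating settings as above) that compares the posterior mean from scipy with the closed-form predictive probability $\hat{a}/(\hat{a}+\hat{b})$ derived earlier:

```python
# Posterior after the largest sample (size=100); its mean should be near 0.8.
data_ber = bernoulli.rvs(0.8, size=100, random_state=4)
a_hat = 1 + data_ber.sum()          # prior a=1 plus number of heads
b_hat = 1 + 100 - data_ber.sum()    # prior b=1 plus number of tails
print(beta(a_hat, b_hat).mean())    # posterior mean via scipy
print(a_hat / (a_hat + b_hat))      # same value from the closed form
```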