
Tuesday, May 8, 2018

Basic steps of "Bayesian Machine Learning" ①

Here I'll share the basic steps of "Bayesian Machine Learning". To put it simply, "Bayesian Machine Learning" consists of the following four steps, where D is the observed data and θ is the unknown parameter.

  1. Create a "model" of the joint probability p(D,θ) as below.
    p(D,θ)=p(D|θ)p(θ)

    This means specifying how the data would be generated. p(D|θ) is called the likelihood function.
  2. Apply a prior distribution to θ.
    If you use the idea of a conjugate distribution (*1), the calculation of the posterior distribution and the predictive distribution will be effortless.
    (*1) "Conjugate distribution":
    If the posterior distribution p(θ|D) is in the same probability distribution family as the prior distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called the "conjugate prior" for the likelihood function. Source: Conjugate prior
  3. Compute the posterior distribution p(θ|D).
    p(θ|D) = p(D|θ) p(θ) / p(D)

    In machine learning, this process is called "training".
  4. Calculate the predictive distribution (a short numerical sketch of all four steps follows this list).
    p(x|D) = ∫ p(x|θ) p(θ|D) dθ (*2)
    Incidentally, you can also make predictions without any observed data, as follows.
    p(x) = ∫ p(x|θ) p(θ) dθ = ∫ p(x,θ) dθ

    (*2)
    p(x|D) = ∫ p(x,θ|D) dθ = ∫ p(x,θ,D)/p(D) dθ = ∫ p(x|θ) p(D|θ) p(θ)/p(D) dθ = ∫ p(x|θ) p(θ|D) dθ
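Before moving to the example below, here is a minimal numerical sketch of steps 1-4 (not part of the original four-step summary). It assumes a Bernoulli likelihood with a flat Beta(1,1) prior, a small hypothetical data set, and a simple grid over θ to approximate the integrals.

import numpy as np

# Hypothetical observations D (e.g. results of coin tosses); any 0/1 data would do.
data = np.array([1, 1, 0, 1, 1])
theta = np.linspace(1e-3, 1 - 1e-3, 1000)   # grid over the unknown parameter

# Step 1-2: model p(D,θ) = p(D|θ)p(θ) with a flat Beta(1,1) prior.
prior = np.ones_like(theta)
likelihood = np.prod([theta**x * (1 - theta)**(1 - x) for x in data], axis=0)

# Step 3: posterior p(θ|D) = p(D|θ)p(θ) / p(D), where the evidence p(D)
# is approximated by numerical integration over the grid.
evidence = np.trapz(likelihood * prior, theta)
posterior = likelihood * prior / evidence

# Step 4: predictive p(x=1|D) = ∫ p(x=1|θ) p(θ|D) dθ.
print(np.trapz(theta * posterior, theta))   # ≈ 5/7 for this data

The same number falls out of the conjugate Beta update derived in the next section, without any numerical integration.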

Example of training with the Bernoulli distribution

1. Mathematical analysis

In this example, I assume that the likelihood function follows a Bernoulli distribution. In that case, the parameter μ of the Bernoulli distribution must satisfy μ ∈ (0,1). Therefore, adopting a Beta distribution as the prior distribution is a natural choice :)
Let us assume X = (x1, x2, ..., xn) is observed. The posterior distribution then looks like this: p(μ|D) = p(X|μ) p(μ) / p(X)


Since the observations x1, ..., xn are independent, p(μ|D) = ( ∏_{i=1}^{n} p(xi|μ) ) p(μ) / p(X)

Take the logarithm of both sides,
log p(μ|D) = Σ_{i=1}^{n} log p(xi|μ) + log p(μ) + const
Apply the Bernoulli distribution and the Beta distribution,
log p(μ|D) = Σ_{i=1}^{n} log{ μ^xi (1-μ)^(1-xi) } + log{ C_B(a,b) μ^(a-1) (1-μ)^(b-1) } + const = (a - 1 + Σ_{i=1}^{n} xi) log μ + (b - 1 + Σ_{i=1}^{n} (1-xi)) log(1-μ) + const

Then, compare with the logarithm of the Beta distribution.
log Beta(x|a,b) = (a-1) log x + (b-1) log(1-x) + log C_B(a,b)

Therefore, p(μ|D) = Beta(μ|â, b̂) where â = a + Σ_{i=1}^{n} xi and b̂ = b + Σ_{i=1}^{n} (1-xi)

This equation tells us how the Beta prior is updated by the observed data (through the likelihood function): the number of 1s (e.g. heads in a coin toss) is added to the parameter a, and the number of 0s (e.g. tails) is added to the parameter b.
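For example, starting from a Beta(1,1) prior and observing five hypothetical tosses (1, 1, 0, 1, 1), the update gives â = 1 + 4 = 5 and b̂ = 1 + 1 = 2, so the posterior is Beta(μ|5,2), whose mean 5/7 ≈ 0.71 already leans toward "head" (the same value as the numerical sketch above).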

2. Implementation with Python 

Here I'll show how the prior distribution is updated by observed data (through the likelihood function).
For the Bernoulli distribution and the Beta distribution, there are useful classes "scipy.stats.bernoulli" and "scipy.stats.beta" respectively.

In [1]:
from scipy.stats import bernoulli,beta
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# Tentatively, let us assume the prior distribution is Beta(x|1,1).
x = np.linspace(0,1,100)
beta_prior = beta(1,1)
y_prior = beta_prior.pdf(x)
plt.plot(x,y_prior)
Out[2]:
[<matplotlib.lines.Line2D at 0x11ce039b0>]

To sum it up, we don't have any prior information about the Bernoulli event at all. How unpredictable the event is!
Now let us assume that the true parameter of the event's Bernoulli distribution is 0.8. After observing data, we'll see how the prior distribution changes.

In [3]:
# Draw samples from a Bernoulli distribution with parameter 0.8 for several sample sizes.
num_of_observe = [5,25,50,100]
for num in num_of_observe:
    data_ber = bernoulli.rvs(0.8, size=num,random_state=4)
    head_num = data_ber.sum()
    tail_num = data_ber.shape[0] - head_num
    beta_pos = beta(1+head_num,1+tail_num)
    y_pos = beta_pos.pdf(x)
    plt.plot(x,y_pos,label='{} observed'.format(num))
    
plt.legend()
plt.title('Transition of "posterior distribution"')
Out[3]:
<matplotlib.text.Text at 0x11f390dd8>

You can see the posterior distribution concentrates around the true parameter of 0.8 :)
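As a small follow-up (not in the original notebook), the predictive distribution from step 4 has a closed form under this conjugate model, p(x=1|D) = â/(â+b̂). It can be checked with the variables left from the last loop iteration above (100 observations):

a_hat, b_hat = 1 + head_num, 1 + tail_num
print(a_hat / (a_hat + b_hat))   # predictive probability of head, close to the true 0.8
print(beta_pos.mean())           # same value from scipy's frozen Beta posterior object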
