Saturday, June 16, 2018

Linear regression with Bayesian inference②

This article is a continuation of Linear regression with Bayesian inference①. In this article, I will share with you how to "train" the prior distribution and how to build the "predictive model" for linear regression.

0. Mathematical analysis of the posterior distribution

In order to train the parameter $w$, what we want to know is the following probability distribution.
$$\begin{eqnarray}p(w|Y,X) &=& \frac{p(Y,X|w)p(w)}{p(Y,X)}\\ &=& \frac{p(w)p(Y|X,w)p(X)}{p(Y|X)p(X)}\\ &=& \frac{p(w)p(Y|X,w)}{p(Y|X)}\\ &=& \frac{N(w|m,\Lambda^{-1})\prod^{N}_{n}N(y_n|w^Tx_n,\lambda^{-1})}{p(Y|X)}\\ \end{eqnarray}$$
In order to know what shape the posterior distribution takes, we take the logarithm of both sides: $$\begin{eqnarray}\log p(w|Y,X) &=&\log\left( \frac{1}{\sqrt{|2\pi\Lambda^{-1}|}}\exp \left\{-\frac{1}{2}(w-m)^T\Lambda(w-m)\right\}\right) + \sum_n^N\log \left(\frac{1}{\sqrt{2\pi\lambda^{-1}}}\exp\left\{-\frac{1}{2}\lambda(y_n - w^Tx_n)^2\right\}\right) +\ const\\ &=& -\frac{1}{2}\left\{w^T\Lambda w - w^T\Lambda m - m^T\Lambda w + m^T\Lambda m + \sum_n^N\lambda\left(y_n^2 - w^Tx_ny_n - y_nw^Tx_n + (w^Tx_n)^2\right)\right\} +\ const\\ \end{eqnarray}$$
Since $\Lambda$ is a symmetric matrix, $w^T\Lambda m = m^T\Lambda w$. I know this equation looks a little chaotic, but hang in there; every cloud has a silver lining. Consequently, we get the following result.
$$\begin{eqnarray}\log p(w|Y,X) &=& -\frac{1}{2}\left\{w^T(\lambda \sum^N_n x_n x_n^T + \Lambda)w -2w^T(\lambda \sum_n^Ny_nx_n+\Lambda m) \right\} +\ const \end{eqnarray}$$

Now we can see the equation is a quadratic form in the parameter $w$. Therefore the posterior distribution takes the form of a multivariate Gaussian distribution. $$p(w|X,Y) = N(w|\hat{m},\hat{\Lambda}^{-1})$$ $$such\ that,$$ $$\hat{\Lambda} = \lambda\sum_n^N x_nx_n^T + \Lambda$$ $$\hat{m} = \hat{\Lambda}^{-1}(\lambda\sum_n^Ny_nx_n + \Lambda m)$$
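Before moving on, here is a minimal NumPy sketch of these two update equations (my own vectorized illustration, not part of the notebook below; it assumes X is a matrix whose columns are the feature vectors $x_n$ and y is the corresponding vector of targets).

import numpy as np

def posterior_params(X, y, m, Lambda, lam):
    """
    Sketch of the Gaussian posterior over w.
    X         : (D, N) matrix whose columns are the feature vectors x_n
    y         : (N,) vector of targets
    m, Lambda : prior mean (D,) and prior precision (D, D)
    lam       : scalar noise precision lambda
    """
    Lambda_hat = lam * X @ X.T + Lambda                            # lambda * sum_n x_n x_n^T + Lambda
    m_hat = np.linalg.solve(Lambda_hat, lam * X @ y + Lambda @ m)  # Lambda_hat^-1 (lambda * sum_n y_n x_n + Lambda m)
    return m_hat, Lambda_hat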

1. Mathematical analysis of the predictive model

As a predictive model, what we'd like to know is $p(y^*|x^*,X,Y)$. Since we have just shown that the prior distribution and the posterior distribution take the same form (a multivariate Gaussian), we will first compute $p(y^*|x^*)$ using the prior, and then we can simply replace the prior distribution with the posterior distribution.
Now consider $p(w|y^*,x^*)$. It can be rewritten as below.

$$\begin{eqnarray}p(w|y^*,x^*) &=& \frac{p(y^*,x^*|w)p(w)}{p(y^*,x^*)}\\ &=& \frac{p(y^*|x^*,w)p(w)p(x^*)}{p(y^*|x^*)p(x^*)}\\ &=& \frac{p(y^*|x^*,w)p(w)}{p(y^*|x^*)}\end{eqnarray}$$$$\therefore \ \ p(y^*|x^*)= \frac{p(y^*|x^*,w)p(w)}{p(w|y^*,x^*)}$$

Take the logarithm of both sides: $$\log p(y^*|x^*)= \log p(y^*|x^*,w) - \log p(w|y^*,x^*) + const$$ With respect to $p(y^*|x^*,w)$, we assumed it follows a Gaussian distribution, and we already computed $\log p(w|y^*,x^*)$ when we worked on the posterior distribution. Hence,
$$\log p(y^*|x^*) = -\frac{1}{2}\left\{ \left(\lambda-\lambda^2x_*^T(\lambda x_*x_*^T + \Lambda)^{-1}x_*\right)y_*^2 -2\left(\lambda x_*^T(\lambda x_*x_*^T + \Lambda)^{-1}\Lambda m\right)y_*\right\} + const$$ Since the above equation is a quadratic function with respect to $y_*$, we can tell the predictive model is a Gaussian distribution :) Completing the square (and simplifying the precision with the Woodbury matrix-inversion identity), we consequently get $$ p(y^{*}\ |\ x^{*}) = N(y^{*}\ |\ \mu^{*}, \lambda^{*-1})$$
$$ \mu^{*} = m^{T}x^{*}$$ $$ \lambda^{*-1} = \lambda^{-1} + x^{*T}\Lambda^{-1}x^{*}$$
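To make the result concrete, here is a small sketch of the predictive mean and variance for a single input $x^*$ (again my own illustration; the batched version used for plotting appears in the implementation below). To predict with the trained model, simply pass the posterior $\hat{m}$ and $\hat{\Lambda}$ instead of the prior $m$ and $\Lambda$.

import numpy as np

def predictive_params(x_star, m, Lambda, lam):
    """
    Sketch of the mean and variance of p(y* | x*).
    x_star    : (D,) feature vector of the new input
    m, Lambda : mean (D,) and precision (D, D) of the prior (or posterior) over w
    lam       : scalar noise precision lambda
    """
    mu_star = m @ x_star                                             # mu* = m^T x*
    var_star = 1.0 / lam + x_star @ np.linalg.solve(Lambda, x_star)  # lambda*^-1 = lambda^-1 + x*^T Lambda^-1 x*
    return mu_star, var_star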

2. Implementation

Here we're gonna compare two models: one that is not trained and one that is trained.

In [1]:
import numpy as np
import numpy.linalg as LA
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy.random as random
%matplotlib inline
In [2]:
# Required functions
def make_inputx(x):
    """
    Create matrix consists of x features as 
    column vector.
    
    Arguments
    ----------------------------
    x : values of x which assume numpy.
    """
    ones= np.ones_like(x).reshape(1,-1)
    x1=(x**1).reshape(1,-1)
    x2=(x **2).reshape(1,-1)
    x3=(x **3).reshape(1,-1)
    
    return np.concatenate([ones,x1,x2,x3],axis=0) 

def lambda_aster_inv(x_aster, Lambda, noize_lambda=0.2):
    """
    Compute precision of predictive model.
    Tentatively lambda^-1 is 0.2. 
    """
    return np.diagonal((1/noize_lambda) + x_aster.T@LA.inv(Lambda)@x_aster)

No training. Prediction only with the prior distribution.

For the time being, $m$ is $\vec{0}$, $\Lambda^{-1}$ is the identity matrix, and the noise precision $\lambda$ is 0.2.

In [3]:
x = np.linspace(-1,7,101)

# assumption
x_input = make_inputx(x)
m = np.zeros(4)
Lambda = np.eye(4)

# mu* and lambda*^-1 of the predictive model
mu_astar = m@x_input
lambda_astar = lambda_aster_inv(x_input,Lambda)

# Compute expected value
expect_vals = np.array([norm.expect(loc=mu_tmp,scale=np.sqrt(lambda_tmp)) 
                        for mu_tmp,lambda_tmp in zip(mu_astar,lambda_astar)])
plus_std = expect_vals + np.sqrt(lambda_astar)
minus_std = expect_vals - np.sqrt(lambda_astar)

# Plot predictive model with +- standard deviation
plt.fill_between(x, plus_std,minus_std,facecolor='deepskyblue')
Out[4]:
<matplotlib.collections.PolyCollection at 0x128e52fd0>

As you can see, the regression band is extremely wide: the predictive variance is huge. Of course, there is no observed data so far, so the prior distribution has not been updated at all.

Training. Prediction with the posterior distribution

Let us assume we observed the following data. Let's see how the predictive model goes :)

In [5]:
# Create data 
random.seed(42)
data_x = random.uniform(-1,7,20)
# Add Gaussian noise to the sine values
data_y = 5 * np.sin(data_x) + random.normal(loc=0,scale=0.2,size=20)

plt.scatter(data_x,data_y)
Out[6]:
<matplotlib.collections.PathCollection at 0x127302470>
In [7]:
# Functions required for posterior distribution
def lambda_posterior(x,noize_lambda,Lambda):
    """
    Compute precision matrix of posterior distribution
    """
    
    return (noize_lambda * x@x.T) + Lambda

def mu_posterior(x,y,mu,noize_lambda,Lambda):
    """
    Compute mean of posterior distribution
    """
    yx=0
    
    # Accumulate sum_n y_n * x_n
    for i in range(y.shape[0]):
        yx = yx + (y[i] * x[:,i])
    
    return LA.inv(lambda_posterior(x,noize_lambda,
                                   Lambda))@(noize_lambda*yx + Lambda@mu)
In [8]:
# mu* and lambda*^-1 of the predictive model (using the posterior)
data_x_input = make_inputx(data_x)
mu_astar_pos = mu_posterior(data_x_input,data_y,mu=m,
                            noize_lambda=0.2,Lambda=Lambda).T@x_input
lambda_astar_pos = np.diagonal((1/0.2)+
                               (x_input.T@LA.inv(lambda_posterior(
                                   data_x_input,0.2,Lambda))@x_input))

# Compute expected value 
expect_vals_post = np.array([norm.expect(loc=mu_tmp,
                                         scale=np.sqrt(lambda_tmp)) for mu_tmp,
                             lambda_tmp in zip(mu_astar_pos,lambda_astar_pos)])
plus_std_post = expect_vals_post + np.sqrt(lambda_astar_pos)
minus_std_post = expect_vals_post - np.sqrt(lambda_astar_pos)
# Plot regression line
plt.fill_between(x,plus_std_post,minus_std_post,facecolor = 'tomato',alpha=0.3)
plt.plot(x,expect_vals_post, color='tomato')
plt.scatter(data_x,data_y)
Out[9]:
<matplotlib.collections.PathCollection at 0x129d350b8>
