Thursday, October 24, 2019

Marginal Likelihood of Gaussian Process classification

This article is a continuation of the following post.

Gaussian Process for classification

When you work on model selection or set hyperparameters to specific values by type-2 maximum likelihood, computing the marginal likelihood is imperative. However, the marginal likelihood of Gaussian process classification is intractable due to the presence of a non-linear function. Therefore, in this article, I will show an example of computing the marginal likelihood with the Laplace approximation.

Marginal Likelihood of Gaussian Process classification

1. Why is it intractable?

The marginal likelihood of Gaussian process classification can be expressed as follows,

$$\begin{eqnarray} p(t_N) &=& \int p(t_N|a_N)p(a_N)da_N \\ &=& \int \prod_{n=1}^N \sigma(a_n)^{t_n}(1-\sigma(a_n))^{1-t_n} N(a_N|0,C_N)da_N \end{eqnarray} \tag{1}$$


Here $p(a_N) = N(a_N|0,C_N)$ is the Gaussian process prior over the latent values, with $C_N$ the Gram matrix of the kernel function. Because of the presence of the sigmoid function inside the product, this integral is intractable, hence an approximation is required.
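For intuition, even the one-dimensional case of this integral (a single observation with a standard normal prior on its latent value) has no closed form and has to be evaluated numerically. The snippet below is only an illustration under that assumption and is not part of the article's code.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# One observation t = 1 with an N(0, 1) prior on its latent value a:
# p(t=1) = integral of sigmoid(a) * N(a|0,1) da has no closed form, so integrate numerically
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
marginal, _ = quad(lambda a: sigmoid(a) * norm.pdf(a), -np.inf, np.inf)
print(marginal)  # ~0.5, as expected from the symmetry of the sigmoid and the zero-mean prior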

2. Laplace approximation

I will introduce the Laplace approximation in a nutshell.
Let's say $p(x)$ is a probability distribution whose normalization constant is intractable. Then consider $f(x)$ such that

$$p(x) = \frac{1}{Z} f(x)$$

where $Z = \int f(x)dx$ is the normalization term. We assume $f(x)$ itself is easy to evaluate.
By applying a second-order Taylor expansion to the logarithm of $f(x)$ around its mode, $f(x)$ can be approximated as follows,

$$ f(x) \approx f(x_0)\exp\left(-\frac{1}{2}(x-x_0)^{T} A (x-x_0)\right)$$

where $x_0$ satisfies $\frac{d}{dx}\ln f(x)|_{x=x_0} = 0$ and $A = -\nabla\nabla \ln f(x)|_{x=x_0}$. Integrating this Gaussian form over $x$ gives

$$\int f(x)dx \approx f(x_0)\frac{(2\pi)^{\frac{d}{2}}}{|A|^{\frac{1}{2}}}$$
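As a quick sanity check (again, not part of the article's code), the following sketch compares the Laplace approximation against numerical quadrature for a hypothetical one-dimensional function $f(x)$; the function and the finite-difference Hessian are purely illustrative.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

# Illustrative unnormalized density: a Gaussian times a sigmoid (no closed-form integral)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
f = lambda x: np.exp(-0.5 * x ** 2) * sigmoid(3 * x + 1)

# mode x0 of f (equivalently of ln f)
x0 = minimize_scalar(lambda x: -np.log(f(x))).x
# A = -d^2/dx^2 ln f(x) at x0, estimated by central finite differences
eps = 1e-4
A = -(np.log(f(x0 + eps)) - 2 * np.log(f(x0)) + np.log(f(x0 - eps))) / eps ** 2

laplace = f(x0) * np.sqrt(2 * np.pi / A)   # f(x0) * (2*pi)^(1/2) / |A|^(1/2)
exact, _ = quad(f, -np.inf, np.inf)
print(laplace, exact)                      # the two values should be close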

We can apply this approximation to equation (1) by taking $f(a_N) = p(t_N|a_N)p(a_N)$, finding its mode $a_N^{\ast}$ (for example with Newton's method), and evaluating the negative Hessian of $\ln f$ at that mode.
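As a rough sketch of that procedure (this is not the gp_classifier implementation linked below; the function name and signature are hypothetical), one might compute the Laplace-approximated log marginal likelihood from a precomputed Gram matrix C and binary targets t, following the treatment of the Laplace approximation for Gaussian process classification in Bishop's PRML.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_log_marginal_likelihood(C, t, n_iter=100, tol=1e-8):
    """Laplace-approximated log marginal likelihood of GP classification.

    C : (N, N) Gram matrix of the training inputs
    t : (N,) binary targets in {0, 1}
    """
    t = np.asarray(t, dtype=float)
    N = len(t)
    a = np.zeros(N)                       # latent values, start at the prior mean

    # Newton iterations for the mode a* of p(t_N|a_N) p(a_N)
    for _ in range(n_iter):
        s = sigmoid(a)
        W = np.diag(s * (1 - s))          # negative Hessian of the log likelihood
        a_new = C @ np.linalg.solve(np.eye(N) + W @ C, t - s + W @ a)
        if np.max(np.abs(a_new - a)) < tol:
            a = a_new
            break
        a = a_new

    s = sigmoid(a)
    W = np.diag(s * (1 - s))
    # Psi(a*) = ln p(t_N|a*) + ln p(a*)
    log_lik = np.sum(t * np.log(s) + (1 - t) * np.log(1 - s))
    log_prior = -0.5 * a @ np.linalg.solve(C, a) \
                - 0.5 * np.linalg.slogdet(2 * np.pi * C)[1]
    # ln p(t_N) ~ Psi(a*) - 1/2 ln|W + C^{-1}| + (N/2) ln(2*pi)
    log_det = np.linalg.slogdet(np.linalg.inv(C) + W)[1]
    return log_lik + log_prior - 0.5 * log_det + 0.5 * N * np.log(2 * np.pi)

Given the Gram matrix of, say, a Gaussian kernel on the toy data below, this function is assumed to behave like the get_marginal_likelihood() values reported in the plots at the end of the article.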

3. Application to toy data

Now, Gaussian process classification will be applied to the following toy data.

In [1]:
from gp_classifier import gp_classifier
import kernel
import numpy as np
from scipy.stats import multivariate_normal,uniform
import math
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('dark_background')
In [2]:
mean_list = np.array([[-2,1],[1,-1.5],[1.5,0],
                      [0,1.5],[0,3],[3,3],[3,-3.5],
                      [-3,4],[-1,-4],[-4,-4]])

for i,mean in enumerate(mean_list):
    # number of samples for this cluster
    samp_num = math.floor(uniform.rvs(loc=10,scale=15))
    # off-diagonal covariance, kept in (-0.15, 0) so that the 2x2
    # covariance matrices below remain positive definite
    cov_val = uniform.rvs(loc=-0.15,scale=0.15)

    if i == 0:
        # label
        labels = np.full(samp_num,0)
        # data
        X = multivariate_normal.rvs(
        mean=mean,cov=[[1.5,cov_val],[cov_val,1.5]],
        size=samp_num)
    else:
        # label
        labels = np.concatenate([labels,np.full(samp_num,i%2)])
        # data
        X = np.concatenate([X,multivariate_normal.rvs(
            mean=mean,cov=[[0.2,cov_val],[cov_val,0.2]],
            size=samp_num)])
# plot toy data        
for label, col in zip([0,1],['blue','red']):
    mask = label == labels
    plt.scatter(X[mask,0], X[mask,1],color=col)

In order to compare the values of the marginal likelihood, four Gaussian process classifiers with different kernel functions are fitted here,

  • linear kernel
  • gaussian kernel
  • periodical kernel
  • linear and gaussian kernel

You can check the code used in this article on GitHub.
gp_classifier.py
kernel.py
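For readers who do not want to open the repository, a rough sketch of what the four kernel functions might look like is given below; the signatures and hyperparameters are assumptions and may differ from the actual kernel.py.

import numpy as np

# Hypothetical kernel functions: each maps (N, D) and (M, D) arrays to an (N, M) Gram matrix
def linear_kernel(X1, X2, c=1.0):
    return X1 @ X2.T + c

def gauss_kernel(X1, X2, length_scale=1.0):
    sq_dist = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def periodic_kernel(X1, X2, length_scale=1.0, period=1.0):
    dist = np.sqrt(np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1))
    return np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / length_scale ** 2)

def linear_gauss_kernel(X1, X2):
    return linear_kernel(X1, X2) + gauss_kernel(X1, X2)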

In [3]:
import gp_classifier
import importlib
importlib.reload(gp_classifier)
from gp_classifier import gp_classifier
In [4]:
plt.figure(figsize=(14,14))
kernel_dict = {'linear kernel':kernel.linear_kernel,
               'periodical kernel':kernel.periodic_kernel,
               'gauss kernel':kernel.gauss_kernel,
              'linear+gauss kernel': kernel.linear_gauss_kernel}

for i,(kernel_name, kernel_func) in enumerate(kernel_dict.items()):
    plt.subplot(2,2,i+1)
    # create class of gaussian process classifier
    gclf = gp_classifier(kernel_func=kernel_func)
    gclf.fit(X,labels)

    x = np.linspace(-4,4,101)
    y = np.linspace(-4,5,101)

    xx, yy = np.meshgrid(x,y)

    zz = gclf.predict_proba(np.vstack([xx.ravel(),yy.ravel()]).T)

    ctf = plt.contourf(xx,yy,zz[:,1].reshape(xx.shape),cmap='coolwarm')
    plt.colorbar(ctf)

    # Plot toy data
    for label, col in zip(np.unique(labels),['blue','red']):
        mask = np.where(labels == label)[0]
        plt.scatter(X[mask,0],X[mask,1],color=col)

    plt.xlim(-4,4)
    plt.ylim(-4,5)

    plt.title('{} log mlh = {:.2f}'.format(kernel_name,gclf.get_marginal_likelihood()),
              fontsize=14)
plt.show()

The Gaussian kernel has the highest log marginal likelihood among them. The periodic kernel seems to have difficulty classifying this data set.

4. Reference

  • Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006