Tuesday, February 11, 2020

Infinite Poisson Mixture model with Dirichlet Process

In this post, I will share an infinite Poisson mixture model that uses a Dirichlet process as the prior on the number of clusters. When it comes to clustering, the number of classes 'k' is sometimes hard to set in advance, and I think the Dirichlet process is one way to get around this. For approximating the posterior distribution, a sampling method will be presented, namely collapsed Gibbs sampling. I hope you enjoy this post.

Dirichlet Process Poisson Mixture Model

1. Collapsed Gibbs sampling

First of all, we start from the joint distribution of the Poisson mixture model, which can be expressed as follows.

$$p(X,Z,\lambda,\pi) = p(X | \lambda, Z)p(Z|\pi)p(\pi)p(\lambda)$$

where $X$ is the observations, $Z$ is the latent variables, $\lambda$ is the parameter of the Poisson distribution with prior distribution $p(\lambda)$, and $\pi$ is the parameter of the categorical distribution with prior distribution $p(\pi)$. If you are interested in the Poisson mixture model itself, you can check the following link.
Poisson Mixture Model①

If your interest is only in how the observations are classified, you can marginalize out $\lambda$ and $\pi$. As long as conjugate priors are used for $p(\lambda)$ and $p(\pi)$, the marginalization is tractable.

$$p(X,Z) = \int p(X,Z,\lambda,\pi)d\lambda d \pi$$

We can approximate the posterior distribution $p(Z|X)$ by Gibbs sampling, drawing each $z_n$ from its conditional given all the other assignments.

$$p(z_n|X,Z_{\backslash n}) \propto p(x_n|X_{\backslash n},z_n,Z_{\backslash n})p(z_n|Z_{\backslash n})$$

For $p(x_n|X_{\backslash n},z_n,Z_{\backslash n})$, with a $\mathrm{Gam}(a,b)$ prior on each Poisson rate $\lambda_k$, marginalizing $\lambda_k$ gives a negative binomial distribution, $$p(x_n|X_{\backslash n},z_{n,k}=1,Z_{\backslash n}) = NB\left(x_n \,\Big|\, \sum_{n^{'}\neq n}z_{n^{'},k}x_{n^{'}}+a,\ \sum_{n^{'}\neq n}z_{n^{'},k}+b\right)$$
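As a quick sanity check (my addition), this Poisson-Gamma marginalization can be verified numerically. Note that scipy's nbinom uses a success-probability parametrization, so $NB(x\,|\,a,b)$ as written above corresponds to nbinom.pmf(x, a, b/(b+1)).

# Sketch: verify that integrating Poisson(x | lam) against Gamma(lam | a, rate=b)
# gives the negative binomial predictive in scipy's parametrization.
import numpy as np
from scipy.stats import poisson, gamma, nbinom
from scipy.integrate import quad

a_hat, b_hat, x_val = 3.0, 2.0, 4

lhs, _ = quad(lambda lam: poisson.pmf(x_val, lam) * gamma.pdf(lam, a_hat, scale=1.0 / b_hat),
              0, np.inf)
rhs = nbinom.pmf(x_val, a_hat, b_hat / (b_hat + 1.0))

print(lhs, rhs)   # the two values should agree up to numerical error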

As for $p(z_n|Z_{\backslash n})$, we want it to allow an infinite number of categories.

2. Infinite mixture model

Now let's think about a Dirichlet distribution with infinitely many categories. Let the parameters of a symmetric Dirichlet distribution be $\alpha / K$, where $K$ is the number of clusters. Then, marginalizing over $\pi$, $p(z_i = k|z_{1:n}^{\backslash i},\alpha)$ can be computed as follows,

$$\begin{eqnarray} p(z_{i}=k|z_{1:n}^{\backslash i}, \alpha) = \begin{cases}\frac{n_k^{\backslash i}+\frac{\alpha}{K}}{n-1 + \alpha} \ \ if\ k \in \kappa^+(z^{\backslash i}_{1:n})\\ \frac{\frac{\alpha}{K}}{n-1+\alpha} \ \ if\ k \notin \kappa^+(z^{\backslash i}_{1:n})\end{cases}\end{eqnarray}$$
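For completeness, here is why the first case holds (a standard Dirichlet-multinomial step that was skipped above): marginalizing $\pi$ against its Dirichlet posterior given the remaining assignments reduces to a posterior-mean computation,

$$p(z_{i}=k|z_{1:n}^{\backslash i}, \alpha) = \int \pi_k \, p(\pi|z_{1:n}^{\backslash i},\alpha)\, d\pi = \mathbb{E}[\pi_k|z_{1:n}^{\backslash i}] = \frac{n_k^{\backslash i}+\frac{\alpha}{K}}{\sum_{j=1}^{K}\left(n_j^{\backslash i}+\frac{\alpha}{K}\right)} = \frac{n_k^{\backslash i}+\frac{\alpha}{K}}{n-1+\alpha}$$

where $n_k^{\backslash i}$ is the number of points assigned to cluster $k$, excluding point $i$.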

Here $\kappa^{+}(z^{\backslash i}_{1:n})$ denotes the set of categories that have already been drawn. If $K \to \infty$, the probability of any particular new category goes to zero, which is not what we want. However, this probability is the same for every $k \notin \kappa^+(z^{\backslash i}_{1:n})$, so the probability of getting a new category (of any label) can be computed by multiplying it by the number of unused categories, $K - |\kappa^+(z^{\backslash i}_{1:n})|$.

$$p(z_i \notin \kappa^+(z^{\backslash i}_{1:n})|z^{\backslash i}_{1:n},\alpha) = \left(1- \frac{|\kappa^{+}(z^{\backslash i}_{1:n})|}{K}\right)\frac{\alpha}{n-1+\alpha}$$

If $K \to \infty $,
$$\begin{eqnarray} p(z_{i}=k|z_{1:n}^{\backslash i}, \alpha) = \begin{cases}\frac{n_k^{\backslash i}}{n-1+\alpha} \ \ if\ k \in \kappa^+(z^{\backslash i}_{1:n})\\ \frac{\alpha}{n-1+\alpha} \ \ if\ k \notin \kappa^+(z^{\backslash i}_{1:n})\end{cases}\end{eqnarray}$$
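The following small numerical check (my addition, not part of the original derivation) compares the finite-$K$ probabilities with these limiting values for a toy assignment vector with two occupied clusters; as $K$ grows, the finite-$K$ values approach the Dirichlet process ones.

# Sketch: finite-K conditional vs. its K -> infinity (Dirichlet process) limit.
import numpy as np

alpha = 4.0
counts = np.array([3, 5])   # n_k: sizes of the occupied clusters, excluding point i
n = counts.sum() + 1        # total number of points including point i

for K in [5, 50, 5000]:
    # finite-K: occupied clusters, and the total mass on all currently empty clusters
    p_existing = (counts + alpha / K) / (n - 1 + alpha)
    p_new = (1 - counts.shape[0] / K) * alpha / (n - 1 + alpha)
    print(K, p_existing, p_new)

# limiting values as K -> infinity
print('limit', counts / (n - 1 + alpha), alpha / (n - 1 + alpha))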

3. Sample data

Let's apply the infinite Poisson mixture model to toy data. The following sample data is created from two different Poisson distributions, so we expect the result of Gibbs sampling to show that there are two clusters.

In [1]:
from scipy.stats import poisson
from scipy.stats import gamma
from scipy.stats import dirichlet
from scipy.stats import multinomial
from scipy.stats import bernoulli
from scipy.stats import nbinom
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import cm
plt.style.use('dark_background')
In [2]:
# Sample from two different Poisson distributions
# whose parameters are 15 and 30, respectively.
sample_15 = poisson.rvs(15,size=200,random_state=10)
sample_30 = poisson.rvs(30,size=300,random_state=10)
X = np.concatenate([sample_15,sample_30])

freq = Counter(X)
x = np.array([x  for x in freq.keys()])
y = np.array([y  for y in freq.values()])

# Plot data drawn from two different poisson distribution
plt.bar(x,y,width=1.0)
plt.title('Sample from two different poisson distribution.')
plt.show()

4. Apply poisson mixture model with dirichlet process

In [3]:
MAX_iter = 200
sample_z = []

# parameters of the prior distributions
# alpha : concentration parameter of the Dirichlet process
# a, b  : shape and rate of the gamma prior on the Poisson rates
alpha = 4
a = 1
b = 1

debug = False

# initialize latent variables (all points start in cluster 0)
z = np.zeros(X.shape[0])

for it in range(MAX_iter):

    # resample z_n for every data point in turn
    for n in range(X.shape[0]):
        exist_cat = np.unique(np.delete(z, obj=n))
        z_prob = {}
        x_prob = {}
        if debug: print(n, 'exist_cat :', exist_cat)
        # cluster labels are assigned incrementally from 0.
        for k in exist_cat:
            # prior term: n_k / (N - 1 + alpha), where n_k is the cluster size without point n
            z_prob[k] = (np.delete(z, obj=n) == k).sum() / \
                        (X.shape[0] - 1 + alpha)

            # predictive term: negative binomial with the gamma posterior parameters
            a_hat = np.delete(X, obj=n)[np.delete(z, obj=n) == k].sum() + a
            b_hat = (np.delete(z, obj=n) == k).sum() + b
            temp = 1 / (b_hat + 1)
            x_prob[k] = nbinom.pmf(X[n], a_hat, 1 - temp)
        # probability of getting a new cluster
        # label of the new cluster is the smallest number not used by existing labels
        for c in range(exist_cat.shape[0] + 1):
            if c not in set(exist_cat):
                new_c = c
                break
        z_prob[new_c] = alpha / (X.shape[0] - 1 + alpha)
        temp = 1 / (b + 1)
        x_prob[new_c] = nbinom.pmf(X[n], a, 1 - temp)

        category = np.append(exist_cat, new_c)
        if debug: print(n, 'category :', category)
        pi_list = np.array([zp * xp for zp, xp in zip(z_prob.values(), x_prob.values())])
        pi = pi_list / pi_list.sum()
        if debug: print(n, 'pi :', pi)
        z[n] = category[np.argmax(multinomial.rvs(n=1, p=pi))]
        if debug: print(n, 'sample', z[:n+1])
    if debug: print(z)
    sample_z.append(z.copy())
In [6]:
sample_z = np.array(sample_z)[10:]
num_of_cluster = np.array([np.unique(s).shape[0] for s in sample_z])
count = Counter(num_of_cluster)

# frequency of each cluster count (1 to 4) over the retained samples
cluster_range = [1, 2, 3, 4]
heights = [count[c] for c in cluster_range]

plt.bar(x=range(len(cluster_range)), height=heights, tick_label=cluster_range)
plt.title('Number of clusters', fontsize=16)
plt.show()

Since the early samples can still be affected by the initialization, the first 10 samples were discarded as burn-in. As we expected, a cluster count of 2 is dominant across the remaining 190 samples. Interestingly, there are a few samples in which three clusters appear.
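As a quick follow-up sketch (my addition, reusing X, sample_z, a and b from the cells above), the Poisson rate of each cluster in the final Gibbs sample can be summarized by the mean of its Gamma posterior, $(\sum_n z_{n,k}x_n + a)/(\sum_n z_{n,k} + b)$; for this data the two estimates should come out near 15 and 30.

# Sketch: posterior-mean Poisson rate per cluster for the final Gibbs sample.
# Reuses X, sample_z, a and b defined in the cells above.
z_last = sample_z[-1]
for k in np.unique(z_last):
    mask = (z_last == k)
    a_hat = X[mask].sum() + a   # Gamma posterior shape
    b_hat = mask.sum() + b      # Gamma posterior rate
    print('cluster', k, 'size', mask.sum(), 'rate estimate', a_hat / b_hat)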

Thursday, October 24, 2019

Marginal Likelihood of Gaussian Process classification

This article is a continuation of the following post, "Gaussian process for classification".

Gaussian Process for classification

When you do model selection, or set hyperparameters by type 2 maximum likelihood, computing the marginal likelihood is essential. However, the marginal likelihood of Gaussian process classification is intractable due to the presence of a non-linear (sigmoid) function. Therefore, in this article I will show an example of calculating the marginal likelihood with the Laplace approximation.
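In symbols, type 2 maximum likelihood chooses the hyperparameters $\theta$ (for instance, the kernel parameters) that maximize the marginal likelihood,

$$\hat{\theta} = \mathop{\rm arg\,max}_{\theta}\ p(t_N|\theta) = \mathop{\rm arg\,max}_{\theta} \int p(t_N|a_N)p(a_N|\theta)da_N$$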

Marginal Likelihood of gaussian process classification

1. Why is it intractable??

The marginal likelihood of Gaussian process classification can be expressed as follows,

$$\begin{eqnarray} p(t_N) &=& \int p(t_N|a_N)p(a_N)da_N \\ &=& \int \prod_{n=1}^N \sigma(a_n)^{t_n}(1-\sigma(a_n))^{1-t_n} N(a_N)da_N \end{eqnarray} \tag{1}$$


Because of the presence of the sigmoid function, this integral is intractable, and an approximation is required.
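To make this concrete, here is a small numerical sketch (my addition): even the one-dimensional version of the integrand, a sigmoid times a Gaussian, has no closed-form integral and has to be evaluated by quadrature or approximated.

# Sketch: the 1-D sigmoid-Gaussian integral has no closed form;
# evaluate it by numerical quadrature for reference.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 0.5, 2.0
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

val, _ = quad(lambda a: sigmoid(a) * norm.pdf(a, loc=mu, scale=sigma), -np.inf, np.inf)
print(val)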

2. Laplace approximation

I will introduce the Laplace approximation in a nutshell.
Let's say $p(x)$ is a probability distribution whose normalization constant is intractable. Then consider $f(x)$ such that

$$p(x) = \frac{1}{Z} f(x)$$

where $Z = \int f(x)dx$ is the normalization term. We assume $f(x)$ itself is easy to evaluate. By applying a second-order Taylor expansion to the logarithm of $f(x)$ around its mode $x_0$, $f(x)$ can be approximated as follows,

$$ f(x) \approx f(x_0)\exp\left(-\frac{1}{2}(x-x_0)^{T}A(x-x_0)\right)$$

where $x_0$ satisfies $\left.\frac{d}{dx}\ln f(x)\right|_{x=x_0} = 0$ and $A = -\left.\nabla \nabla \ln f(x)\right|_{x=x_0}$ is the negative Hessian at the mode. Therefore,

$$\int f(x)dx \approx f(x_0)\frac{(2\pi)^{\frac{d}{2}}}{|A|^{\frac{1}{2}}}$$

We can apply this approximation to equation (1).
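Before doing so, here is a one-dimensional sanity check of the formula above (my own toy example, not from the original article): the normalizer of $f(x) = x^{a-1}e^{-bx}$ is exactly $\Gamma(a)/b^{a}$, and the Laplace formula recovers it approximately.

# Sketch: Laplace approximation of the normalizer of f(x) = x**(a-1) * exp(-b*x).
# Mode: x0 = (a-1)/b.  Negative second derivative of ln f at x0: A = b**2/(a-1).
import numpy as np
from scipy.special import gamma as gamma_fn

a_par, b_par = 10.0, 2.0
f = lambda x: x ** (a_par - 1) * np.exp(-b_par * x)

x0 = (a_par - 1) / b_par
A = b_par ** 2 / (a_par - 1)

laplace = f(x0) * np.sqrt(2 * np.pi / A)   # f(x0) * (2*pi)^(1/2) / |A|^(1/2)
exact = gamma_fn(a_par) / b_par ** a_par
print(laplace, exact)   # the two values should be close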

3. Apply to toy data

This time, Gaussian process classification will be applied to the following toy data.

In [1]:
from gp_classifier import gp_classifier
import kernel
import numpy as np
from scipy.stats import multivariate_normal,uniform
import math
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('dark_background')
In [2]:
mean_list = np.array([[-2,1],[1,-1.5],[1.5,0],
                      [0,1.5],[0,3],[3,3],[3,-3.5],
                      [-3,4],[-1,-4],[-4,-4]])

for i,mean in enumerate(mean_list):
    # sample number 
    samp_num = math.floor(uniform.rvs(loc=10,scale=15))
    # covariance
    cov_val = uniform.rvs(loc=-0.6,scale=0.6)

    if i == 0:
        # label
        labels = np.full(samp_num,0)
        # data
        X = multivariate_normal.rvs(
        mean=mean,cov=[[1.5,cov_val],[cov_val,1.5]],
        size=samp_num)
    else:
        # label
        labels = np.concatenate([labels,np.full(samp_num,i%2)])
        # data
        X = np.concatenate([X,multivariate_normal.rvs(
            mean=mean,cov=[[0.2,cov_val],[cov_val,0.2]],
            size=samp_num)])
# plot toy data        
for label, col in zip([0,1],['blue','red']):
    mask = label == labels
    plt.scatter(X[mask,0], X[mask,1],color=col)

In order to compare values of the marginal likelihood, four Gaussian process classifiers with different kernel functions are implemented here,

  • linear kernel
  • gaussian kernel
  • periodical kernel
  • linear and gaussian kernel

You can check the code utilized in this article via github.
gp_classifier.py
kernel.py
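Since kernel.py itself is not reproduced in this post, the snippet below is only an illustration of the standard forms these kernels usually take; the function names and hyperparameter values are placeholders, not the actual implementation in kernel.py.

# Illustration only: common textbook forms of the kernels compared above.
import numpy as np

def linear_kernel_demo(x1, x2):
    # k(x, x') = x . x'
    return np.dot(x1, x2)

def gauss_kernel_demo(x1, x2, length_scale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2))
    d = np.asarray(x1) - np.asarray(x2)
    return np.exp(-np.dot(d, d) / (2.0 * length_scale ** 2))

def periodic_kernel_demo(x1, x2, period=1.0, length_scale=1.0):
    # k(x, x') = exp(-2 * sin^2(pi * ||x - x'|| / period) / length_scale^2)
    r = np.linalg.norm(np.asarray(x1) - np.asarray(x2))
    return np.exp(-2.0 * np.sin(np.pi * r / period) ** 2 / length_scale ** 2)

def linear_gauss_kernel_demo(x1, x2):
    # sum of the linear and Gaussian kernels
    return linear_kernel_demo(x1, x2) + gauss_kernel_demo(x1, x2)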

In [3]:
import gp_classifier
import importlib
importlib.reload(gp_classifier)
from gp_classifier import gp_classifier
In [4]:
plt.figure(figsize=(14,14))
kernel_dict = {'linear kernel':kernel.linear_kernel,
               'periodical kernel':kernel.periodic_kernel,
               'gauss kernel':kernel.gauss_kernel,
              'linear+gauss kernel': kernel.linear_gauss_kernel}

for i,(kernel_name, kernel_func) in enumerate(kernel_dict.items()):
    plt.subplot(2,2,i+1)
    # create class of gaussian process classifier
    gclf = gp_classifier(kernel_func=kernel_func)
    gclf.fit(X,labels)

    x = np.linspace(-4,4,101)
    y = np.linspace(-4,5,101)

    xx, yy = np.meshgrid(x,y)

    zz = gclf.predict_proba(np.vstack([xx.ravel(),yy.ravel()]).T)

    ctf = plt.contourf(xx,yy,zz[:,1].reshape(xx.shape),cmap='coolwarm')
    plt.colorbar(ctf)

    # Plot toy data
    for label, col in zip(np.unique(labels),['blue','red']):
        mask = np.where(labels == label)[0]
        plt.scatter(X[mask,0],X[mask,1],color=col)

    plt.xlim(-4,4)
    plt.ylim(-4,5)

    plt.title('{} log mlh = {:.2f}'.format(kernel_name,gclf.get_marginal_likelihood()),
              fontsize=14)
plt.show()

The Gaussian kernel has the highest log marginal likelihood among them. The periodical kernel seems to have difficulty classifying this data set.

4. Reference

  • Pattern Recognition and Machine Learning, Christopher M. Bishop