Saturday, July 14, 2018

Poisson Mixture Model ①

In this article, I'm gonna work on "Poisson Mixture Model". I believe mixture model is one of the crucial concept to deal with data in real world. I would be glad if you enjoy this article :)

0. Introduction

First of all, we have to discuss about necessity of "mixture model". Basic distribution such as Gaussian distribution or Poisson distributin has important analytical properties. However, simultaneously, they suffer from significant limitations when it comes to real dataset. Let me show you an example. Let's say there are data drawn from two different poisson distribution. As you can see, there are two dominat clumps in distribution. Needless to say, it likely happen in real world. Nevertheless, simple Poisson distribution can't capture this character.

In [21]:
from scipy.stats import poisson
from scipy.stats import gamma
from scipy.stats import dirichlet
from scipy.stats import multinomial
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline
import collections
import pandas as pd 
In [2]:
# Sample from two different poisson distribution.
# whose parameter is 15, 30 respectively.
sample_15 = poisson.rvs(15,size=200)
sample_30 = poisson.rvs(30,size=300)
sample_freq = np.concatenate([sample_15,sample_30])

freq = collections.Counter(sample_freq)
x = np.array([x  for x in freq.keys()])
y = np.array([y  for y in freq.values()])

# Plot data drawn from two different poisson distribution
plt.bar(x,y,width=1.0)
plt.title('Sample from two different poisson distribution.')
plt.show()

1. Model

I would be glad if you understand the necessity of Mixture model. Now we feel like considering the proccess of yeilding these data as following.

  1. Data has two clumps, each clumps has been drawn from two different Poisson distribution. By the way let's say cluster instead of clumps. Suppose blend ratio of two clusters is $\pi = (\pi_1,\pi_2)^T$ such that $\pi \in (0, 1)$, $\sum_k^2 \pi_k =1$. $\pi$ is drawn from Dirichlet distribution as a prior distribution $Dir(\pi|\alpha)$. If dirichlet distribution stuck in your mind, you can check this article Dirichlet Distribution .
  2. Each cluster has own parameter $\lambda_1$ and $\lambda_2$ of Possin distribution. These paramaeters are generated from gamma distribution as a prior distribution $Gam(\lambda|a,b)$.
  3. Latent variable $s$ is drawn from Categorical distribution $Cat(s|\pi)$ . We assume $s$ is one hot vector. This latent variable is utilized to determine which cluster the data belongs to.
  4. Data is drawn from the poisson distribution which is opted by latent variable $s$.$$x \sim \prod_k^2Poi(x|\lambda_k)^{s_{k}}$$

Onece you set up the model, I highly recommend you to sample from it. The following is the implementation of sampling.

In [46]:
class poiss_mix():
    
    def __init__(self,a,b,alpha,k_num):
        
        # The number of cluster
        self.k_num = k_num
        
        # Hyper parameter of gamma distribution
        self.a=a
        self.b=b
        
        # Hyper parameter for diricret distibution
        # Assume k_num dimention
        self.alpha=alpha
        
        # compute parameter of poisson distribution
        self.lam = gamma.rvs(a=self.a,size=self.k_num)
        
        # compute parameter of categorical distribution
        self.pi = dirichlet.rvs(alpha=self.alpha)
        
    def sample(self):
        
        # compute latent variable
        latent_val = multinomial.rvs(n=1,p=self.pi[0], size=1)
        
        sample = 1
        
        for val, lamb in zip(latent_val[0],self.lam):
            # latent variable is one hot vector.
            sample = sample * (poisson.rvs(lamb,size=1)[0] ** val)

        return sample 
In [9]:
poi = poiss_mix(a=30,b=1,alpha=[10,5],k_num=2)
samples=[]
# Tentatively, we're gonna sample 1000 data from the model 
# we've created.
for i in range(1000):
    samples.append(poi.sample())
    
sample_count = np.array([ [key,value] for key, value in 
                                             collections.Counter(samples).items()])

# Plot the sampling as a bar graph.
plt.bar(sample_count[:,0],sample_count[:,1],width=1)
Out[9]:
<Container object of 45 artists>

It seems this distribution has two clumps. Nonetheless it's actually it's coincidence. Since I assign same value to hyper parameter $a$, $b$ for each cluster. To put it simpy, I set $a$ and $b$ is the same for each cluster in the above explanation. However, don't worry, after computing posterior distribution, you can see assigning different value to $a$ and $b$ for each cluster.
Since it's gonna be too long to write continuation in an article, I think this is good point to wrapp up :) I'll write an article of continuation very soon. See you there !

No comments:

Post a Comment