Tuesday, May 28, 2019

Bayesian approach for missing values

In almost every project I've worked on as a data scientist, I've had no choice but to deal with missing values; real data almost always comes with them. Imputation and deletion are the approaches usually discussed in this context, but in this article I'd like to share another way: a probabilistic Bayesian approach. It is based on the following quite informative and intriguing article.

Supervised learning when the data has missing values (データに欠損がある場合の教師あり学習)

Basically, I implemented what is written in that post and added a comparison with mean imputation on top of it. I'll be glad if you enjoy this article.

0. Mathematical theory

First of all, we're gonna deal with binary classification, and logistic regression is applied as the model. Let $y_n$ be the target variable: $$p(y_n|w, x_n) = Ber(y_n|\sigma(w^Tx_n))$$
We apply a prior distribution to $w$ as follows: $$p(w) = N(w|0, \Sigma_w)$$
We also apply a prior distribution to $X$: $$p(X) = \prod^{N}_{n=1}\prod^{D}_{d=1}N(x_{n,d}|0, \sigma^2_x)$$
The joint distribution is then $$p(Y, w, X) = p(Y|w,X)p(w)p(X)$$ In this article we are dealing with missing values, so we split the dataset into observed and missing parts, $X = \{X_o, X_m\}$. The posterior distribution becomes
$$p(w, X_m | Y, X_o) \propto p(Y|w, X_o, X_m)p(w)p(X_m)$$
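
For the sampler we only need this posterior up to its normalizing constant. Taking the log of the right-hand side gives, up to an additive constant, $$\ln \tilde{p}(w, X_m) = \sum^{N}_{n=1}\left[y_n \ln \sigma(w^Tx_n) + (1-y_n)\ln\left(1-\sigma(w^Tx_n)\right)\right] - \frac{1}{2}w^T\Sigma_w^{-1}w - \frac{1}{2\sigma^2_x}\sum_{n,d}x^2_{n,d}$$ which is exactly what the function ln_p_tilde computes later.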

1. Preparation of toy data

First of all, we wanna prepare some toy data here. It's sampled according to the model we set up above.

In [24]:
import pandas as pd
import numpy as np
import numpy.linalg as LA
from scipy.stats import bernoulli
from scipy.stats import norm
from scipy.stats import multivariate_normal
from scipy.stats import uniform
import matplotlib.pyplot as plt
%matplotlib inline
In [25]:
# Required function
def sigmoid(X):
    """
    Map the values with the sigmoid function
    """
    return 1 / (1 + np.e ** (-X))
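
As a side note (my addition, not in the referenced post): np.e ** (-X) can overflow for very negative inputs, which makes numpy emit a warning. If that ever becomes an issue, scipy.special.expit is a numerically stable drop-in replacement:

In [ ]:
# Optional: scipy's numerically stable sigmoid agrees with the one above
from scipy.special import expit
assert np.isclose(sigmoid(0.5), expit(0.5))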
In [18]:
# Dimension of data
D=2
# number of observations
N=30
# parameters for the prior distributions
cov_w = 20 * np.eye(D)
mean_w = np.zeros(D)
sig_x = 1
mean_x = 0 
# Sample data from the generative model
# (the entries of X are assumed independent of each other)
X_raw=norm.rvs(loc=mean_x, scale=sig_x, size=N*D,
               random_state=6).reshape(N,D)
w = multivariate_normal.rvs(mean=mean_w,
                            cov=cov_w,random_state=5)
labels = bernoulli.rvs(p=sigmoid(X_raw@w.reshape(-1,1)), 
                       random_state=7).ravel()

# plot toy data
for i, col in zip([0,1],['blue','red']):
    plt.scatter(X_raw[labels==i][:,0],X_raw[labels==i][:,1],color=col)

plt.title('Sample from model',fontsize=14)
plt.xlim(-3,3)
plt.ylim(-3,3)
plt.show()

2. Experimental data

Next, we're gonna create the missing values. Here I set the missing ratio to 0.5. I also wanna compare the Bayesian approach with the imputation and deletion approaches, so the respective datasets are created here too :)

In [21]:
# create missing data
X_miss = X_raw.copy()
ratio = 0.5
X_miss = np.array([
    np.nan if ratio > uniform.rvs(loc=0,scale=1) else x 
    for x in X_raw.ravel()
]).reshape(X_miss.shape)  

# drop rows that contain missing values
X_del = X_miss[~np.any(np.isnan(X_miss),axis=1)]
labels_del = labels[~np.any(np.isnan(X_miss),axis=1)]

# impute with mean value
col0_mean = X_miss[:,0][~np.isnan(X_miss[:,0])].mean()
col1_mean = X_miss[:,1][~np.isnan(X_miss[:,1])].mean()
X_mean = X_miss.copy() 
X_mean[np.isnan(X_mean[:,0]),0] = col0_mean
X_mean[np.isnan(X_mean[:,1]),1] = col1_mean
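
Since each entry is dropped independently with probability 0.5, the realized missing fraction only approximates the target ratio; a quick sanity check (my addition):

In [ ]:
# Fraction of entries of X that actually ended up missing (roughly 0.5)
print(np.isnan(X_miss).mean())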

3. Metropolis-Hastings method

Since the posterior distribution of $w$ and $X_m$ is intractable, an approximation or a sampling method has to be applied. Here I'm gonna apply the Metropolis-Hastings method, which is one of the MCMC algorithms.
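
Concretely, with a symmetric proposal, a move from the current state $\theta$ to a proposed state $\theta'$ is accepted with probability $$a = \min\left(1, \frac{\tilde{p}(\theta')}{\tilde{p}(\theta)}\right) = \exp\left(\min\left(0, \ln\tilde{p}(\theta') - \ln\tilde{p}(\theta)\right)\right)$$ which is the quantity a computed inside MH below; $w$ and the missing values $X_m$ are updated alternately.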

In [26]:
def MH(X, y, D, sigma_x, cov_w, sig_prop, iter_num, burnin):
    """
    Sample w and the missing entries of X from the posterior with a
    random-walk Metropolis-Hastings sampler.
    sig_prop is the standard deviation of the Gaussian proposal;
    the first `burnin` samples are discarded.
    """
    N_miss = np.isnan(X).sum()
    # Initial sample
    if N_miss != 0:
        X_miss = multivariate_normal.rvs(
            mean=np.zeros(N_miss), 
            cov=(sig_prop**2) * np.eye(N_miss)
        )
    else:
        X_miss = 0
    
    W = multivariate_normal.rvs(
        mean=np.zeros(D),
        cov=(sig_prop ** 2) * np.eye(D)
    )
    # Arrays to store the samples
    w_sample = np.empty((iter_num,D))
    x_sample = np.empty((iter_num,N_miss))
    
    for i in range(iter_num):
        # Propose W from a random walk centred at the current state
        # (the proposal is symmetric, so the simple density ratio is valid)
        W_new = multivariate_normal.rvs(
            mean=W,
            cov=(sig_prop **2)* np.eye(D)
        ) 
        # Compute acceptance probability
        X_temp = insert_X(X,X_miss)
        a = np.exp(
            min(
                (ln_p_tilde(X_temp, y, W_new, cov_w, sigma_x,D) -
                 ln_p_tilde(X_temp, y, W, cov_w, sigma_x,D)), 0
            )
        )
        # accept or reject
        W = W_new if uniform.rvs(loc=0,scale=1) < a else W
        w_sample[i] = W
        
        # Sampling X
        if N_miss != 0:
            X_miss_new = multivariate_normal.rvs(
                mean=X_miss,
                cov=(sig_prop**2) * np.eye(N_miss)
            )
            # Compute acceptance probability
            X_temp1 = insert_X(X, X_miss_new)
            X_temp2 = insert_X(X, X_miss)
            a = np.exp(
                min(
                    (ln_p_tilde(X_temp1, y, W, cov_w, sigma_x,D) -
                     ln_p_tilde(X_temp2, y, W, cov_w, sigma_x,D)), 0
                )
            )
            # accept or reject
            X_miss = X_miss_new if uniform.rvs(loc=0,scale=1) < a else X_miss
            x_sample[i] = X_miss
        
    # discard the burn-in samples
    w_sample = w_sample[burnin:,:]
    x_sample = x_sample[burnin:,:]
        
    return w_sample, x_sample
    
def insert_X(X,X_miss):
    """
    Fill the missing entries of X with the sampled values
    """
    X = X.copy()
    X[np.isnan(X)] = X_miss
    return X
    
def ln_p_tilde(X, y, w, cov_w, sigma_x, D):
    """
    Compute the log of the unnormalized posterior:
    log likelihood + log prior of w + log prior of X
    """
    p = sigmoid(X@w.reshape(-1,1))
    return (y.reshape(-1,1) * np.log(p) +
            (1 - y).reshape(-1,1) * np.log(1 - p)).sum() -\
            0.5 * w.reshape(1,-1)@LA.inv(cov_w)@w.reshape(-1,1) -\
            0.5 * 1/(sigma_x ** 2) * ((X**2).sum())
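
To make the helper's behavior concrete, here is a tiny hypothetical check (not from the referenced post): insert_X fills the NaN entries in row-major (ravel) order.

In [ ]:
# NaNs are filled in the order they appear when X is raveled
X_demo = np.array([[1.0, np.nan], [np.nan, 2.0]])
print(insert_X(X_demo, np.array([10.0, 20.0])))
# -> [[ 1. 10.]
#     [20.  2.]]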

4. Comparison

This is the most fun part of this article! Now is the time to compare the Bayesian approach with mean imputation and deletion.

In [23]:
# parameters for Markov chain Monte Carlo
# sigma of the proposal distribution
sig_prop = 1
burnin = 1000
max_iter = 3000

titles = ['Full data (N={})'.format(X_raw.shape[0]), 
          'Incomplete data (N={})'.format(X_miss.shape[0]),
          'Reduced data (N={})'.format(X_del.shape[0]),
          'Imputed with mean data (N={})'.format(X_mean.shape[0])]

sample_w_miss, sample_x_miss = MH(
    X=X_miss, y=labels, D=D, sigma_x=sig_x, cov_w=cov_w, 
    sig_prop=sig_prop, iter_num=max_iter, burnin=burnin
)

sample_w_nomiss, sample_x_nomiss = MH(
    X=X_raw, y=labels, D=D, sigma_x=sig_x, cov_w=cov_w, 
    sig_prop=sig_prop, iter_num=max_iter, burnin=burnin
)

sample_w_del, sample_x_del = MH(
    X=X_del, y=labels_del, D=D, sigma_x=sig_x, cov_w=cov_w, 
    sig_prop=sig_prop, iter_num=max_iter, burnin=burnin
)

sample_w_mean, sample_x_mean = MH(
    X=X_mean, y=labels, D=D, sigma_x=sig_x, cov_w=cov_w, 
    sig_prop=sig_prop, iter_num=max_iter, burnin=burnin
)

plt.figure(figsize=(15,14))
for num, (X, sample_w, sample_x) in enumerate(
    zip(
        [X_raw, X_miss, X_del, X_mean],
        [sample_w_nomiss, sample_w_miss, sample_w_del, sample_w_mean],
        [sample_x_nomiss, sample_x_miss, sample_x_del, sample_x_mean]
    )
):

    # Posterior mean / std of the missing entries from the MCMC samples
    idx_nan = np.where(np.any(np.isnan(X),axis=1))[0]
    X_exp = insert_X(X, sample_x.mean(axis=0))
    # 0 * NaN is still NaN, so only the missing entries receive a std here
    X_std = 0*X.copy()
    X_std = insert_X(X_std, np.std(sample_x, axis=0))

    # Plot prediction
    x = np.linspace(-2.5, 2.5,101)
    y = np.linspace(-2.5, 2.5,101)

    xx, yy = np.meshgrid(x,y)

    # posterior predictive: average the sigmoid over the sampled w
    labels_pred = np.array([
        sigmoid(X=np.vstack([xx.ravel(), yy.ravel()]).T@w.reshape((-1,1)))
        for w in sample_w
    ]).mean(axis=0)
    
    plt.subplot(2,2,num+1)
    # plot prediction
    plt.colorbar(plt.contourf(xx,yy,labels_pred.reshape(xx.shape),cmap='coolwarm'))

    # plot points with missing values; bars show +/- 1 posterior std
    for i in idx_nan:
        col, alpha = ('red', 0.4) if labels[i] == 1 else ('blue', 0.5)
        plt.plot([X_exp[i,0] - X_std[i,0], X_exp[i,0] + X_std[i,0]],
                [X_exp[i,1], X_exp[i,1]], color=col, alpha=alpha, linewidth=5)
        plt.plot([X_exp[i,0], X_exp[i,0]],
                [X_exp[i,1] - X_std[i,1], X_exp[i,1] + X_std[i,1]], color=col, alpha=alpha, linewidth=5)
        plt.scatter(X_exp[i,0], X_exp[i,1], color=col, marker='x')

    # plot points without missing value
    mask = np.where(~np.any(np.isnan(X),axis=1))[0]
    for i, col in zip([0,1],['blue','red']):
        plt.scatter(X[mask][labels[mask]==i][:,0],X[mask][labels[mask]==i][:,1],color=col)

    plt.xlim(-2.5,2.5)
    plt.ylim(-2.5,2.5)
    plt.title(titles[num], fontsize=14)
    
plt.show()
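
Before reading too much into the decision surfaces, it's worth eyeballing whether the chains have mixed. A minimal sketch (my addition) plotting the post-burn-in trace of $w$ for the missing-data run:

In [ ]:
# Trace plot of w after burn-in; a flat, noisy band suggests decent mixing
plt.figure(figsize=(8,3))
for d in range(D):
    plt.plot(sample_w_miss[:,d], alpha=0.7, label='w[{}]'.format(d))
plt.xlabel('iteration (after burn-in)')
plt.legend()
plt.show()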

The plot in the upper left is the result for the full data, i.e. with no missing values. The upper right one gives the result most similar to the upper left compared to the others :) The decision surface for the reduced data differs the most. The data imputed with the mean also looks a little strange: the imputed points tend to lie on a line (of course, they were imputed with the mean value), as if the data had been fabricated.
