Monday, May 28, 2018

Properties of "expected value"

When I worked on "variational inference" of "Bayesian machine learning" using ''mean-field approximation", because of inadequate understanding "expected value", I was stuck for several hours... Then I thought, there might be someone in the world who has the same adversity. Hence I will share with you regarding properties of "expected value".

Properties of "expected value "

Just in case, before getting into properties of expected value we should be on the same page on what the expected value is .
Let's say that we have random variable $x$, and $f(x)$ as a function of $x$. Expected value of random variable (continuous) is presented as below.
$$ < f(x)> _{p(x)} = \int f(x)p(x)dx $$

1. Linearity

When it comes linearity, you can derive as below with comparative ease :) $$\begin{eqnarray}<f(x) + g(x)>_{p(x)} &=& \int \{f(x)+g(x)\}p(x)dx\\ &=& \int f(x)p(x)dx + \int g(x)p(x)dx\\ &=& <f(x)>_{p(x)} + <g(x)>_{p(x)}\end{eqnarray}$$ $$\begin{eqnarray}<cf(x)>_{p(x)} &=& \int cf(x)p(x)dx\\ &=& c\int f(x)p(x)dx\\ &=& c<f(x)>\ \ \ where \ c \in \mathcal{R} \\ \end{eqnarray}$$

2. Multiple random variable 1

If there are maultiple random variable ($z_1, z_2, z_3$), expected value can be calculated as below. $$\begin{eqnarray} <f(z_1, z_2, z_3)>_{p(z_1)p(z_2)p(z_3)} &=& <\int p(z_1)f(z_1, z_2, z_3)dz_1>_{p(z_2)p(z_3)}\\ &=&<(f\ of\ z_2\ and\ z_3) >_{p(z_2)p(z_3)}\\ &=&<\int p(z_2)(f\ of\ z_2\ and\ z_3)dz_2 >_{p(z_3)}\\ &=&<(f\ of\ z_3) >_{p(z_3)}\\ &=& \int p(z_3)(f\ of\ z_3)dz_3\\ &=& Final\ expected\ value \end{eqnarray}$$

3. Multiple random variable 2

This is a pattern where $f$ is a function of only $z_1$, whereas ramdom variables are $z_1,z_2,z_3$, $$\begin{eqnarray} <f(z_1)>_{p(z_1)p(z_2)p(z_3)} &=& <\int p(z_2)f(z_1)dz_2>_{p(z_1)p(z_3)}\\ &=&<f(z_1)\int p(z_2)dz_2>_{p(z_1)p(z_3)}\\ &=&<f(z_1)>_{p(z_1)p(z_3)}\\ &=&<\int p(z_3)f(z_1)dz_3>_{p(z_1)}\\ &=&<f(z_1)\int p(z_3)dz_3>_{p(z_1)}\\ &=& <f(z_1)>_{p(z_1)}\\ \end{eqnarray}$$

Sunday, May 27, 2018

Pandas Tips① ~How to add rows by splitting data using separator ??~

Pandas Tips① ~How to add rows by splitting data using separator ??~

In data munging, or calculate statistics, or whatever else, "dataframe" in Python is really convenient and efficient. I'm really fond of using pandas. Nevertheless, I'm inclined to forget some useful technic. Thereby, I've decided to write it down here. First post is regarding "How to add rows by splitting data using separator??" I hope you will enjoy this :)

For instance, let us assume there is data as bellow.

In [1]:
import pandas as pd
In [2]:
purchase_data = pd.DataFrame([['john','apple|orange'],
                             ['robert','apple|orange|blueberry'],
                             ['david','banana']],columns=['name','fruit'])
purchase_data.head()
Out[2]:
name fruit
0 john apple|orange
1 robert apple|orange|blueberry
2 david banana

However, in some cases, the dataframe equipped with rows by fuit is required. You can create new dataframe as bellow.

In [3]:
column_list = ['name','fruit']
# Prepare empty data frame
purchase_data_new = pd.DataFrame(columns=column_list)

for index,row in purchase_data.iterrows():
    # When creating dataframe with dictionary, order is not guaranteed. 
    # Hence I specify "columns" parameter.
    purchase_data_temp = pd.DataFrame({row.index[0]:row.values[0],
                                      row.index[1]:row.values[1].split('|')},
                                      columns=column_list)
    purchase_data_new = purchase_data_new.append(purchase_data_temp)

purchase_data_new
Out[3]:
name fruit
0 john apple
1 john orange
0 robert apple
1 robert orange
2 robert blueberry
0 david banana

Should you have a question regarding the part where creating dataframe with dictionary, I'll share with you a little :)
Basically, you can create dataframe with dictionary as following. (By the way, you can see that this method doesn't guarranty order of columns)

In [4]:
df_temp = pd.DataFrame({'name':['hiroshi'],'fruit':['kiwi']})
df_temp.head()
Out[4]:
fruit name
0 kiwi hiroshi

When you specify value of dictionary as string not list, it means somewhat of constant value.

In [5]:
df_temp2 = pd.DataFrame({'name':'hiroshi','fruit':['kiwi','apple','banana']})
df_temp2.head()
Out[5]:
fruit name
0 kiwi hiroshi
1 apple hiroshi
2 banana hiroshi

Then, the point is that "split" method returns "list" type. Therefore, in david's turn in above example, there is no error :)

In [6]:
# Unfortunatelly, This gonna be error...
# df_temp3 = pd.DataFrame({'name':'hiroshi','fruit':'kiwi'})
df_temp3 = pd.DataFrame({'name':'hiroshi','fruit':'kiwi'.split('|')})
df_temp3.head()
Out[6]:
fruit name
0 kiwi hiroshi

You can see even though there is no '|' in fruit value, "split" returns "list" type as bellow.

In [7]:
'kiwi'.split('|')
Out[7]:
['kiwi']

If anyone know more efficient know, It's gonna be really grateful to let me know ! :)

Saturday, May 26, 2018

Mutual Information

Mutual Information

In this article, I will share with you regarding "Mutual Information". I believe this is one of crucial concept when working on data science, machine learning, deep learning and list goes on and on. I would be glad if you enjoy this article :)

1. Conditional Entropy

"What is Entropy" was discussed in the following link.
https://hiroshiu.blogspot.com/2018/04/what-is-entropy.html
Before getting into the "Mutual Information", we have to wrap our head around "Conditional Entropy".
"Conditional Entropy" $H(x)$ of two discrete random variables $X(x_1,x_2,\cdots,x_n), Y(y_1,y_2,\cdots,y_m)$ is captured in the followings $$H(X|y_{1}) = -\sum_{i=1}^{n}P(x_{i}|y_{1})\log(P(x_{i}|y_{1}))$$
Therefore,
\begin{eqnarray} H(X|Y) &=& \sum_{j=1}^{m}P(y_{j})\sum_{i=1}^{n}P(X_{i}|y_{j})\log({P(X_{i}|y_{j})})\\ &=&\sum_{j=1}^{m}\sum_{i=1}^{n}P(x_{i} \cap y_{j})\log(\frac{P(x_{i}\cap y_{j})}{P(y_{j})}) \end{eqnarray}

2. Mutual Information

Mutual Information is $H(X) - H(X|Y)$, which measures how much knowing one of vaiables reduce uncertainty of others. $$I(X;Y) = H(X) - H(X|Y) = \sum_{y\in{Y}}\sum_{x\in{X}}P(X \cap Y)\log(\frac{P(X \cap Y)}{P(X)P(Y)})$$

3. Implementation

There is useful library of scikit-learn. From now on, I'm trying to calculate mutual information with that library.

In [27]:
from sklearn import datasets
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
import numpy as np
In [41]:
iris_dataset = datasets.load_iris()
iris_data = iris_dataset.data
iris_label = iris_dataset.target
In [13]:
# Expranatory variable
iris_data[0:3,:]
Out[13]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])
In [42]:
# Responsible variable
iris_label[0:3]
Out[42]:
array([0, 0, 0])

You can check value of mutual information with "mutual_info_classif" function.
2th and 3th expranatory variable seems to have higher value than others.

In [40]:
mutual_info_classif(X=iris_data,y=iris_label)
Out[40]:
array([ 0.48958131,  0.24431716,  0.98399648,  1.00119776])

Now you can obtain new expranatory variable which consists of high mutual information. Here I'm gonna extract 2 expranatory variable out of 4 with "SelectKBest" function.

In [21]:
selecter = SelectKBest(score_func=mutual_info_classif, k=2)
selecter_iris = selecter.fit(iris_data,iris_label)
In [45]:
new_iris_data = selecter_iris.fit_transform(iris_data,iris_label)
In [48]:
print('shape of new_iris_data',new_iris_data.shape)
new_iris_data[0:3,:]
shape of new_iris_data (150, 2)
Out[48]:
array([[ 1.4,  0.2],
       [ 1.4,  0.2],
       [ 1.3,  0.2]])

Now I can see explanatory variable with high mutual information were extracted :)
You can also check which explanatory variable is selected as True-False numpy array by invoking "get_support()" method.

In [49]:
support = selecter_iris.get_support()
print('support', support)
np.where(support == True)
support [False False  True  True]
Out[49]:
(array([2, 3]),)

You can check another "Select~" method here.
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
According the link above, for continuous variable, "mutual_info_regression" seems to be preferable.

Sunday, May 20, 2018

Fundamental understanding of "Logistic Regression"

Fundamental understanding of "Logistic Regression".

I've just looked back on "Logistic Regression". This is somewhat of memo for the time being:)

1. Basic logic behind the scene of "logistic regression".

In this article I will deal with only binary classification which is $C_1$, and $C_2$.
First of all, let us think about $p(C_1 | x)$,
$$\begin{eqnarray}p(C_1|x) &=& \frac{p(x|C_1)p(C_1)}{p(x)}\\ &=& \frac{p(x|C_1)p(C_1)}{p(x,C_1) + p(x, C_2)}\\ &=& \frac{p(x|C_1)p(C_1)}{p(x,C_1) + p(x, C_2)}\\ &=& \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x | C_2)p(C_2)}\end{eqnarray}$$
Following that, let $a$ be
$$a = \log\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}$$
We call $\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)} = \frac{p(C_1|x)p(x)}{p(C_2|x)p(x)} = \frac{p(C_1|x)}{p(C_2|x)}$ as "odds". Logarithm of "Odds" is called as "log odds". Now, following is mathmaticaly trivial. $$e^{-a} = \frac{p(C_2|x)p(x)}{p(C_1|x)p(x)}$$
Therefore, $p(C_1|x)$ can be expressed as bellow $$p(C_1|x) = \frac{1}{1+ e^{-a}} = \sigma(a)$$
We can call the function $\sigma(a)$ as "Logistic Sigmoid Function". The "Inverse function" of "Logistic Sigmoid Function" is called "logit function".
$$a = \log \frac{\sigma(a)}{1 - \sigma(a)} = \log\frac{p(C_1|x)}{p(C_2|x)}$$ Let $w$ be $(w_0, w_1) $, $x$ be $(1, x)$, "Logistic regression" depict probability of $p(1|x)$ as
$$p(1|x) = \frac{1}{1 + \exp(w_0 + w_1x)}$$
Now let us assume $a$ is $w_0 + w_1x$, needless to say, this is "logistic sigmoid function". 

Note : This function is "non-linear function". In essense, <span style = "text-decoration:underline">we transform linear function $a$ to non-linear funciton with "logistic sigmoid function" !</span>
So, Let us think about "odds" with $w$ and $x$.
$$a = w^Tx = \log\frac{p(C_1|x)}{p(C_2|x)}$$
Hence,"odds" is expressed as, $$\frac{p(C_1|x)}{p(C_2|x)}=e^{w^Tx}$$

2. How the change with respect to $x$ affect to odds??

Now we think about how the change in $x$ affect to odds. Let us think about $\tilde{x} = (1, x+1)^T$, "odds ratio" would be,
$$\begin{eqnarray}\frac{\frac{p(C_1|\tilde{x})}{p(C_2|\tilde{x})}}{\frac{p(C_1|x)}{p(C_1|x)}} &=& \frac{exp(w^T\tilde{x})}{exp(w^T{x})} \\ &=& exp(w_1) \end{eqnarray}$$
Consequently, addition of 1 in x will change odds rate of $exp(w_1)$.

Wednesday, May 16, 2018

Why elementary row operation of Matrix doesn't change the rank of the Matrix ?? (Linear Algebra)

Why elementary row operation of Matrix doesn't change rank of the Matrix ?? (Mathematics)

"Elementary row operation" consists of three following activity.

  • Row switching
    A row within the matrix can be switched wtih another row.
    $R_{i} \leftrightarrow R_{j}$

  • Row multiplication
    Each element in a row can be multiplied by a non-zero constant.
    $kR_{i} \rightarrow R_{i}$ where $k \neq 0$

  • Row addition
    Row can be replaced by the sum of that row and a multiple another row.
    $R_{i} + R_{j} \rightarrow R_{i}$ where $i \neq j$

Source :https://en.wikipedia.org/wiki/Rank_(linear_algebra)

Now I'd like to get into why "Elementary row operation of Matrix" doesn't change "rank" of that matrix.
First of all, I have to make it clear what the "rank" of matirx mean ??
One of the definition of "rank" of matrix $A$ is :

Dimension of the image of the linear transformation that is given by multiplication by A.

Therefore, here I'd like to talk about why "Elementary row operation " doesn't change the dimention of image that is given by multiplication by A.
In a word or two, "Elementary row operation" is correspond to multiplication by "regular matrix" :) Take a look at 3 activeties.

Let $A$ be matrix as bellow. $$ \begin{pmatrix} a_{11} & a_{12} & a_{13} \\a_{21} & a_{22} & a_{23} \\a_{31} & a_{32} & a_{33} \end{pmatrix} $$

  1. Row switching
    It's equivalent to multiplication by identity matrix which interchange two rows as bellow.
    \begin{equation} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{31} & a_{32} & a_{33} \\ a_{21} & a_{22} & a_{23} \end{pmatrix} \end{equation}

    It goes without saying that determinant of regular matrix is 1.  

  2. Row multiplication
    It's equivalent to multiplication by identity matrix whose $ii$ element is multiplicated by constant $c$.
    \begin{equation} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 3a_{21} & 3a_{22} & 3a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \end{equation}
    It goes without saying that determinant of regular matrix is -3.

  3. Row addition
    It's euivalent to multiplication by identity matrix whose $ij$ element is constant $c$.
    \begin{equation} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 3 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} + 3a_{31} & a_{22} + 3a_{32} & a_{23} + 3a_{33} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \end{equation} It goes without saying that determinant of regular matrix is -1.

When think about mapping by regular matrix, since regular matrix has inverse matrix, mapping would be bijection. Therefore dimention of image which given by regular matrix is same as kernel image.
Consequently, given that a regular matrix is $P$, dimention of image given by $A$ is correspond to dimention of image given by $PA$ as following image.

Sunday, May 13, 2018

Mathematical formula in calculus you must know for machine learning①

As you may know, for the sake of appropriate understanding towards "machine learning", mathematical knowledge is imperative. In this article, I will share with you some mathematical formula in calculus you must know.

1. Product rule of differentiation

"Product rule of differentiation" is one of ways of evaluating derivatives.
$$(f(x)g(x))' = f(x)g'(x) + f'(x)g(x)$$ You can derive this formula by followings with comparative ease.
From definitin of differentiation, $$ \begin{eqnarray} (f(x)g(x))' &=& \lim_{h\rightarrow 0}\frac{f(x+h)g(x+h) - f(x)g(x)}{h} \\ &=& \lim_{h\rightarrow 0}\frac{f(x+h)g(x+h) + f(x+h)g(x) - f(x+h)g(x) - f(x)g(x)}{h} \\ &=& \lim_{h\rightarrow 0}\frac{(g(x+h) - g(x))f(x+h)}{h} + \lim_{h\rightarrow 0}\frac{(f(x+h) -f(x))g(x)}{h}\end{eqnarray}$$ Now we assume $f(x)$ is differentiable, therefore, let alone, $f(x)$ is continuous. Consequetnly, $\lim_{h\rightarrow 0} f(x+h) = f(x)$. Hence,
$$ \begin{equation} (f(x)g(x))' = g'(x)f(x) + g(x)f'(x) \end{equation}$$

2. Quotient rule of differentiation

In calculus, quotient rule of differentiation is a method to find the derivative of a function that is the ratio of two different functions. source : Quotient rule
$$ \begin{equation} \left\{\frac{g(x)}{f(x)}\right\}' = \frac{g'(x) - f'(x)}{\{f(x)\}^2} \end{equation}$$ You can derive this formula as bellow with comparative ease. $$ \begin{eqnarray} (\frac{g(x)}{f(x)})' &=& \lim_{h\rightarrow 0}\frac{\frac{g(x+h)}{f(x+h)} - \frac{g(x)}{f(x)}}{h} \\ &=& \lim_{h\rightarrow 0}\frac{g(x+h)f(x) - g(x)f(x+h)}{f(x)f(x+h)h} \\ &=& \lim_{h\rightarrow 0}\frac{g(x+h)f(x) - f(x)g(x) + f(x)g(x) - g(x)f(x+h)}{f(x)f(x+h)h} \\ &=& \lim_{h\rightarrow 0}\left\{\frac{\left\{g(x+h) - g(x)\right\}f(x)}{h} - \frac{\left\{f(x+h) - f(x)\right\}g(x)}{h}\right\}\frac{1}{f(x)f(x+h)h}\\\end{eqnarray}$$
Since $f(x)$ is differentiable let alone continuous, $$ \begin{equation} \left\{\frac{g(x)}{f(x)}\right\}' = \frac{g'(x) - f'(x)}{\{f(x)\}^2} \end{equation}$$

3. Integration by parts

"Integration by parts" or "partial integration" is a process to find the integral of a product of function in terms of the intergral of their derivative and anti derivative. source : Integration by parts $$(f(x)g(x))' = f(x)G(x) - \int f'(x)G(x)dx \ \ \ \ where\ G(x) = \int g(x)dx$$
You can derive this formula by following with comparative ease.
From "product rule of calculus",
$$\begin{eqnarray} (f(x)G(x))' &=& f'(x)G(x) + f(x)G'(x) \\ &=& f'(x)G(x) + f(x)g(x) \end{eqnarray}$$ Therefore,    $$f(x)g(x) = (f(x)G(x))' - f'(x)G(x)$$ Take Integral on both sides, $$\begin{eqnarray} \int f(x)g(x)dx &=& \int(f(x)G(x))' - f'(x)G(x)dx \\ &=& \int(f(x)G(x))'dx - \int f(x)'G(x)dx \\ &=& f(x)G(x) - \int f(x)'G(x) dx\end{eqnarray} $$

4. L'Hôpital's rule

"L'Hôpital's rule" uses derivative to help evaluate limits involving indeterminant form. source : L'Hôpital's rule
If $\lim_{x \rightarrow c}f(x) = lim_{x \rightarrow c}g(x) = 0\ or\ \infty$,
$$\lim_{x \rightarrow c} \frac{g(x)}{f(x)} = \lim_{x \rightarrow c}\frac{g'(x)}{f'(x)}$$

Tuesday, May 8, 2018

Basic steps of "Bayesian Machine Learning" ①

Basic steps of "Bayesian Machine Learning" ①

Here I'll share basic steps of "Bayesian Machine Learning" . To put it simply, the steps of "Bayesian Machine Learning" consists of 4 steps as followings where $D$ is observed data and $\theta$ is unknown parameter.

  1. Create "model" of joint probability $p(D,\theta)$ as bellow.
    $$p(D,\theta) = p(D|\theta)p(\theta)$$
    This means setting how the data would be generated. $p(D|\theta)$ is called likelihood function.
  2. Apply prior distribution to $\theta$.
    If you apply idea of conjugate distribution(*1), calculation for posterior distribution and predictive distribution will be effortless.
    (*1) "Conjugate distribution" :
    If the posterior distribution $p(\theta|D)$ is in the same probability distribution family as the prior probability distribution $p(\theta)$, the prior and posterior distribution are called conjugate distribution, and prior distribution is called "conjugate prior" for the likelihood function. sorce : Conjugate prior
  3. Compute posterior distribution $p(\theta|D)$.
    $$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}$$
    In machine leaning, this process is called "training".
  4. Calculate predictive distribution.
    $p(x_*| D) = \int p(x_*|\theta)p(\theta|D) d\theta $ (*2)
    Incidentally, you can predict without any observed value as following.
    $p(x_*) = \int p(x_*|\theta)p(\theta) d\theta = \int p(x,\theta) d\theta$

    (*2)
    $$\begin{equation}p(x_*| D) = \int p(x_*,\theta|D) d\theta \\ =\int \frac{p(x_*, \theta,D)}{p(D)} d\theta \\ = \int \frac{p(x_*|\theta)p(D|\theta)p(\theta)}{p(D)}d\theta \\ = \int p(x_*|\theta)p(\theta|D) d\theta\end{equation} $$

Example of training with bernoulli distribution

1. Mathmatical analysis

In this example, I assume that liklihood function follows bernoulli distribution. In the case, parameter $\mu$ of bernoulli distribution is required to follow $\mu \in (0,1)$. Therefore adopting beta distribution as posterior distribution is understandable :)
Let us assume $X(x_1,x_2,・・・x_n)$ is observed, posterior distribution looks like as bellow. $$\begin{equation} p(\mu|D) = \frac{p(X|\mu)p(\mu)}{p(X)}\end{equation}$$
Since $X(x_1,x_2,・・・x_n)$ is independent, $$\begin{equation} p(\mu|D) = \frac{\Pi^{n}_{i=1}p(x_i | \mu)p(\mu)}{p(X)}\end{equation}$$
Take the logarithm for both sides,
$$\begin{equation} p(\mu|D) = \Sigma^{n}_{i=1}\log p(x_i | \mu) + \log p(\mu) + const\end{equation}$$ Apply bernoulli distribution and beta distribution,
$$\begin{equation} p(\mu|D) = \Sigma^{n}_{i=1} \log \mu^{x_i}(1-\mu)^{(1-x_i)} + \log C_B (a,b)\mu^{a-1}(1-\mu)^{b-1} + const \\ = (a-1 +\Sigma_{i=1}^{n}x_i)\log\mu + (b-1+\Sigma_{i=1}^{n}(1-x_i))\log(1-\mu) + const\end{equation}$$
Then, compare with logarithm of Beta distribution.
$$\log Beta(x|a,b) = (a-1)logx +(b-1)log(1-x) + logC_B(a,b)$$
Therefore, $$p(\mu|D) = Beta(\mu|\hat{a},\hat{b})\ where\ \hat{a} = a + \Sigma_{i=1}^{n}x_i,\ \hat{b} = b+\Sigma_{i=1}^{n}(1-x_i) $$
From this equation, we can tell that Beta distribution of posterior distribution is updated, then frequency of 1 (ex. head in coin toss) was added to parameter "a", frequency of 0 (ex. tail in coin toss) was added to parameter b from observed data (likelihood function).

2. Implementation with Python 

Here I try to show how prior distribution is gonna be updated with observed data (likelihood function).
Regarding Bernoulli distribution and Beta distribution, there are useful library "scipy.stats.bernoulli" and "scipy.stats.beta" respectively.

In [1]:
from scipy.stats import bernoulli,beta
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline
In [2]:
#Tentatively, let us assume prior distribution as Beta(x|1,1).
x = np.linspace(0,1,100)
beta_prior = beta(1,1)
y_prior = beta_prior.pdf(x)
plt.plot(x,y_prior)
Out[2]:
[<matplotlib.lines.Line2D at 0x11ce039b0>]

To sum it up, we don't have any preliminary information regarding event of bernoulli at all. How unpredictable the event is !
Let us assume that true parameter of bernoulli distribution of event is 0.8. After observing data, we'll see how the prior distribution will be change.

In [3]:
# create 10 data from bernoulli distribution with parameter is 0.8.
num_of_observe = [5,25,50,100]
for num in num_of_observe:
    data_ber = bernoulli.rvs(0.8, size=num,random_state=4)
    head_num = data_ber.sum()
    tail_num = data_ber.shape[0] - head_num
    beta_pos = beta(1+head_num,1+tail_num)
    y_pos = beta_pos.pdf(x)
    plt.plot(x,y_pos,label='{} observed'.format(num))
    
plt.legend()
plt.title('Transition of "perior distribution"')
Out[3]:
<matplotlib.text.Text at 0x11f390dd8>

You can see perior distribution approaches true parameter of 0.8 :)

Saturday, May 5, 2018

What is KL(Kullback-Leibler) divergence??

This article is regarding KL(Kullback-Leibler) divergence. You can check original jupyter notebook source in the follwing link.
https://github.com/hiroshiu12/mathematics/blob/master/kl_divergence.ipynb

KL(Kullback-Leibler) divergence

1. What is KL(Kullback-Leibler) divergence ?

"KL(Kullback-Leibler) divergence" is a way of comparing two different probability distribution. When I work on probability and statistic as a datascientist , in many cases, I'm force to approximate complex distribution or replace observed data. In the situation, we can measure how much we loose information by approximation with "KL(Kullback-Leibler) divergence".

Let us assume that there are two probability distribution $p(x)$, $q(x)$. "KL(Kullback-Leibler) divergence" is dnoted by following equation. $$\begin{eqnarray} KL[q(x)][p(x)] &=& -\int q(x)\log \frac{p(x)}{q(x)} \\ &=& \int q(x)\log q(x) dx - \int q(x)\log p(x) dx \\ &=& <\log q(x)>_{q(x)} - <\log p(x)>_{q(x)} \end{eqnarray}$$

At a glance, "KL(Kullback-Leibler) divergence" looks distance metric between two probability distribution. However, strictly speaking, it doesn't meet distance axiom "Symmetric"), since generally $KL[q(x)][p(x)] \neq KL[p(x)][q(x)]$. It's divergence not distance.

2. Compute divergence

Let's take a look at computing "KL(Kullback-Leibler) divergence".
This time I'm gonna compute two different normal distribution, one has mean of 0 and standard variation of 1, the other has mean of 0 and standard variation of 2.

In [1]:
import numpy as np
from scipy.stats import norm,entropy
import matplotlib.pyplot as plt
% matplotlib inline
In [2]:
# This time, I'm gonna compute divergence of following.
#   ・ normal distribution with mean =0 and standard deviation = 1
#   ・ normal distribution with mean =0 and standard deviation = 2
x = np.linspace(-5.0,5.0,100)
px_p = norm.pdf(x,0,1)
px_q = norm.pdf(x,0,2)

# Plot two different normal distribution 
plt.plot(x,px_p, label = "mean=0\nstandard deviation=1")
plt.plot(x,px_q, label = "mean=0\nstandard deviation=2")
plt.legend()
plt.title('Two different normal distribution')

# Compute KL(Kullback-Leibler) divergence
kl_divergence = entropy(px_q,px_p)
In [3]:
print('KL(Kullback-Leibler) :',kl_divergence)
KL(Kullback-Leibler) : 0.69243465135

As you can see, scipy.stats.entropy() can be used to compute "KL(Kullback-Leibler) divergence". With one argument, it's to return entropy. Whereas with two arguments, it's supposed to return "KL(Kullback-Leibler) divergence". It's quite useful :)