
Saturday, May 26, 2018

Mutual Information

In this article, I will share with you the concept of "Mutual Information". I believe it is one of the crucial concepts when working on data science, machine learning, deep learning, and the list goes on and on. I would be glad if you enjoy this article :)

1. Conditional Entropy

"What is Entropy" was discussed in the following link.
https://hiroshiu.blogspot.com/2018/04/what-is-entropy.html
Before getting into "Mutual Information", we have to wrap our heads around "Conditional Entropy".
The "Conditional Entropy" $H(X \mid Y)$ of two discrete random variables $X \in \{x_1, x_2, \ldots, x_n\}$ and $Y \in \{y_1, y_2, \ldots, y_m\}$ is built up as follows. Conditioned on a single value $y_1$,
$$H(X \mid y_1) = -\sum_{i=1}^{n} P(x_i \mid y_1)\log P(x_i \mid y_1)$$
Therefore,
$$H(X \mid Y) = -\sum_{j=1}^{m} P(y_j)\sum_{i=1}^{n} P(x_i \mid y_j)\log P(x_i \mid y_j) = -\sum_{j=1}^{m}\sum_{i=1}^{n} P(x_i, y_j)\log\frac{P(x_i, y_j)}{P(y_j)}$$
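To make the double sum concrete, here is a minimal sketch (my own example, not from the original post) that evaluates $H(X \mid Y)$ for a made-up $2 \times 2$ joint distribution. The natural logarithm is used, so the result is in nats.

import numpy as np

# A made-up joint distribution P(x_i, y_j): rows index X, columns index Y.
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
p_y = joint.sum(axis=0)  # marginal P(y_j), one entry per column

# H(X|Y) = -sum_j sum_i P(x_i, y_j) * log(P(x_i, y_j) / P(y_j))
h_x_given_y = -np.sum(joint * np.log(joint / p_y))
print(h_x_given_y)  # conditional entropy in nats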

2. Mutual Information

Mutual Information is $H(X) - H(X \mid Y)$, which measures how much knowing one of the variables reduces uncertainty about the other.
$$I(X;Y) = H(X) - H(X \mid Y) = \sum_{y \in Y}\sum_{x \in X} P(x, y)\log\frac{P(x, y)}{P(x)\,P(y)}$$
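As a sanity check, the following sketch (reusing the made-up joint distribution from above) computes $I(X;Y)$ both from the double sum and as $H(X) - H(X \mid Y)$; the two routes agree.

import numpy as np

joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])
p_x = joint.sum(axis=1)  # marginal P(x_i)
p_y = joint.sum(axis=0)  # marginal P(y_j)

# Direct double sum: I(X;Y) = sum P(x,y) * log(P(x,y) / (P(x)P(y)))
mi = np.sum(joint * np.log(joint / np.outer(p_x, p_y)))

# Equivalent route: H(X) - H(X|Y)
h_x = -np.sum(p_x * np.log(p_x))
h_x_given_y = -np.sum(joint * np.log(joint / p_y))
print(mi, h_x - h_x_given_y)  # both print the same value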

3. Implementation

There is a useful function for this in scikit-learn. From now on, I'm going to calculate mutual information with that library.

In [27]:
from sklearn import datasets
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
import numpy as np
In [41]:
iris_dataset = datasets.load_iris()
iris_data = iris_dataset.data
iris_label = iris_dataset.target
In [13]:
# Explanatory variables
iris_data[0:3,:]
Out[13]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])
In [42]:
# Response variable (class labels)
iris_label[0:3]
Out[42]:
array([0, 0, 0])

You can check the values of mutual information with the "mutual_info_classif" function.
The third and fourth explanatory variables (indices 2 and 3) seem to have higher values than the others.

In [40]:
mutual_info_classif(X=iris_data,y=iris_label)
Out[40]:
array([ 0.48958131,  0.24431716,  0.98399648,  1.00119776])

Now you can obtain new explanatory variables that carry high mutual information. Here I'm going to extract 2 explanatory variables out of 4 with the "SelectKBest" class.

In [21]:
selecter = SelectKBest(score_func=mutual_info_classif, k=2)  # keep the 2 highest-scoring features
selecter_iris = selecter.fit(iris_data, iris_label)
In [45]:
new_iris_data = selecter_iris.fit_transform(iris_data,iris_label)
In [48]:
print('shape of new_iris_data',new_iris_data.shape)
new_iris_data[0:3,:]
shape of new_iris_data (150, 2)
Out[48]:
array([[ 1.4,  0.2],
       [ 1.4,  0.2],
       [ 1.3,  0.2]])

Now I can see that the explanatory variables with high mutual information were extracted :)
You can also check which explanatory variables were selected, as a True/False numpy array, by invoking the "get_support()" method.

In [49]:
support = selecter_iris.get_support()
print('support', support)
np.where(support == True)
support [False False  True  True]
Out[49]:
(array([2, 3]),)

You can check the other "Select..." classes here.
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
According to the link above, for a continuous target variable, "mutual_info_regression" is preferable.
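As a small sketch of that function (my own example, not from the documentation), you can reuse the iris data and treat the fourth feature, petal width, as a continuous target:

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_regression

# Treat petal width (column 3) as a continuous target and estimate its
# mutual information with the remaining three features.
X = load_iris().data
print(mutual_info_regression(X[:, :3], X[:, 3]))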
