Mutual Information
1. Conditional Entropy
"What is Entropy" was discussed in the following link.
https://hiroshiu.blogspot.com/2018/04/what-is-entropy.html
Before getting into the "Mutual Information", we have to wrap our head around "Conditional Entropy".
"Conditional Entropy" $H(x)$ of two discrete random variables $X(x_1,x_2,\cdots,x_n), Y(y_1,y_2,\cdots,y_m)$ is captured in the followings
$$H(X|y_{1}) = -\sum_{i=1}^{n}P(x_{i}|y_{1})\log(P(x_{i}|y_{1}))$$
Averaging over all outcomes of $Y$, weighted by $P(y_{j})$, gives
\begin{eqnarray}
H(X|Y) &=& -\sum_{j=1}^{m}P(y_{j})\sum_{i=1}^{n}P(x_{i}|y_{j})\log(P(x_{i}|y_{j}))\\
&=& -\sum_{j=1}^{m}\sum_{i=1}^{n}P(x_{i} \cap y_{j})\log\left(\frac{P(x_{i}\cap y_{j})}{P(y_{j})}\right)
\end{eqnarray}
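As a quick numerical illustration, here is a minimal sketch that computes $H(X|Y)$ from a small joint probability table (the numbers are made up for illustration, and the natural logarithm is used):
import numpy as np
# Hypothetical joint distribution P(x_i ∩ y_j): rows are x, columns are y
joint = np.array([[0.2, 0.1],
                  [0.1, 0.3],
                  [0.1, 0.2]])
p_y = joint.sum(axis=0)  # marginal P(y_j)
cond = joint / p_y       # P(x_i | y_j), column by column
# H(X|Y) = -sum_j sum_i P(x_i ∩ y_j) * log(P(x_i | y_j))
print(-np.sum(joint * np.log(cond)))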
2. Mutual Information
Mutual information is $H(X) - H(X|Y)$, which measures how much knowing one of the variables reduces uncertainty about the other. $$I(X;Y) = H(X) - H(X|Y) = \sum_{j=1}^{m}\sum_{i=1}^{n}P(x_{i} \cap y_{j})\log\left(\frac{P(x_{i}\cap y_{j})}{P(x_{i})P(y_{j})}\right)$$
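Why the double-sum form? Expanding $H(X)$ with $P(x_{i})=\sum_{j=1}^{m}P(x_{i}\cap y_{j})$ and using the expression for $H(X|Y)$ above:
\begin{eqnarray}
H(X) - H(X|Y) &=& -\sum_{i=1}^{n}P(x_{i})\log(P(x_{i})) + \sum_{j=1}^{m}\sum_{i=1}^{n}P(x_{i} \cap y_{j})\log\left(\frac{P(x_{i}\cap y_{j})}{P(y_{j})}\right)\\
&=& \sum_{j=1}^{m}\sum_{i=1}^{n}P(x_{i} \cap y_{j})\left(\log\left(\frac{P(x_{i}\cap y_{j})}{P(y_{j})}\right)-\log(P(x_{i}))\right)\\
&=& \sum_{j=1}^{m}\sum_{i=1}^{n}P(x_{i} \cap y_{j})\log\left(\frac{P(x_{i}\cap y_{j})}{P(x_{i})P(y_{j})}\right)
\end{eqnarray}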
3. Implementation
scikit-learn provides useful functions for this. Below, I calculate mutual information with that library.
from sklearn import datasets
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
import numpy as np
iris_dataset = datasets.load_iris()
iris_data = iris_dataset.data
iris_label = iris_dataset.target
# Explanatory variables
iris_data[0:3,:]
# Response variable
iris_label[0:3]
You can check the mutual information values with the "mutual_info_classif" function.
The 3rd and 4th explanatory variables (indices 2 and 3, petal length and petal width) seem to have higher values than the others.
mutual_info_classif(X=iris_data,y=iris_label)
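As a rough cross-check, "mutual_info_score" from sklearn.metrics computes mutual information between two discrete variables. The 10-bin discretization below is my own crude choice, and "mutual_info_classif" uses a nearest-neighbor estimator for continuous features, so the two numbers will only roughly agree:
from sklearn.metrics import mutual_info_score
# Crudely bin the explanatory variable at index 2 (petal length) into 10 bins
bins = np.histogram_bin_edges(iris_data[:, 2], bins=10)
binned = np.digitize(iris_data[:, 2], bins)
print(mutual_info_score(binned, iris_label))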
Now you can obtain a new set of explanatory variables consisting of those with high mutual information. Here I extract 2 explanatory variables out of 4 with the "SelectKBest" class.
selecter = SelectKBest(score_func=mutual_info_classif, k=2)
selecter_iris = selecter.fit(iris_data,iris_label)
new_iris_data = selecter_iris.transform(iris_data)
print('shape of new_iris_data',new_iris_data.shape)
new_iris_data[0:3,:]
Now I can see that the explanatory variables with high mutual information were extracted :)
You can also check which explanatory variables were selected, as a True/False numpy array, by invoking the "get_support()" method.
support = selecter_iris.get_support()
print('support', support)
np.where(support)
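To see the selected features by name, you can index the dataset's "feature_names" with the support mask:
# Map the boolean mask back to the original feature names
print(np.array(iris_dataset.feature_names)[support])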
You can check the other "Select..." classes here:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
According to the link above, "mutual_info_regression" is preferable for a continuous response variable.
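As a minimal sketch of that case, assuming the diabetes dataset (my choice for illustration) as a regression problem:
from sklearn import datasets
from sklearn.feature_selection import mutual_info_regression
# Continuous response variable, so mutual_info_regression is used instead
diabetes = datasets.load_diabetes()
print(mutual_info_regression(X=diabetes.data, y=diabetes.target))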