Mutual Information
1. Conditional Entropy
"What is Entropy" was discussed in the following link.
https://hiroshiu.blogspot.com/2018/04/what-is-entropy.html
Before getting into "Mutual Information", we have to wrap our heads around "Conditional Entropy".
The conditional entropy H(X|Y) of two discrete random variables X (taking values x_1, x_2, ⋯, x_n) and Y (taking values y_1, y_2, ⋯, y_m) is built up as follows. First, the entropy of X conditioned on a single outcome y_1 is
$$H(X \mid y_1) = -\sum_{i=1}^{n} P(x_i \mid y_1)\,\log P(x_i \mid y_1)$$
Taking the expectation over Y then gives
$$H(X \mid Y) = -\sum_{j=1}^{m} P(y_j) \sum_{i=1}^{n} P(x_i \mid y_j)\,\log P(x_i \mid y_j) = -\sum_{j=1}^{m}\sum_{i=1}^{n} P(x_i \cap y_j)\,\log\frac{P(x_i \cap y_j)}{P(y_j)}$$
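As a quick sanity check, here is a minimal NumPy sketch of this formula. The joint distribution P(x_i ∩ y_j) is a made-up example, with rows indexing x and columns indexing y.
import numpy as np
# Hypothetical joint distribution P(xi ∩ yj): rows are x, columns are y
joint = np.array([[0.2, 0.1],
                  [0.1, 0.3],
                  [0.1, 0.2]])
p_y = joint.sum(axis=0)  # marginal P(yj)
# H(X|Y) = -sum_ij P(xi ∩ yj) * log(P(xi ∩ yj) / P(yj))
cond_entropy = -np.sum(joint * np.log(joint / p_y))
print(cond_entropy)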
2. Mutual Information
Mutual information is H(X) − H(X|Y); it measures how much knowing one of the variables reduces uncertainty about the other.
$$I(X;Y) = H(X) - H(X \mid Y) = \sum_{y \in Y}\sum_{x \in X} P(x \cap y)\,\log\frac{P(x \cap y)}{P(x)\,P(y)}$$
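Reusing the toy joint distribution from above, a short sketch can confirm that the two expressions, H(X) − H(X|Y) and the double sum over the joint distribution, give the same number.
import numpy as np
# Same made-up joint distribution as in the conditional entropy sketch
joint = np.array([[0.2, 0.1],
                  [0.1, 0.3],
                  [0.1, 0.2]])
p_x = joint.sum(axis=1)  # marginal P(xi)
p_y = joint.sum(axis=0)  # marginal P(yj)
h_x = -np.sum(p_x * np.log(p_x))                    # H(X)
h_x_given_y = -np.sum(joint * np.log(joint / p_y))  # H(X|Y)
# I(X;Y) two ways: via the definition and via the double sum
mi_def = h_x - h_x_given_y
mi_sum = np.sum(joint * np.log(joint / np.outer(p_x, p_y)))
print(mi_def, mi_sum)  # both values match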
3. Implementation
scikit-learn provides a useful module for this. From here on, I'll calculate mutual information with that library.
from sklearn import datasets
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
import numpy as np
iris_dataset = datasets.load_iris()
iris_data = iris_dataset.data
iris_label = iris_dataset.target
# Explanatory variables
iris_data[0:3,:]
# Response variable
iris_label[0:3]
You can check the mutual information values with the "mutual_info_classif" function.
The 3rd and 4th explanatory variables (indices 2 and 3) seem to have higher values than the others.
mutual_info_classif(X=iris_data,y=iris_label)
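To make the scores easier to read, you can pair each one with its feature name. This is just a small sketch: "feature_names" is part of the iris dataset object, and "random_state" only makes the estimate reproducible, since "mutual_info_classif" uses a randomized nearest-neighbor estimator.
scores = mutual_info_classif(X=iris_data, y=iris_label, random_state=0)
for name, score in zip(iris_dataset.feature_names, scores):
    print(name, score)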
Now you can obtain a new set of explanatory variables consisting of those with high mutual information. Here I'm going to extract 2 explanatory variables out of 4 with the "SelectKBest" class.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
# fit() returns the selector itself, so a single fit_transform() call is enough
new_iris_data = selector.fit_transform(iris_data, iris_label)
print('shape of new_iris_data',new_iris_data.shape)
new_iris_data[0:3,:]
Now I can see that the explanatory variables with high mutual information were extracted :)
You can also check which explanatory variables were selected, as a True/False numpy array, by invoking the "get_support()" method.
support = selector.get_support()
print('support', support)
np.where(support)
You can check the other "Select..." classes here:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
According to the link above, for a continuous response variable, "mutual_info_regression" seems to be preferable.
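For completeness, here is a minimal sketch of the regression counterpart; the diabetes dataset is just a stand-in for any data with a continuous response variable.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import mutual_info_regression
# The response variable of this dataset is continuous
diabetes = load_diabetes()
mi_scores = mutual_info_regression(diabetes.data, diabetes.target, random_state=0)
print(mi_scores)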