Sunday, July 8, 2018

Decision Tree

In this article, I'll share with you the fundamentals of the "Decision Tree" machine learning algorithm. Decision trees are valued above all for their "interpretability". However, my ulterior motive for learning about decision trees is to understand "Random Forest", one form of "ensemble learning", afterwards.

0. Introduction to "Decision Tree"

Basically, we start at the tree root and split the data into two child nodes. We then repeat this splitting procedure at each child node until the leaves are pure, which means the samples at each leaf all belong to the same class. In practice, however, this easily leads to overfitting, so "pruning" is required. We will talk about it later.

1. Information gain

When splitting the data at each node, we have to decide which feature to split on. Needless to say, we want to use the most informative feature possible. To do so, we need an objective function to optimize. In the context of decision trees, that objective is the "information gain", defined as follows.

$$ IG(D_p) = I(D_p) - \left\{\frac{N_{left}}{N_{p}}I(D_{left}) + \frac{N_{right}}{N_{p}}I(D_{right})\right\}\tag{1}$$

where $D_p$ is the dataset at the parent node, $D_{left}$ and $D_{right}$ are the datasets at the child nodes, $N_p$ is the total number of samples at the parent node, and $N_{left}$ and $N_{right}$ are the numbers of samples in the respective child nodes after the split. $I()$ is the impurity measure. As we can see, the information gain is simply the difference between the impurity of the parent node and the weighted sum of the impurities of the child nodes.
For the impurity measure, I will show you two representative choices: one is "Gini impurity" and the other is "Entropy".
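
Before moving on to those measures, let me translate Eq. (1) into code. The following is only a minimal sketch of the formula as written above; the function name `information_gain` and the `impurity` argument (which can be any impurity function, such as the Gini impurity or entropy defined below) are my own choices, not part of any library.

In [ ]:
def information_gain(y_parent, y_left, y_right, impurity):
    # Eq. (1): impurity of the parent minus the weighted impurity of the children
    n = len(y_parent)
    weighted_children = (len(y_left) / n) * impurity(y_left) \
                        + (len(y_right) / n) * impurity(y_right)
    return impurity(y_parent) - weighted_children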

2. Gini impurity

"Ginni impurity" is a measure of likelihood of classification if new data is calssified according to the distibution which is generated by the number of class from dataset.
Supporse, $J$ is the number of classes, let $i$ be $i \in \{1,2,\cdots, J\}$, $p_i$ be proportion of class $i$ in the dataset. Gini impurity $I_G(p)$ can be depicted as, $$\begin{eqnarray}I_G(p) &=& \sum_{i=1}^J p(i)(1-p(i)) \\ &=& \sum_{i=1}^J p(i)-p(i)^2\\ &=& \sum_{i=1}^J p(i)- \sum_{i=1}^Jp(i)^2\\ &=& 1- \sum_{i=1}^Jp(i)^2\\ \end{eqnarray}$$
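
As a quick sanity check, the last line, $1-\sum_{i=1}^J p(i)^2$, is easy to code up. The helper below is just a small sketch of the formula, not a scikit-learn function.

In [ ]:
import numpy as np

def gini(y):
    # Gini impurity 1 - sum_i p(i)^2 of a vector of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([0, 0, 1, 1])))   # 0.5, a 50/50 node is maximally impure for two classes
print(gini(np.array([0, 0, 0, 0])))   # 0.0, a pure node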

3. Entropy

On the other hand, "entorpy" describe unpredictability of event. To be precise, it's the expectation of Self-information. $$I_H = -\sum_i^J p_i(x)\log p_i(x)$$ If detail of Entopy is on your mind, you can check article titled What is Entropy?

4. Which impurity measure is preferable?

In practice, Gini impurity and entropy yield very similar results, so it is not worth spending much time comparing them; I would rather recommend concentrating on "pruning". In terms of computation, Gini impurity is slightly less expensive than entropy, as no logarithm has to be evaluated.
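
An informal way to see how similar they are is to tabulate both measures for a two-class node while varying the class proportion; halving the (base-2) entropy makes the two curves nearly coincide.

In [ ]:
import numpy as np

# Two-class node: proportion p for class 0, (1 - p) for class 1
p = np.linspace(0.05, 0.95, 10)
gini_vals = 1 - (p ** 2 + (1 - p) ** 2)
entropy_vals = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for pi, g, h in zip(p, gini_vals, entropy_vals):
    print('p = {:.2f}   gini = {:.3f}   entropy/2 = {:.3f}'.format(pi, g, h / 2))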

5. Implementation

A picture is worth a thousand words. In this section, I will fit decision trees and visualize how they separate the feature space and classify the data. The iris dataset and scikit-learn's DecisionTreeClassifier will be used.

In [66]:
from sklearn import datasets
import numpy as np
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import cm
In [22]:
# Preparation of data
iris = datasets.load_iris()
data = iris.data
target = iris.target

# Narrow target down to 0 or 1.
mask01 = np.logical_or(target == 0, target == 1)
target = target[mask01]
data = data[mask01]

# Narrow explanatory variable down to 
# sepal length and sepal width.
data = data[:,0:2]
In [102]:
# data for x axis and y axis
length = data[:,0]
width = data[:,1]

x_min = length.min() -1
x_max = length.max() + 1
y_min = width.min() - 1
y_max = width.max() + 1

x = np.linspace(x_min,x_max,100)
y = np.linspace(y_min,y_max,100)

xx,yy = np.meshgrid(x,y)

# List of max_depth values to try
depth_list = np.array([1,2,3,4])

# Shape of the subplot grid (2 rows x 2 columns)
plot_tuple = depth_list.reshape(-1,2).shape

# Preparation of multiple plot
fig = plt.figure(figsize=(13,12))
fig.suptitle('Decision Tree with {} different max_depth settings'.format(depth_list.shape[0]),
             fontsize=18)

for i, depth in enumerate(depth_list):

    # Decision tree limited to the current max_depth
    tree = DecisionTreeClassifier(criterion='gini', max_depth=depth,
                                  random_state=3)

    # Construct tree
    tree.fit(data,target)
    z = tree.predict(np.c_[xx.ravel(), yy.ravel()])

    # Region of prediction.
    zz = z.reshape(xx.shape)

    axes = fig.add_subplot(plot_tuple[0],plot_tuple[1],i+1)
    
    # plot region
    axes.contourf(xx,yy,zz,cmap=cm.Accent_r)

    # plot data
    axes.scatter(length[target==0],width[target==0],c='white',marker='^')
    axes.scatter(length[target==1],width[target==1],c='red',marker='o')
    axes.set_xlabel('sepal length')
    axes.set_ylabel('sepal width')
    axes.set_title('max_depth = {}'.format(depth),fontsize=16)
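
Since interpretability was the motivation for studying decision trees in the first place, it is also worth inspecting the fitted tree itself, not only the decision regions. The snippet below draws the last tree fitted in the loop above (max_depth = 4) with scikit-learn's plot_tree; note that plot_tree is only available in scikit-learn 0.21 or later.

In [ ]:
from sklearn.tree import plot_tree

# Visualize the split thresholds and class counts of the fitted tree
fig2, ax = plt.subplots(figsize=(10, 6))
plot_tree(tree, feature_names=['sepal length', 'sepal width'],
          class_names=['setosa', 'versicolor'], filled=True, ax=ax)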
