Friday, April 13, 2018

Create Histogram

A histogram is useful when I inspect a dataset, specifically the relation between each explanatory variable and the response variable.
So this post is a memo on how to create one.
Here, the famous "iris" dataset is used.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_iris

First of all, check what the data looks like.

In [2]:
iris_dataset = load_iris()

iris_data = iris_dataset.data
iris_target = iris_dataset.target
In [3]:
iris_data.shape
Out[3]:
(150, 4)
In [4]:
iris_data[0:5]
Out[4]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])
In [5]:
# Check the distinct class labels in the target data
np.unique(iris_target)
Out[5]:
array([0, 1, 2])
In [6]:
iris_target.shape
Out[6]:
(150,)
In [7]:
iris_target[0:20]
Out[7]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
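
Since pandas is already imported, the same check can also be done with a DataFrame, which shows the column names and how many samples each species has. This is only a sketch: `iris_df` and `species_counts` are names made up for this memo, while `feature_names` and `target_names` come from the object returned by `load_iris()`.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Put the four explanatory variables into named columns
iris_df = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)

# Map the integer labels (0, 1, 2) to the species names
iris_df["species"] = iris_dataset.target_names[iris_dataset.target]

# Count the samples per species
species_counts = iris_df["species"].value_counts()

print(iris_df.head())
print(species_counts)
```

The first 20 labels are all 0 because the samples are stored grouped by species, so the counts give a better picture than a slice of the target array.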

Create histogram

In [8]:
# compute bin edges from the first explanatory variable (sepal length) over all samples
var_0 = iris_data[:,0]
count,bins = np.histogram(var_0,bins=30)
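
As a quick check of what `np.histogram` gives back here, the sketch below looks at the shapes of its two return values. It assumes the same iris data as above. The point is that `bins` holds the bin edges, so 30 bins give 31 edges, and reusing these edges later keeps the bars of every species aligned.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_data = load_iris().data
var_0 = iris_data[:, 0]

# count: number of samples per bin, bins: edges of the bins
count, bins = np.histogram(var_0, bins=30)

print(count.shape)        # one count per bin
print(bins.shape)         # one more edge than there are bins
print(bins[0], bins[-1])  # the edges span the min and max of the data
print(count.sum())        # every sample falls into some bin
```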
In [9]:
# separate the explanatory variables into Setosa, Versicolor and Virginica
setosa = iris_data[iris_target ==0]
versicoiour = iris_data[iris_target ==1]
virginica = iris_data[iris_target ==2]
In [10]:
plt.hist(setosa[:,0],bins=bins,alpha=0.5)
plt.hist(versicoiour[:,0],bins=bins,alpha=0.5)
plt.hist(virginica[:,0],bins=bins,alpha=0.5)
Out[10]:
(array([ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  3.,
         1.,  2.,  4.,  6.,  5.,  4.,  0.,  7.,  3.,  0.,  1.,  4.,  1.,
         0.,  1.,  4.,  1.]),
 array([ 4.3 ,  4.42,  4.54,  4.66,  4.78,  4.9 ,  5.02,  5.14,  5.26,
         5.38,  5.5 ,  5.62,  5.74,  5.86,  5.98,  6.1 ,  6.22,  6.34,
         6.46,  6.58,  6.7 ,  6.82,  6.94,  7.06,  7.18,  7.3 ,  7.42,
         7.54,  7.66,  7.78,  7.9 ]),
 <a list of 30 Patch objects>)
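
The plot above has no legend, so it is not obvious which color is which species. A possible variation, written as a sketch rather than the one right way, passes `label` to each `hist` call and adds axis labels. The loop over `target_names` and the figure size are my own choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris_dataset = load_iris()
iris_data, iris_target = iris_dataset.data, iris_dataset.target

# Shared bin edges computed from all samples of the first variable
count, bins = np.histogram(iris_data[:, 0], bins=30)

fig, ax = plt.subplots(figsize=(6, 4))
for class_id, name in enumerate(iris_dataset.target_names):
    # Select the rows of one species and plot the first variable
    ax.hist(iris_data[iris_target == class_id, 0],
            bins=bins, alpha=0.5, label=name)

ax.set_xlabel(iris_dataset.feature_names[0])
ax.set_ylabel("count")
ax.legend()
```

Outside a notebook, `plt.show()` would be needed at the end to display the figure.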

With the first variable (sepal length) alone, the histograms of the three species overlap, so it is hard to tell the species apart.
Next, I'd like to look at all four variables.

In [11]:
fig,axes = plt.subplots(2,2,figsize=(12,12))
axes_1dim = axes.ravel()

for i in range(iris_data.shape[1]):
    count,bins = np.histogram(iris_data[:,i],bins=30)
    axes_1dim[i].hist(setosa[:,i],bins=bins,alpha=0.5)
    axes_1dim[i].hist(versicoiour[:,i],bins=bins,alpha=0.5)
    axes_1dim[i].hist(virginica[:,i],bins=bins,alpha=0.5)

According to the result above, it seems setosa can be told apart using only the third or the fourth explanatory variable (petal length or petal width) :)
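
To back up the visual impression with numbers, one simple check is to compare the value range of setosa with the range of the other two species for each variable. This is a rough sketch, not a proper classifier; `others` and `no_overlap` are names I made up for it.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
iris_data, iris_target = iris_dataset.data, iris_dataset.target

setosa = iris_data[iris_target == 0]
others = iris_data[iris_target != 0]  # versicolor and virginica together

no_overlap = {}
for i, name in enumerate(iris_dataset.feature_names):
    s_min, s_max = setosa[:, i].min(), setosa[:, i].max()
    o_min, o_max = others[:, i].min(), others[:, i].max()
    # True when the two value ranges do not touch at all
    no_overlap[name] = bool(s_max < o_min or s_min > o_max)
    print(f"{name}: setosa [{s_min}, {s_max}], others [{o_min}, {o_max}], "
          f"no overlap: {no_overlap[name]}")
```

For the variables where the ranges do not overlap, a single threshold is enough to pick out setosa.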
