Monday, April 30, 2018

K-fold Cross Validation (scikit learn on Python3)

K-fold Cross validation

In my usual work as a data scientist, one of the headaches I have is a shortage of labeled data. In that case, an effective method against it is "cross-validation". Actually there are several methods of cross-validation.

  • K-fold cross-validation
  • Shuffle split cross-validation
  • Leave one out cross-validation

Here I'm gonna handle "K-fold cross-validation".

0. What is K-fold validation

When performing k-fold validation, first of all, all labeled data is partitioned into k parts of equal size, which are called "folds". k is usually a specific number such as 5 or 10. Next, a sequence of models is trained. The first model is trained with the first fold as the test set and the remaining folds as the training set. Then the second model is built with the second fold as the test set and folds 1, 3, 4 and 5 as the training set. This process is repeated using folds 3, 4 and 5 as the test set. A minimal sketch of this rotation follows.
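
Here is a minimal sketch of that rotation using scikit-learn's KFold on ten toy samples (the data is made up purely for illustration):

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(10).reshape(-1, 1)  # ten toy samples
kfold = KFold(n_splits=5)

# Each of the 5 folds serves as the test set exactly once
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print('fold', fold, '- train:', train_idx, 'test:', test_idx)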

1. Benefit from k-fold validation

So far as I know, roughly speaking, there are three benefits obtained from k-fold validation instead of a single split into train and test data.

  1. Each sample is supposed to be in the test set exactly once, as each sample belongs to one of the folds, and each fold is the test set once. Therefore the model is required to yield a high cross-validation score for all samples.
  2. It provides information on how sensitive our model is to the selection of the train and test sets.
  3. We can use labeled data more effectively than with a single split of the data.

However, you must keep the following in mind:

  • cross-validation is not a way to build a model that can be applied to new data; it only tells you how well a given algorithm is likely to generalize.
  • cross-validation increases the computational cost.

2. K-fold validation in scikit-learn

In [1]:
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
In [2]:
cancer_dataset = load_breast_cancer()
cancer_data = cancer_dataset.data
cancer_target = cancer_dataset.target
In [3]:
# There are 569 samples
print('There are {} samples'.format(cancer_target.shape[0]))
print('There are {} samples with label "1"'.format(cancer_target.sum()))
There are 569 samples
There are 357 samples with label "1"

You can see "StratifiedKFold" offer dataset with same propotion as dataset.

In [4]:
# Tentatively k = 5.
cross_val = StratifiedKFold(n_splits=5)
for train,test in cross_val.split(cancer_data,cancer_target):
    print('Number of train data :',train.shape)
    print('Number of test data : ',test.shape)
    print('Number of label 1 in test data:', cancer_target[test].sum())
Number of train data : (454,)
Number of test data :  (115,)
Number of label 1 in test data: 72
Number of train data : (454,)
Number of test data :  (115,)
Number of label 1 in test data: 72
Number of train data : (456,)
Number of test data :  (113,)
Number of label 1 in test data: 71
Number of train data : (456,)
Number of test data :  (113,)
Number of label 1 in test data: 71
Number of train data : (456,)
Number of test data :  (113,)
Number of label 1 in test data: 71
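
Incidentally, if you only need one score per fold rather than the raw indices, scikit-learn's cross_val_score runs the whole loop in a single call. Here is a minimal sketch with the same data (as far as I know, an integer cv with a classifier uses stratified folds):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# One accuracy score per fold; their spread hints at how sensitive
# the model is to the particular train/test split.
scores = cross_val_score(LogisticRegression(), cancer_data, cancer_target, cv=5)
print(scores)
print('mean accuracy: {:.3f}'.format(scores.mean()))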

Next, I'm gonna draw the ROC curve of each cross-validation fold.

In [5]:
# Prepare model of logistic regression
logistic_model = LogisticRegression()

mean_fpr = np.linspace(0,1,100)
tprs = []
precisions = []
recalls = []
i=0

for train, test in cross_val.split(cancer_data,cancer_target):
    X_train = cancer_data[train]
    Y_train = cancer_target[train]
    X_test = cancer_data[test]
    Y_test = cancer_target[test]
    fitted_model = logistic_model.fit(X_train,Y_train)
    fpr,tpr,threshold = metrics.roc_curve(y_true=Y_test,
                                          y_score=fitted_model.predict_proba(X_test)[:,1])
    # Store TPR interpolated onto a fixed FPR grid so the ROC curves
    # can be averaged across folds later.
    tprs.append(np.interp(mean_fpr,xp=fpr,fp=tpr))
    
    roc_auc = metrics.auc(x=fpr,y=tpr)
    
    precisions.append(metrics.precision_score(y_true=Y_test,
                                              y_pred=fitted_model.predict(X_test)))
    recalls.append(metrics.recall_score(Y_test,
                                        fitted_model.predict(X_test)))
    plt.plot(fpr,tpr,alpha=0.3,label = 'ROC fold {} (AUC={:.2f})'.
             format(i,roc_auc))
    # Increment the fold counter so each curve gets its own label
    i += 1

# Compute the mean ROC curve across folds (average the interpolated TPRs)
mean_tprs = np.mean(tprs,axis=0)

plt.plot(mean_fpr,mean_tprs,label='Mean ROC',color='b')
plt.plot([0,1],[0,1],linestyle='--',color='r')
plt.legend(loc='lower right')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('ROC curve')
Out[5]:
<matplotlib.text.Text at 0x11d87eb70>

By default, LogisticRegression's predict method uses a threshold of 0.5. Hence the following precision and recall values are based on a threshold of 0.5.

In [6]:
df_pre_rec = pd.DataFrame([precisions,recalls]).T
df_pre_rec.columns = ['precision','recall']
df_pre_rec.head()
Out[6]:
precision recall
0 0.911392 1.000000
1 0.933333 0.972222
2 0.972222 0.985915
3 0.957746 0.957746
4 0.985507 0.957746
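
If you want a different operating point, a custom threshold can be applied to predict_proba by hand. This is just a sketch reusing fitted_model, X_test and Y_test from the last fold above; the 0.3 threshold is an arbitrary example.

# Predictions at a hand-picked threshold instead of the default 0.5
probability = fitted_model.predict_proba(X_test)[:, 1]
custom_pred = (probability >= 0.3).astype(int)
print('precision:', metrics.precision_score(Y_test, custom_pred))
print('recall   :', metrics.recall_score(Y_test, custom_pred))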

Monday, April 23, 2018

How to add a user dictionary to MeCab?

When I analyze Japanese, one of the biggest differences from English is morphological analysis. Therefore, for me, "Natural Language Processing (NLP)", "Topic Model" and "MeCab" go hand-in-hand. Almost every time, I need a user dictionary to apply to MeCab.
For instance, the word "M&A" is separated into "M", "&" and "A" as below.

$ echo 'M&A' | mecab
M 名詞,固有名詞,組織,*,*,*,*
& 名詞,サ変接続,*,*,*,*,*
A 名詞,固有名詞,組織,*,*,*,*
EOS
$ 

However, that is expected to be treated as one word. Hence I'm gonna write down the way to add a user dictionary to MeCab here as a reminder for myself.

1. Create csv file

At first, you need to prepare a CSV dictionary file. The format is the following.
表層形,左文脈ID,右文脈ID,コスト,品詞,品詞細分類1,品詞細分類2,品詞細分類3,活用型,活用形,原形,読み,発音
(surface form, left context ID, right context ID, cost, part of speech, POS subdivision 1, POS subdivision 2, POS subdivision 3, conjugation type, conjugated form, base form, reading, pronunciation)
source : MeCab: 単語の追加方法 ("How to add words")
The surface form (表層形) and the cost (コスト) seem to be required at minimum. Regarding the cost, a smaller number is prioritized, so a word whose cost is 1 has the highest priority.
Tentatively, I made a user dictionary as follows:

$ cat ./dic_M\&A.csv 
M&A,,,1,名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
$ 

2. Create dictionary

First of all, a directory for the user dictionary should be created.

$ ls /usr/local/lib/mecab/dic/
ipadic
$ mkdir /usr/local/lib/mecab/dic/userdic
$ ls /usr/local/lib/mecab/dic/
ipadic  userdic

Now you can create the dictionary with the "mecab-dict-index" command as below.

$ /usr/local/Cellar/mecab/0.996/libexec/mecab/mecab-dict-index -d /usr/local/lib/mecab/dic/ipadic/ -u /usr/local/lib/mecab/dic/userdic/user.dic -f utf-8 -t utf-8 "dic_M&A.csv"
reading dic_M&A.csv ... 1
emitting double-array: 100% |###########################################| 

done!
$ ls /usr/local/lib/mecab/dic/userdic/user.dic 
/usr/local/lib/mecab/dic/userdic/user.dic
$ cat /usr/local/lib/mecab/dic/userdic/user.dic 
H?q?f$$?
Gutf-8))R????R&名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,hiroshis-mbp:dictionary_for_Mecab uratah$ 

3. Register user dictionary to mecabrc

Unless the user dictionary is registered in mecabrc, it won't take effect.

$ cp -p /usr/local/etc/mecabrc /usr/local/etc/mecabrc.20180423
$ vi /usr/local/etc/mecabrc
$ diff /usr/local/etc/mecabrc /usr/local/etc/mecabrc.20180423 
9d8
< userdic = /usr/local/lib/mecab/dic/userdic/user.dic
$ 

4. Check the behavior

Now it is time to check the behavior of MeCab with the user dictionary applied.

$ echo 'M&A' | mecab
M&A 名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
EOS
$ 

Postscript

The user dictionary can also be specified with the -u option at execution time.

$ echo 'M&A' | mecab -u /usr/local/lib/mecab/dic/userdic/user.dic
M&A 名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
EOS
$ 

You can specify multiple user dictionaries by concatenating them with a comma "," as follows.

userdic = /usr/local/lib/mecab/dic/userdic/user.dic,/usr/local/lib/mecab/dic/userdic/user2.dic
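
Incidentally, the same user dictionary can be used from Python. Here is a minimal sketch, assuming the mecab-python3 bindings are installed; the dictionary path is the user.dic built above.

import MeCab

# Pass the user dictionary with the -u option, just like on the command line
tagger = MeCab.Tagger('-u /usr/local/lib/mecab/dic/userdic/user.dic')
print(tagger.parse('M&A'))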

Thursday, April 19, 2018

Tf-idf in Python (usage of tfidfvectorizer)

You can check the original Jupyter notebook source at the following link. https://github.com/hiroshiu12/probability_statistic/blob/master/tfidfvectorizer.ipynb

How to use tfidfvectorizer ?

"tf-idf" is a numerical statistic that is intended to reflect how important a word is to a document in a corpus. 
source : https://en.wikipedia.org/wiki/Tf%E2%80%93idf

  • "tf" : the number of times each term occures in each document.
  • "idf" : A measure of how much imformation the word provides.

There is a really useful class, "TfidfVectorizer", in scikit-learn to compute tf-idf values.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

Now I'm gonna compute the tf-idf values of the following corpus.

In [2]:
corpus = ['Seize the day!','Panic seized him','Police seized his device.']
In [3]:
# Call the constructor of TfidfVectorizer
tfidf_ins = TfidfVectorizer()
tfidf_ins.fit(corpus)
# You can check the vocabulary the TfidfVectorizer holds.
tfidf_ins.get_feature_names()
Out[3]:
['day', 'device', 'him', 'his', 'panic', 'police', 'seize', 'seized', 'the']

Oops! The word "seize" doesn't seem to be normalized... The tf and idf of "seize" and "seized" should be computed as the same word.
I think there are various kinds of normalization (stemming, lemmatization) in Python. However, this time, I simply specify the vocabulary in the constructor and replace "seized" with "seize" in the corpus.

In [4]:
vocabulary = ['day','device','panic','police','seize','the']
# You can specify the vocabulary in the constructor
tfidf_ins2 = TfidfVectorizer(vocabulary=vocabulary)
print('vocabulary : ',tfidf_ins2.vocabulary)

tfidf_ins2.fit([document.replace('seized','seize') for document in corpus])
tfidf_ins2.get_feature_names()
vocabulary :  ['day', 'device', 'panic', 'police', 'seize', 'the']
Out[4]:
['day', 'device', 'panic', 'police', 'seize', 'the']
In [11]:
tfidf_vect = tfidf_ins2.transform([document.replace('seized','seize') for 
                                   document in corpus])
tfidf_vect.toarray()
Out[11]:
array([[ 0.65249088,  0.        ,  0.        ,  0.        ,  0.38537163,
         0.65249088],
       [ 0.        ,  0.        ,  0.861037  ,  0.        ,  0.50854232,
         0.        ],
       [ 0.        ,  0.65249088,  0.        ,  0.65249088,  0.38537163,
         0.        ]])
In [12]:
print('tfidf of "seize"\n',tfidf_vect[:,np.where(
    np.array(tfidf_ins2.get_feature_names())=='seize')[0][0]])
print('tfidf of "the"\n',tfidf_vect[:,np.where(
    np.array(tfidf_ins2.get_feature_names())=='the')[0][0]])
tfidf of "seize"
   (0, 0) 0.385371627466
  (1, 0) 0.508542320378
  (2, 0) 0.385371627466
tfidf of "the"
   (0, 0) 0.652490884513

"tfidfvectorizer" normalize each vector.
As you can see, the word 'seize' appear every documents, therefore the value of tfidf is lowest in each document.
However although the word 'the' seems to be useless to recognize semantic analysis, it has quite high value in document 1.
Hence I wanna register this word as a "stop word" :)
Actually there is already frozenset of stop-word in "sklearn.feature_extraction.text.ENGLISH_STOP_WORDS."
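
Before moving on to stop words, here is a quick check of that normalization (a small sketch reusing tfidf_vect from above): the L2 norm of each document vector should be 1.

# Every row of the tf-idf matrix has unit L2 norm
print(np.linalg.norm(tfidf_vect.toarray(), axis=1))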

In [7]:
from sklearn.feature_extraction import text
In [8]:
print('Number of preset ENGLISH_STOP_WORDS : ',
      len(text.ENGLISH_STOP_WORDS))
# 'the' is already in "ENGLISH_STOP_WORDS"
'the' in text.ENGLISH_STOP_WORDS
Number of preset ENGLISH_STOP_WORDS :  318
Out[8]:
True

When I wrote this cell, I noticed that if you specify "vocabulary" in the constructor, it seems to take priority over the stop words.
As a matter of fact, "the" was not excluded when the vocabulary was specified!

In [9]:
# Specify stop words in the constructor.
# If you specify "vocabulary" in the constructor, it seems to take
# priority: as a matter of fact, "the" was not excluded when the
# vocabulary was specified.
tfidf_ins_stop = TfidfVectorizer(analyzer='word',
                                 stop_words=text.ENGLISH_STOP_WORDS)
print('vocabulary : ',tfidf_ins_stop.vocabulary)

tfidf_ins_stop.fit([document.replace('seized','seize') 
                    for document in corpus])
tfidf_ins_stop.get_feature_names()
vocabulary :  None
Out[9]:
['day', 'device', 'panic', 'police', 'seize']
In [10]:
tfidf_vect = tfidf_ins_stop.transform([document.replace('seized','seize') 
                                       for document in corpus])
tfidf_vect.toarray()
Out[10]:
array([[ 0.861037  ,  0.        ,  0.        ,  0.        ,  0.50854232],
       [ 0.        ,  0.        ,  0.861037  ,  0.        ,  0.50854232],
       [ 0.        ,  0.65249088,  0.        ,  0.65249088,  0.38537163]])

Now you can see more interpretable vectors.

Monday, April 16, 2018

How to enable log on Vyatta5600

Here is a bit of a reminder regarding how to set up logging on the Vyatta5600.

1. How to set log for firewall?


Type this command to activate logging for the default action of the firewall.
    
        vyatta@vyatta# set security firewall name <firewall name> default-log
    
The configuration looks like the following.
     
 vyatta@vyatta1# show security firewall name <firewall name>
 name <firewall name> {
        default-action drop
+       default-log
        rule 1 {
                action accept
            ~ omitted below ~
    
Don't forget 'commit' and 'save'!

2. How to check log?


You can see the log of the default action of the specified firewall.
     
vyatta@vyatta:~$ show log firewall name <firewall name>
    
Sometimes the above command is a little bit boring since there is no movement. If you need something dynamic, you can type the command below and monitor the log in real time.
     
vyatta@vyatta:~$ monitor firewall name <firewall name>
    

Frequently used git command notes

Occasionally, I forget a command or its functionality and check it in GitPro. So, in order to avoid that, I'm taking a note of frequently used commands here.
    
        $ git fetch origin
    

Explanation :

First of all, this command looks up which server "origin" is. Then it fetches any data from it that you don't have yet and updates your local database, for instance moving the origin/master pointer to its new, more up-to-date position.

When? :

My coworker does something new on the git server, like pushing a modification or creating a new branch. When I notice that the remote-tracking branch shown as "origin/{branch name}", which my coworker said he created and pushed, is not there, before quarreling, I just simply type this command.
    
        $ git branch <new branchName> origin/<new branchName>
         set up to track remote branch DecisionSupport from origin.
        $
    

Explanation :

It creates a new local branch from the remote-tracking branch. Nevertheless, it won't change your workspace.

When?? :

You fetched a new branch from the server and it's visible with "git branch -a", yet invisible with "git branch". After this command, the new branch exists in your local repository.
    
        $git checkout <branchName> 
        Switched to branch '<branchName>'
        Your branch is up-to-date with 'origin/<branchName>'.
    

Explanation :

It changes the branch of your workspace to the new branch.

When ? :

When you make up your mind to work on new branch.
    
        $git diff --name-only <commit number A> <commit number B>
    

Explanation :

It shows the files which were added, deleted or modified between commit A and commit B.

When ? :

When you are desperate to know which files were changed between commit A and commit B.
    
        $git branch -d <branch name>
    

Explanation :

It deletes the specified branch from your local branches.

When ? :

You no longer have any feelings for that branch.
    
        $git checkout <branch name>
        $git merge origin/<branch name>
    

Explanation :

It reflects modifications from the remote-tracking branch into your local branch.

When ? :

When you want to reflect modification on remote server into your local branch.
    
        $git log -1 origin/<branch name>
        $git log -1 HEAD 
    

Explanation :

The first command shows you the latest commit of the remote-tracking branch. It goes without saying that you should type 'git fetch origin' beforehand.
The second one shows the latest commit of your local branch.
Incidentally, 'HEAD' means a pointer to the local branch you're currently working on.

When ? :

When you want to know whether there is progress in the remote branch compared to your local one.

Friday, April 13, 2018

Create Histogram

Create Histogram

"Histogram" is useful when I check dataset, more specifically, relation between explanatory variable and response variable.
Hence This is somewhat of memo of the way to create "Histogram"
Here, famaous and popular dataset "iris" is gonna be used.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_iris

First of all, check what the data is like.

In [2]:
iris_dataset = load_iris()

iris_data = iris_dataset.data
iris_target = iris_dataset.target
In [3]:
iris_data.shape
Out[3]:
(150, 4)
In [4]:
iris_data[0:5]
Out[4]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])
In [5]:
# Check the variation of target data
np.unique(iris_target)
Out[5]:
array([0, 1, 2])
In [6]:
iris_target.shape
Out[6]:
(150,)
In [7]:
iris_target[0:20]
Out[7]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Create histogram

In [8]:
# create bins based on the first explanatory variable
var_0 = iris_data[:,0]
count,bins = np.histogram(var_0,bins=30)
In [9]:
# separate the samples by species: setosa, versicolour and virginica
setosa = iris_data[iris_target ==0]
versicoiour = iris_data[iris_target ==1]
virginica = iris_data[iris_target ==2]
In [10]:
plt.hist(setosa[:,0],bins=bins,alpha=0.5)
plt.hist(versicoiour[:,0],bins=bins,alpha=0.5)
plt.hist(virginica[:,0],bins=bins,alpha=0.5)
Out[10]:
(array([ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  3.,
         1.,  2.,  4.,  6.,  5.,  4.,  0.,  7.,  3.,  0.,  1.,  4.,  1.,
         0.,  1.,  4.,  1.]),
 array([ 4.3 ,  4.42,  4.54,  4.66,  4.78,  4.9 ,  5.02,  5.14,  5.26,
         5.38,  5.5 ,  5.62,  5.74,  5.86,  5.98,  6.1 ,  6.22,  6.34,
         6.46,  6.58,  6.7 ,  6.82,  6.94,  7.06,  7.18,  7.3 ,  7.42,
         7.54,  7.66,  7.78,  7.9 ]),
 <a list of 30 Patch objects>)

Consequently, we can see that it seems hard to discern the species by the first explanatory variable alone.
Now I'd like to observe all the variables.

In [11]:
fig,axes = plt.subplots(2,2,figsize=(12,12))
axes_1dim = axes.ravel()

for i in range(iris_data.shape[1]):
    count,bins = np.histogram(iris_data[:,i],bins=30)
    axes_1dim[i].hist(setosa[:,i],bins=bins,alpha=0.5)
    axes_1dim[i].hist(versicoiour[:,i],bins=bins,alpha=0.5)
    axes_1dim[i].hist(virginica[:,i],bins=bins,alpha=0.5)

According to the result above, it seems setosa can be discerned by the third or fourth explanatory variable alone :)
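
To make it easier to tell which color corresponds to which species, labels and a legend can be added to each subplot. This is just a sketch of the same plot as above with a legend; it assumes the variables defined earlier in this notebook.

fig, axes = plt.subplots(2, 2, figsize=(12, 12))
axes_1dim = axes.ravel()

species = [('setosa', setosa), ('versicolour', versicoiour), ('virginica', virginica)]

for i in range(iris_data.shape[1]):
    count, bins = np.histogram(iris_data[:, i], bins=30)
    for name, data in species:
        axes_1dim[i].hist(data[:, i], bins=bins, alpha=0.5, label=name)
    # feature_names gives the meaning of each explanatory variable
    axes_1dim[i].set_xlabel(iris_dataset.feature_names[i])
    axes_1dim[i].legend()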