Thursday, April 19, 2018

Tf-idf in Python (usage of TfidfVectorizer)

You can check the original Jupyter notebook source at the following link: https://github.com/hiroshiu12/probability_statistic/blob/master/tfidfvectorizer.ipynb

How to use TfidfVectorizer?

"tf-idf" is a numerical statistic that is intended to reflect how important a word is to a document in a corpus. 
Source: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

  • "tf" : the number of times each term occures in each document.
  • "idf" : A measure of how much imformation the word provides.

There is a really useful class in scikit-learn, TfidfVectorizer, for computing tf-idf values.
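Concretely, with scikit-learn's defaults (smooth_idf=True, norm='l2'), the value is tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1, n is the number of documents, df(t) is the number of documents containing t, and each document vector is L2-normalized at the end.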

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

Now I'm gonna compute the tf-idf values of the following corpus.

In [2]:
corpus = ['Seize the day!','Panic seized him','Police seized his device.']
In [3]:
# Call the constructor of TfidfVectorizer
tfidf_ins = TfidfVectorizer()
tfidf_ins.fit(corpus)
# You can check the vocabulary TfidfVectorizer holds.
tfidf_ins.get_feature_names()
Out[3]:
['day', 'device', 'him', 'his', 'panic', 'police', 'seize', 'seized', 'the']

Oops! The word "seize" has not been normalized... 'seize' and 'seized' should be counted as the same word.
There are various ways to normalize word forms in Python (stemming, for instance; see the sketch below). However, this time I simply specify a vocabulary in the constructor and replace 'seized' with 'seize' in the corpus.
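For reference, here is a minimal sketch of stemming-based normalization. It assumes NLTK is installed; stem_tokenizer is just an illustrative helper name, and it reuses TfidfVectorizer and corpus from above.

from nltk.stem import PorterStemmer
import re

stemmer = PorterStemmer()

def stem_tokenizer(doc):
    # Split on runs of letters and stem each token;
    # 'seize' and 'seized' should both stem to 'seize'.
    return [stemmer.stem(token) for token in re.findall(r'[a-z]+', doc.lower())]

# Passing a custom tokenizer overrides the default token pattern.
tfidf_stemmed = TfidfVectorizer(tokenizer=stem_tokenizer)
tfidf_stemmed.fit(corpus)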

In [4]:
vocabulary = ['day','device','panic','police','seize','the']
# You can specify the vocabulary in the constructor
tfidf_ins2 = TfidfVectorizer(vocabulary=vocabulary)
print('vocabulary : ',tfidf_ins2.vocabulary)

tfidf_ins2.fit([document.replace('seized','seize') for document in corpus])
tfidf_ins2.get_feature_names()
vocabulary :  ['day', 'device', 'panic', 'police', 'seize', 'the']
Out[4]:
['day', 'device', 'panic', 'police', 'seize', 'the']
In [11]:
tfidf_vect = tfidf_ins2.transform([document.replace('seized','seize') for 
                                   document in corpus])
tfidf_vect.toarray()
Out[11]:
array([[ 0.65249088,  0.        ,  0.        ,  0.        ,  0.38537163,
         0.65249088],
       [ 0.        ,  0.        ,  0.861037  ,  0.        ,  0.50854232,
         0.        ],
       [ 0.        ,  0.65249088,  0.        ,  0.65249088,  0.38537163,
         0.        ]])
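As a sanity check, the first row can be reproduced by hand. This sketch assumes scikit-learn's default settings (smooth_idf=True, sublinear_tf=False, norm='l2'); the variable names are purely illustrative.

# With smooth_idf=True: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
n_docs = 3
idf_rare = np.log((1 + n_docs) / (1 + 1)) + 1    # df = 1: 'day', 'the'
idf_common = np.log((1 + n_docs) / (1 + 3)) + 1  # df = 3: 'seize'
# Document 0 ('Seize the day!') has tf = 1 for 'day', 'seize' and 'the';
# columns follow the vocabulary ['day','device','panic','police','seize','the'].
row0 = np.array([idf_rare, 0, 0, 0, idf_common, idf_rare])
print(row0 / np.linalg.norm(row0))
# -> [ 0.65249088  0.          0.          0.          0.38537163  0.65249088]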
In [12]:
print('tfidf of "seize"\n',tfidf_vect[:,np.where(
    np.array(tfidf_ins2.get_feature_names())=='seize')[0][0]])
print('tfidf of "the"\n',tfidf_vect[:,np.where(
    np.array(tfidf_ins2.get_feature_names())=='the')[0][0]])
tfidf of "seize"
   (0, 0) 0.385371627466
  (1, 0) 0.508542320378
  (2, 0) 0.385371627466
tfidf of "the"
   (0, 0) 0.652490884513

"tfidfvectorizer" normalize each vector.
As you can see, the word 'seize' appear every documents, therefore the value of tfidf is lowest in each document.
However although the word 'the' seems to be useless to recognize semantic analysis, it has quite high value in document 1.
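Since each row is L2-normalized, the norm of every document vector should be 1; a quick check, reusing tfidf_vect from above:

print(np.linalg.norm(tfidf_vect.toarray(), axis=1))
# -> [ 1.  1.  1.]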
Hence I wanna register this word as a "stop word" :)
Actually, there is already a frozenset of stop words available as "sklearn.feature_extraction.text.ENGLISH_STOP_WORDS".

In [7]:
from sklearn.feature_extraction import text
In [8]:
print('Number of preset ENGLISH_STOP_WORDS : ',
      len(text.ENGLISH_STOP_WORDS))
# 'the' is already in "ENGLISH_STOP_WORDS"
'the' in text.ENGLISH_STOP_WORDS
Number of preset ENGLISH_STOP_WORDS :  318
Out[8]:
True

When I wrote this cell, I noticed that if you specify "vocabulary" in the constructor, it seems to take priority over stop words.
As a matter of fact, "the" was not excluded from the feature names when the vocabulary was specified!
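Here is a minimal sketch to confirm this; tfidf_both is just an illustrative name, and it reuses vocabulary, text and corpus from above.

# Even with stop_words given, the fixed vocabulary determines the
# feature names, so 'the' still appears as a feature.
tfidf_both = TfidfVectorizer(vocabulary=vocabulary,
                             stop_words=text.ENGLISH_STOP_WORDS)
tfidf_both.fit([document.replace('seized', 'seize') for document in corpus])
print('the' in tfidf_both.get_feature_names())  # -> True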

In [9]:
# Specify stop words in the constructor.
# Note: if "vocabulary" were also specified, it would take priority
# over stop words, so here we only pass stop_words.
tfidf_ins_stop = TfidfVectorizer(analyzer='word',
                                 stop_words=text.ENGLISH_STOP_WORDS)
print('vocabulary : ',tfidf_ins_stop.vocabulary)

tfidf_ins_stop.fit([document.replace('seized','seize') 
                    for document in corpus])
tfidf_ins_stop.get_feature_names()
vocabulary :  None
Out[9]:
['day', 'device', 'panic', 'police', 'seize']
In [10]:
tfidf_vect = tfidf_ins_stop.transform([document.replace('seized','seize') 
                                       for document in corpus])
tfidf_vect.toarray()
Out[10]:
array([[ 0.861037  ,  0.        ,  0.        ,  0.        ,  0.50854232],
       [ 0.        ,  0.        ,  0.861037  ,  0.        ,  0.50854232],
       [ 0.        ,  0.65249088,  0.        ,  0.65249088,  0.38537163]])

Now the vectors are much easier to interpret.
