How to use TfidfVectorizer?
"tf-idf" is a numerical statistic that is intended to reflect how important a word is to a document in a corpus.
source : https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- "tf" : the number of times each term occurs in each document.
- "idf" : a measure of how much information the word provides, i.e. how rare it is across the documents.
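To make the two quantities concrete, here is a minimal sketch that computes them by hand. The pre-tokenized `docs` list is toy data chosen for illustration, and the idf formula used is the smoothed variant that scikit-learn documents for its default `smooth_idf=True` setting; other definitions of idf exist.

```python
import math
from collections import Counter

# Pre-tokenized toy documents (illustrative data only)
docs = [['seize', 'the', 'day'],
        ['panic', 'seize', 'him'],
        ['police', 'seize', 'his', 'device']]

# tf: raw count of each term in each document
tf = [Counter(doc) for doc in docs]

# df: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed idf (scikit-learn's documented default):
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
n = len(docs)
idf = {term: math.log((1 + n) / (1 + count)) + 1
       for term, count in df.items()}

print('tf of "seize" in document 1 :', tf[0]['seize'])
print('idf of "seize" :', round(idf['seize'], 3))
print('idf of "day"   :', round(idf['day'], 3))
```

A term that appears in every document, like "seize" here, gets the smallest possible idf, while a term that appears in only one document gets a larger one.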
There is a really useful class, "TfidfVectorizer", in scikit-learn to compute tf-idf values.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
Now I'm gonna compute the "tf-idf" values of the following corpus.
corpus = ['Seize the day!','Panic seized him','Police seized his device.']
# Call the constructor of TfidfVectorizer
tfidf_ins = TfidfVectorizer()
tfidf_ins.fit(corpus)
# You can check the "vocabulary" that TfidfVectorizer holds.
# Note: get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() (available since 1.0) replaces it.
tfidf_ins.get_feature_names_out()
Oops! The word "seize" is not normalized... The tf and idf of 'seize' and 'seized' should be computed as if they were the same word.
I think there are various ways to normalize words in Python. However, this time, I specify the vocabulary in the constructor of the vectorizer.
vocabulary = ['day','device','panic','police','seize','the']
# You can specify the vocabulary in the constructor
tfidf_ins2 = TfidfVectorizer(vocabulary=vocabulary)
print('vocabulary : ',tfidf_ins2.vocabulary)
tfidf_ins2.fit([document.replace('seized','seize') for document in corpus])
tfidf_ins2.get_feature_names_out()
tfidf_vect = tfidf_ins2.transform([document.replace('seized','seize') for
document in corpus])
tfidf_vect.toarray()
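The manual `replace` call above can also be folded into the vectorizer itself. The sketch below assumes the `preprocessor` parameter of `TfidfVectorizer`, which accepts a callable that is applied to each document before tokenizing; the function name `normalize` and the variable `tfidf_norm` are mine. A custom preprocessor replaces the built-in lowercasing step, so the function lowercases the text itself. This is still a hand-written rule for one word, not real stemming or lemmatization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Seize the day!', 'Panic seized him', 'Police seized his device.']

def normalize(document):
    # The custom preprocessor takes over the default lowercasing,
    # so lowercase first, then apply the hand-written rule.
    return document.lower().replace('seized', 'seize')

tfidf_norm = TfidfVectorizer(preprocessor=normalize)
tfidf_norm.fit(corpus)

# vocabulary_ maps each learned term to its column index
print(sorted(tfidf_norm.vocabulary_))
```

With this approach the same normalization is applied in both `fit` and `transform`, so the two calls cannot get out of sync.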
# Look up the column index of each word, then print that column
print('tfidf of "seize"\n', tfidf_vect[:, np.where(
    np.array(tfidf_ins2.get_feature_names_out()) == 'seize')[0][0]])
print('tfidf of "the"\n', tfidf_vect[:, np.where(
    np.array(tfidf_ins2.get_feature_names_out()) == 'the')[0][0]])
"TfidfVectorizer" normalizes each document vector (to unit L2 norm by default).
As you can see, the word 'seize' appears in every document, therefore its tf-idf value is the lowest in each document.
However, although the word 'the' seems useless for semantic analysis, it has quite a high value in document 1.
Hence I wanna register this word as a "stop word" :)
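The numbers above can be reproduced by hand, which shows where the high value of 'the' comes from. This is a minimal sketch assuming the default settings (`norm='l2'`, `smooth_idf=True`, raw counts as tf); the lowercase `docs` list stands in for the corpus after the 'seized' replacement, and the names `vec`, `counts`, `raw` and `manual` are mine.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['seize the day', 'panic seize him', 'police seize his device']
vocabulary = ['day', 'device', 'panic', 'police', 'seize', 'the']

vec = TfidfVectorizer(vocabulary=vocabulary)
X = vec.fit_transform(docs).toarray()

# Every row should have Euclidean length 1 under the default norm='l2'
print('row norms :', np.linalg.norm(X, axis=1))

# Rebuild document 1 by hand: tf * idf, then divide by the L2 norm.
# 'day', 'seize' and 'the' each occur once in document 1.
counts = np.array([1, 0, 0, 0, 1, 1])
raw = counts * vec.idf_
manual = raw / np.linalg.norm(raw)

print('manual  :', manual.round(4))
print('sklearn :', X[0].round(4))
print('match   :', np.allclose(manual, X[0]))
```

Because 'the' occurs in only one document, its idf is as large as that of 'day', so after normalization both end up well above 'seize' in document 1.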
Actually, there is already a frozenset of stop words in "sklearn.feature_extraction.text.ENGLISH_STOP_WORDS".
from sklearn.feature_extraction import text
print('Number of preset "ENGLISH_STOP_WORDS" : ',
      len(text.ENGLISH_STOP_WORDS))
# 'the' is already in "ENGLISH_STOP_WORDS"
'the' in text.ENGLISH_STOP_WORDS
When I wrote this cell, I noticed that if you specify "vocabulary" in the constructor, it seems to take priority over the stop words.
As a matter of fact, "the" was not excluded from the features when the vocabulary was specified!
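To probe that observation, here is a small sketch that sets both options at once. The lowercase `docs` list and the names `both` and `the_col` are mine. As far as I understand the implementation, the fixed vocabulary decides which columns exist, while stop words are filtered out during tokenization, so the 'the' column is kept but should contain only zeros. Treat this as an assumption to check against your installed version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['seize the day', 'panic seize him', 'police seize his device']
vocabulary = ['day', 'device', 'panic', 'police', 'seize', 'the']

# Fixed vocabulary and the built-in English stop word list together
both = TfidfVectorizer(vocabulary=vocabulary, stop_words='english')
X = both.fit_transform(docs).toarray()

the_col = both.vocabulary_['the']
print('"the" is still a feature :', 'the' in both.vocabulary_)
print('values in the "the" column :', X[:, the_col])
```

So dropping "the" from the features entirely means leaving it out of the vocabulary, or not passing a vocabulary at all, as in the next cell.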
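# Specify stop words in the constructor.
# If you also specify "vocabulary", it seems to take priority:
# "the" was not excluded from the features when the vocabulary
# was specified, so no vocabulary is passed here.
# Recent scikit-learn versions expect 'english', a list, or None,
# so the frozenset is converted to a list.
tfidf_ins_stop = TfidfVectorizer(analyzer='word',
                                 stop_words=list(text.ENGLISH_STOP_WORDS))
# No vocabulary was passed to the constructor, so this prints None
print('vocabulary : ', tfidf_ins_stop.vocabulary)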
tfidf_ins_stop.fit([document.replace('seized','seize')
for document in corpus])
tfidf_ins_stop.get_feature_names_out()
tfidf_vect = tfidf_ins_stop.transform([document.replace('seized','seize')
for document in corpus])
tfidf_vect.toarray()
Now the vectors are much easier to interpret.