You can access the original code via the following link: https://github.com/hiroshiu12/probability_statistic/blob/master/How_to_use_gensim.ipynb
How to use gensim to obtain a "bag-of-words" representation?
This notebook is just practice based on https://radimrehurek.com/gensim/tut1.html, and a reminder for me :)
from gensim import corpora
from gensim import matutils
import re
corpus = ["Membership of the club has dwindled from 70 to 20",
"They tried to buffer themselves against problems and unvertainties",
"I don't want to be just a cog in the wheel anymore"]
1. First of all, you have to tokenize the corpus.
def simple_tokenizer(corpus):
    """
    Parameter :
    -------------------
    corpus : list of sentences (strings).
    """
    token_list = []
    for sentence in corpus:
        # Naive whitespace tokenization.
        token_list.append(sentence.split(' '))
    return token_list
token_list = simple_tokenizer(corpus)
print(token_list)
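By the way, if you do not want to write a tokenizer yourself, gensim ships a simple one. The sketch below uses gensim.utils.simple_preprocess, which lowercases the text and, as far as I know, keeps only alphabetic tokens of moderate length, so the numbers '70' and '20' would be dropped; that is why the plain whitespace split is used for the rest of this notebook.
from gensim.utils import simple_preprocess

# Alternative tokenization: lowercased, alphabetic tokens only.
token_list_alt = [simple_preprocess(sentence) for sentence in corpus]
print(token_list_alt)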
2. Next, you have to create a dictionary.
A "dictionary" is a mapping between tokens and ids.
dictionary = corpora.Dictionary(token_list)
# You can check the mapping through the 'token2id' attribute.
dictionary.token2id
Note :
You can mechanically filter some words out with the 'filter_extremes' and 'filter_n_most_frequent' methods.
Or you can specifically filter particular words out with 'filter_tokens'.
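A quick sketch of the mechanical filters (the thresholds are just assumptions for illustration; the calls are left commented out so that the ids used in the 'filter_tokens' example below stay unchanged):
# Illustrative thresholds only -- not executed here, so the ids below stay valid.
# Keep tokens appearing in at least 1 document and in no more than 80% of documents.
# dictionary.filter_extremes(no_below=1, no_above=0.8)
# Or drop the 2 most frequent tokens across the whole corpus.
# dictionary.filter_n_most_frequent(2)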
As an example of 'filter_tokens', let's filter numeric words out of the dictionary, since they sometimes disrupt machine learning models or clustering.
# You must know the ids of the words you want to omit.
regular_exp = re.compile(r'\d+')
ids_list = []
for word, number in dictionary.token2id.items():
    # Collect the ids of purely numeric tokens.
    if regular_exp.match(word):
        ids_list.append(number)
print('The ids you want to filter out are : ', ids_list)
Caution !
- The "filter_tokens" method modifies the mapping in place.
- It changes the ids in the dictionary.
dictionary.filter_tokens(bad_ids=ids_list)
# ids were changed.
dictionary.token2id
3. Now it is time to create the bag-of-words representation as sparse vectors.
Each element of a sparse vector is a tuple of (word id, number of occurrences).
sparse_vector = [dictionary.doc2bow(tokens) for tokens in token_list]
sparse_vector
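Note that doc2bow only counts tokens that are already in the dictionary; unseen words are silently ignored (unless you pass allow_update=True). A quick sketch with a made-up sentence:
# 'bicycle' is not in the dictionary, so it is simply ignored.
new_doc = "They don't want the club anymore bicycle".split(' ')
print(dictionary.doc2bow(new_doc))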
4. gensim has a useful utility to make dense vectors.
You can simply use matutils.corpus2dense to obtain a dense vector; just don't forget to tell it the number of terms, since the dimensionality cannot be deduced from the sparse vectors.
dense_vector = matutils.corpus2dense(sparse_vector, num_terms=len(dictionary.token2id))
dense_vector
However, for machine learning, clustering and so on, we mostly use the transposed version of that dense matrix (one row per document).
Therefore the transposed form is usually more useful :)
dense_vector.T
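For instance, the transposed matrix can be fed directly into scikit-learn. The clustering below is only a sketch; scikit-learn and the choice of 2 clusters are my own assumptions, not part of the gensim tutorial.
from sklearn.cluster import KMeans

# Illustrative only: cluster the three documents on their bag-of-words rows.
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(dense_vector.T)
print(labels)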