Tuesday, April 10, 2018

How to use gensim to obtain a "bag-of-words" representation?

This is a short memo on obtaining a "bag-of-words" representation with the gensim library.
You can access the original code at the following link: https://github.com/hiroshiu12/probability_statistic/blob/master/How_to_use_gensim.ipynb


This notebook is just practice of https://radimrehurek.com/gensim/tut1.html, and a reminder for me :)

In [36]:
from gensim import corpora
from gensim import matutils
import re
In [15]:
corpus = ["Membership of the club has dwindled from 70 to 20",
         "They tried to buffer themselves against problems and unvertainties",
         "I don't want to be just a cog in the wheel anymore"]

1. First of all, you have to tokenize the corpus.

In [16]:
def simple_tokenizer(corpus):
    """
    Parameter :
    -------------------
    corpus : list of sentences.
    """
    token_list = []
    for sentence in corpus:
        token_list.append(sentence.split(' '))

    return token_list
In [17]:
token_list = simple_tokenizer(corpus)
print(token_list)
[['Membership', 'of', 'the', 'club', 'has', 'dwindled', 'from', '70', 'to', '20'], 
['They', 'tried', 'to', 'buffer', 'themselves', 'against', 'problems', 'and', 'unvertainties'],
 ['I', "don't", 'want', 'to', 'be', 'just', 'a', 'cog', 'in', 'the', 'wheel', 'anymore']]
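Splitting on spaces keeps punctuation attached and is case-sensitive ("Membership" and "membership" would get different ids). A slightly more robust tokenizer, sketched here with only the standard library (the name `normalized_tokenizer` is just for illustration), lowercases the text and drops punctuation:

```python
import re

def normalized_tokenizer(corpus):
    """Lowercase each sentence and split on runs of word characters.

    [\w']+ keeps letters, digits, underscores and apostrophes
    (so "don't" stays one token) and discards punctuation.
    """
    token_list = []
    for sentence in corpus:
        token_list.append(re.findall(r"[\w']+", sentence.lower()))
    return token_list
```

For real projects, gensim also ships its own `gensim.utils.simple_preprocess` helper, which does similar normalization.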

2. Next, you have to create a dictionary.

A "dictionary" is a mapping between tokens and their ids.

In [18]:
dictionary = corpora.Dictionary(token_list)
# You can check the mapping by calling the 'token2id' attribute.
dictionary.token2id
Out[18]:
{'20': 9,
 '70': 7,
 'I': 18,
 'Membership': 0,
 'They': 10,
 'a': 23,
 'against': 14,
 'and': 16,
 'anymore': 27,
 'be': 21,
 'buffer': 12,
 'club': 3,
 'cog': 24,
 "don't": 19,
 'dwindled': 5,
 'from': 6,
 'has': 4,
 'in': 25,
 'just': 22,
 'of': 1,
 'problems': 15,
 'the': 2,
 'themselves': 13,
 'to': 8,
 'tried': 11,
 'unvertainties': 17,
 'want': 20,
 'wheel': 26}

Note :
You can mechanically filter words out with the 'filter_extremes' and 'filter_n_most_frequent' methods.
Or you can filter specific words out with 'filter_tokens'.
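To see what document-frequency filtering of this kind amounts to, here is a rough plain-Python sketch (not gensim's actual implementation; the parameter names `no_below` and `no_above` are borrowed from `filter_extremes` for familiarity):

```python
from collections import Counter

def filter_extremes_sketch(token_list, no_below=2, no_above=0.5):
    """Keep only tokens that appear in at least `no_below` documents
    and in at most `no_above` (a fraction) of all documents."""
    n_docs = len(token_list)
    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for doc in token_list for tok in set(doc))
    keep = {tok for tok, cnt in df.items()
            if cnt >= no_below and cnt / n_docs <= no_above}
    return [[tok for tok in doc if tok in keep] for doc in token_list]
```

Tokens that are too rare carry little statistical signal, and tokens in nearly every document (like stop words) carry little discriminative signal; this filter drops both ends.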

As an example, let's filter numeric tokens out of the dictionary. They sometimes disrupt machine learning or clustering models.

In [26]:
# You must know the ids of the words you want to omit.
regular_exp = re.compile(r'\d+')
ids_list = []
for word, number in dictionary.token2id.items():
    if regular_exp.match(word):
        ids_list.append(number)

print('The ids you want to filter out are : ', ids_list)
The ids you want to filter out are :  [7, 9]

Caution ! :

  1. The "filter_tokens" method modifies the dictionary's mapping in place.
  2. It reassigns the ids of the remaining tokens.
In [27]:
dictionary.filter_tokens(bad_ids=ids_list)
In [29]:
# ids were changed.
dictionary.token2id
Out[29]:
{'I': 16,
 'Membership': 0,
 'They': 8,
 'a': 21,
 'against': 12,
 'and': 14,
 'anymore': 25,
 'be': 19,
 'buffer': 10,
 'club': 3,
 'cog': 22,
 "don't": 17,
 'dwindled': 5,
 'from': 6,
 'has': 4,
 'in': 23,
 'just': 20,
 'of': 1,
 'problems': 13,
 'the': 2,
 'themselves': 11,
 'to': 7,
 'tried': 9,
 'unvertainties': 15,
 'want': 18,
 'wheel': 24}

3. Now it's time to create the bag-of-words representation as sparse vectors.

Each entry of a sparse vector is a tuple of (word id, occurrence count).

In [35]:
sparse_vector = [dictionary.doc2bow(tokens) for tokens in token_list] 
sparse_vector
Out[35]:
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1)],
 [(2, 1),
  (7, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1)]]
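Conceptually, doc2bow just counts each token the dictionary knows about and silently drops out-of-vocabulary tokens. A minimal sketch of that behavior (the helper name `doc2bow_sketch` is hypothetical, not part of gensim):

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count the known tokens in a document and return sorted
    (word id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())
```

Note that because unseen words are dropped, two different documents can map to the same sparse vector if they differ only in out-of-vocabulary tokens.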

4. gensim has a useful utility to make dense vectors.

You can simply use matutils.corpus2dense to obtain dense vectors; however, don't forget to pass the number of dimensions, since the dimensionality cannot be deduced from the sparse vectors alone.

In [40]:
dense_vector= matutils.corpus2dense(sparse_vector,num_terms=len(dictionary.token2id))
dense_vector
Out[40]:
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  1.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 0.,  0.,  1.]], dtype=float32)

However, for machine learning, clustering, and so on, we mostly use the transposed version of that dense matrix, with documents as rows and terms as columns.
Therefore the transposed matrix is more useful :)

In [42]:
dense_vector.T
Out[42]:
array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,
         1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]], dtype=float32)
