Thursday, February 28, 2019

Introduction to LSI (Latent Semantic Indexing)

In this article, I will share about basics of LSI (Latent Semantic Indexing). I will be glad if you enjoy this article :)

0. Data

Let's say we have data as bag of words representation as following. We can tell doc_1 and doc_2 are somehow similar, doc_3 and doc_4 have similar character. Actually, the former topic is "tennis" and the latter is "soccer".

In [1]:
import numpy as np
import pandas as pd
import numpy.linalg as LA
np.set_printoptions(precision=2)
In [2]:
bow = np.array([[1,1,0,0,0],
                         [1,0,1,0,0],
                         [0,0,0,1,1],
                         [0,0,0,0,1]])
pd.DataFrame(bow,
             columns=['heat','Osaka','Naomi','Japan','PK'],
             index=['doc_1','doc_2','doc_3','doc_4'])
Out[2]:
heat Osaka Naomi Japan PK
doc_1 1 1 0 0 0
doc_2 1 0 1 0 0
doc_3 0 0 0 1 1
doc_4 0 0 0 0 1

1. Singular value decomposition

Apply singlar value decomposition.

In [3]:
U, sigma, V = LA.svd(bow,full_matrices=True)
print('Shape of U :', U.shape)
print('U\n',U)
print('Shape of sigma :',sigma.shape)
print('sigma\n',sigma)
print('Shape of V :',V.T.shape)
print('V\n',V.T)
Shape of U : (4, 4)
U
 [[ 0.71  0.   -0.71  0.  ]
 [ 0.71  0.    0.71  0.  ]
 [ 0.    0.85  0.   -0.53]
 [ 0.    0.53  0.    0.85]]
Shape of sigma : (4,)
sigma
 [1.73 1.62 1.   0.62]
Shape of V : (5, 5)
V
 [[ 8.16e-01  6.41e-17  3.33e-16  4.81e-17  5.77e-01]
 [ 4.08e-01 -6.41e-17 -7.07e-01 -4.81e-17 -5.77e-01]
 [ 4.08e-01 -6.41e-17  7.07e-01 -4.81e-17 -5.77e-01]
 [-0.00e+00  5.26e-01  0.00e+00 -8.51e-01  4.44e-16]
 [ 0.00e+00  8.51e-01  0.00e+00  5.26e-01 -1.11e-16]]

Actually, the number of singular value is only 4, therefore 5th colum of $V$ doesn't do anything. We can confirm by composing original bow without 5th column.

In [4]:
U @ np.diag(sigma) @ V[0:4]
Out[4]:
array([[ 1.00e+00,  1.00e+00, -2.90e-16,  0.00e+00,  0.00e+00],
       [ 1.00e+00, -6.84e-17,  1.00e+00,  0.00e+00,  0.00e+00],
       [ 7.26e-17, -7.26e-17, -7.26e-17,  1.00e+00,  1.00e+00],
       [ 7.98e-17, -7.98e-17, -7.98e-17, -1.02e-16,  1.00e+00]])

So, let's say $W = U\Sigma$, $H = V^T$, we can decompose bow as $X = WH$.

2. Analysis

In [5]:
W = U @ np.diag(sigma)
H = V[:-1]
print('W = \n',W)
print('H = \n',H)
W = 
 [[ 1.22  0.   -0.71  0.  ]
 [ 1.22  0.    0.71  0.  ]
 [ 0.    1.38  0.   -0.32]
 [ 0.    0.85  0.    0.53]]
H = 
 [[ 8.16e-01  4.08e-01  4.08e-01 -0.00e+00  0.00e+00]
 [ 6.41e-17 -6.41e-17 -6.41e-17  5.26e-01  8.51e-01]
 [ 3.33e-16 -7.07e-01  7.07e-01  0.00e+00  0.00e+00]
 [ 4.81e-17 -4.81e-17 -4.81e-17 -8.51e-01  5.26e-01]]

Let $X$ be the matrix represents original bow matrix. The row vector $x_i$ of $X$ can be depicted as below.

$$x_i = \Sigma_j w_{ij}h_j \tag{1}$$

where $h_j$ is row vector of $H$. From equation (1), we can see each row vector of $H$ as orthonormal basis. And row vector of $W$ are seen as coefficient. Now we can take a look at the result of matrix decomposition. It seems that the first row vector of $V$ represent the "tennis" and second row vector imply "soccer". Hence first row vector of $W$, which represents coefficient of first row vector of $X$, has large number on $w_{11}$, in contrast, $w_{12}$ has ZERO coefficient. As for third row vector of $W$, which represents coefficient of third row vector of $X$, has ZERO on $w_{31}$ whereas $w_{32} has positive number.

3. Rank Deduction

For the next, let me try to deduct rank of matrix.

In [25]:
print('original rank :', LA.matrix_rank(bow))
sigma_rank3 = np.diag(np.concatenate([sigma[:3],np.array([0])]))
print('new sigma (1 sigular value deleted):\n',sigma_rank3)
print('Compose bow with 1 rank deduction\n',U@sigma_rank3@V[0:4] )
sigma_rank2 = np.diag(np.concatenate([sigma[:2],np.array([0,0])]))
print('new sigma (2 sigular value deleted):\n',sigma_rank2)
print('Compose bow with 2 rank deduction\n',U@sigma_rank2@V[0:4] )
original rank : 4
new sigma (1 sigular value deleted):
 [[1.73 0.   0.   0.  ]
 [0.   1.62 0.   0.  ]
 [0.   0.   1.   0.  ]
 [0.   0.   0.   0.  ]]
Compose bow with 1 rank deduction
 [[ 1.00e+00  1.00e+00 -2.90e-16  0.00e+00  0.00e+00]
 [ 1.00e+00 -6.84e-17  1.00e+00  0.00e+00  0.00e+00]
 [ 8.82e-17 -8.82e-17 -8.82e-17  7.24e-01  1.17e+00]
 [ 5.45e-17 -5.45e-17 -5.45e-17  4.47e-01  7.24e-01]]
new sigma (2 sigular value deleted):
 [[1.73 0.   0.   0.  ]
 [0.   1.62 0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]]
Compose bow with 2 rank deduction
 [[ 1.00e+00  5.00e-01  5.00e-01  0.00e+00  0.00e+00]
 [ 1.00e+00  5.00e-01  5.00e-01  0.00e+00  0.00e+00]
 [ 8.82e-17 -8.82e-17 -8.82e-17  7.24e-01  1.17e+00]
 [ 5.45e-17 -5.45e-17 -5.45e-17  4.47e-01  7.24e-01]]

So far, I ducted 2 rank from original matrix. As for last matrix, interestingly first row and second row became same! Bow representation was abstracted by deducting rank of matrix, hence difference of "Osaka" and "Naomi" seemed as similar word. As for "soccer" topic, there was evident difference in original sentence of "Japan", however the last matrix has more vague distinction. These mean somehow we extracte similar words and find topic from bow:) Now we can deduct rank to 1 as following.

In [28]:
sigma_rank1 = np.diag(np.concatenate([sigma[:1],np.array([0,0,0])]))
print('new sigma (1 sigular value deleted):\n',sigma_rank1)
print('Compose bow with 1 rank deduction\n',U@sigma_rank1@V[0:4] )
new sigma (1 sigular value deleted):
 [[1.73 0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]]
Compose bow with 1 rank deduction
 [[1.  0.5 0.5 0.  0. ]
 [1.  0.5 0.5 0.  0. ]
 [0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0. ]]

Just as you expected, first and second row vector became same, and third and forth vector are also identical. It implies there is two topic amond 4 sentence.

4. Dimension Deduction

Sometimes you might wanna reduce dimension instead of rank, in that situation, you can multiply matrix of basis from left side as following.

In [37]:
print('original bow :\n', bow)
print('1 dimention deduction : \n',bow@V[:4,:].T)
print('2 dimention deduction : \n',bow@V[:3,:].T)
print('3 dimention deduction : \n',bow@V[:2,:].T)
print('4 dimention deduction : \n',bow@V[:1,:].T)
original bow :
 [[1 1 0 0 0]
 [1 0 1 0 0]
 [0 0 0 1 1]
 [0 0 0 0 1]]
1 dimention deduction : 
 [[ 1.22e+00  0.00e+00 -7.07e-01  0.00e+00]
 [ 1.22e+00  1.23e-32  7.07e-01  1.23e-32]
 [ 0.00e+00  1.38e+00  0.00e+00 -3.25e-01]
 [ 0.00e+00  8.51e-01  0.00e+00  5.26e-01]]
2 dimention deduction : 
 [[ 1.22e+00  0.00e+00 -7.07e-01]
 [ 1.22e+00  1.23e-32  7.07e-01]
 [ 0.00e+00  1.38e+00  0.00e+00]
 [ 0.00e+00  8.51e-01  0.00e+00]]
3 dimention deduction : 
 [[1.22e+00 0.00e+00]
 [1.22e+00 1.23e-32]
 [0.00e+00 1.38e+00]
 [0.00e+00 8.51e-01]]
4 dimention deduction : 
 [[1.22]
 [1.22]
 [0.  ]
 [0.  ]]