Tuesday, August 7, 2018

Basic usage of CaboCha in Python

When working on *natural language processing* of Japanese text, I believe *dependency structure analysis* is one of the crucial approaches. For Japanese, *CaboCha* is one of the most widely used tools, and naturally a *binding* for Python is provided. In this article, I will share some basic tips on how to use it.

1. Fundamental usage

First of all, I'd like to write down the fundamental usage of CaboCha in Python. In the "CABOCHA_FORMAT_TREE" format, the position of each 'D' indicates which chunk depends on which other chunk.

In [1]:
import CaboCha
In [2]:
# Instantiate CaboCha.Parser class
cap = CaboCha.Parser()
# Parse objective sentence
tree = cap.parse('久々に新しいmacを買った。')
In [3]:
# You can check the dependency in "CABOCHA_FORMAT_TREE".
print(tree.toString(CaboCha.CABOCHA_FORMAT_TREE))
  久々に-----D
    新しい-D |
       macを-D
      買った。
EOS

In [4]:
# You can get dependency, chunk and token in xml format.
print(tree.toString(CaboCha.CABOCHA_FORMAT_XML))
<sentence>
 <chunk id="0" link="3" rel="D" score="-1.640429" head="0" func="1">
  <tok id="0" feature="名詞,一般,*,*,*,*,久々,ヒサビサ,ヒサビサ">久々</tok>
  <tok id="1" feature="助詞,格助詞,一般,*,*,*,に,ニ,ニ">に</tok>
 </chunk>
 <chunk id="1" link="2" rel="D" score="1.466958" head="2" func="2">
  <tok id="2" feature="形容詞,自立,*,*,形容詞・イ段,基本形,新しい,アタラシイ,アタラシイ">新しい</tok>
 </chunk>
 <chunk id="2" link="3" rel="D" score="-1.640429" head="3" func="4">
  <tok id="3" feature="名詞,一般,*,*,*,*,*">mac</tok>
  <tok id="4" feature="助詞,格助詞,一般,*,*,*,を,ヲ,ヲ">を</tok>
 </chunk>
 <chunk id="3" link="-1" rel="D" score="0.000000" head="5" func="6">
  <tok id="5" feature="動詞,自立,*,*,五段・ワ行促音便,連用タ接続,買う,カッ,カッ">買っ</tok>
  <tok id="6" feature="助動詞,*,*,*,特殊・タ,基本形,た,タ,タ">た</tok>
  <tok id="7" feature="記号,句点,*,*,*,*,。,。,。">。</tok>
 </chunk>
</sentence>
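Since this output is just well-formed XML, you can also post-process it with Python's standard library instead of the CaboCha API. The following is a minimal sketch with xml.etree.ElementTree, run on the output shown above:

```python
import xml.etree.ElementTree as ET

# The XML produced by CaboCha for the sample sentence above.
xml_result = '''<sentence>
 <chunk id="0" link="3" rel="D" score="-1.640429" head="0" func="1">
  <tok id="0" feature="名詞,一般,*,*,*,*,久々,ヒサビサ,ヒサビサ">久々</tok>
  <tok id="1" feature="助詞,格助詞,一般,*,*,*,に,ニ,ニ">に</tok>
 </chunk>
 <chunk id="1" link="2" rel="D" score="1.466958" head="2" func="2">
  <tok id="2" feature="形容詞,自立,*,*,形容詞・イ段,基本形,新しい,アタラシイ,アタラシイ">新しい</tok>
 </chunk>
 <chunk id="2" link="3" rel="D" score="-1.640429" head="3" func="4">
  <tok id="3" feature="名詞,一般,*,*,*,*,*">mac</tok>
  <tok id="4" feature="助詞,格助詞,一般,*,*,*,を,ヲ,ヲ">を</tok>
 </chunk>
 <chunk id="3" link="-1" rel="D" score="0.000000" head="5" func="6">
  <tok id="5" feature="動詞,自立,*,*,五段・ワ行促音便,連用タ接続,買う,カッ,カッ">買っ</tok>
  <tok id="6" feature="助動詞,*,*,*,特殊・タ,基本形,た,タ,タ">た</tok>
  <tok id="7" feature="記号,句点,*,*,*,*,。,。,。">。</tok>
 </chunk>
</sentence>'''

root = ET.fromstring(xml_result)
# For each chunk, join the token surfaces and note which chunk it links to.
chunks = [(''.join(tok.text for tok in chunk.findall('tok')),
           int(chunk.get('link')))
          for chunk in root.findall('chunk')]
for surface, link in chunks:
    print(surface, '->', link)
```

This recovers the same dependency structure as the tree view: chunk 0 and chunk 2 both link to chunk 3, and the final chunk links to -1 (no head).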

The following are some useful attributes that CaboCha offers.

In [5]:
# the number of chunks
print('The number of chunks : ', tree.chunk_size())
# the number of tokens
print('The number of tokens : ', tree.token_size())
# 3rd token in the sentence
print('3rd token in the sentence : ', tree.token(2).surface)
# which chunk the 3rd chunk depends on
print('3rd chunk relates to ', tree.chunk(2).link)
# the number of tokens the 3rd chunk includes
print('The number of tokens the 3rd chunk includes is ',
      tree.chunk(2).token_size)
The number of chunks :  4
The number of tokens :  8
3rd token in the sentence :  新しい
3rd chunk relates to  3
The number of tokens the 3rd chunk includes is  2
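A note on the feature attribute seen in the XML above: each token carries a comma-separated, MeCab-style feature string. Assuming the common IPADIC field order, field 0 is the part of speech and field 6 is the base form (the field used later in this article). A quick illustration on a feature string copied from the XML output:

```python
# A feature string taken from the XML output above
# (field order assumed to be IPADIC-style).
feature = '形容詞,自立,*,*,形容詞・イ段,基本形,新しい,アタラシイ,アタラシイ'
fields = feature.split(',')
pos, base_form, reading = fields[0], fields[6], fields[7]
print(pos, base_form, reading)  # → 形容詞 新しい アタラシイ
```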

2. A useful function for dependency structure analysis

In practice, we don't want to use the attributes above every time. I'd like to share an example of a function that extracts the dependency analysis from a sentence. It may look somewhat clumsy, but I hope you find it useful.
First of all, let's consider extracting the dependencies and chunks from a sentence.

In [6]:
def dep_ana(sentence, alltoken=True):
    """
    Return the result of dependency analysis
    """
    tokens, chunks = tok_chu_ana(sentence)

    depend_rel = {}
    chunk_list = []
    score_list = []
    num = 0

    for i in range(len(chunks)):
        # store the dependency target and its score
        depend_rel[i] = chunks[i].link
        score_list.append(chunks[i].score)
        temp_chunk = ''

        for _j in range(chunks[i].token_size):
            # when alltoken is False, keep only the token that starts the chunk
            if tokens[num].chunk is not None or alltoken:
                temp_chunk += tokens[num].feature.split(',')[6]
            num = num + 1
        chunk_list.append(temp_chunk)

    return depend_rel, chunk_list, score_list
    
def tok_chu_ana(sentence):
    """
    Return tokens and chunks that sentence contains
    """
    cap = CaboCha.Parser()
    tree = cap.parse(sentence)
    
    tokens = [tree.token(i) for i in range(tree.token_size())]
    chunks = [tree.chunk(i) for i in range(tree.chunk_size())]
    
    return tokens, chunks

The following shows how to use this function.

In [7]:
dependency, chunks, scores = dep_ana('久々に新しい鉛筆を買った。',
                                     alltoken=True)
print('dependency : ', dependency)
print('chunks : ', chunks)
print('score : ', scores)
dependency :  {0: 3, 1: 2, 2: 3, 3: -1}
chunks :  ['久々に', '新しい', '鉛筆を', '買うた。']
score :  [-1.6404287815093994, 1.4669578075408936, -1.6404287815093994, 0.0]

Note that the chunks are built from each token's base form (the 7th feature field), which is why '買った。' appears as '買うた。' above. You can also keep only the first token of each chunk by setting 'alltoken' to False.

In [8]:
dep_ana('久々に新しい鉛筆を買った。',alltoken=False)
Out[8]:
({0: 3, 1: 2, 2: 3, 3: -1},
 ['久々', '新しい', '鉛筆', '買う'],
 [-1.6404287815093994, 1.4669578075408936, -1.6404287815093994, 0.0])

Now let's extract the dependency pairs whose scores exceed a threshold.

In [9]:
def extract_dep(result_of_analysis, threshold=0):

    # list to contain the result of dependency analysis
    depend_list = []

    for depends, chunks, scores in result_of_analysis:
        temp_depend = []

        for key, score in zip(depends.keys(), scores):
            # keep only pairs whose dependency score exceeds the threshold
            if score > threshold:
                temp_depend.append('{}...{}'.format(chunks[key],
                                                    chunks[depends[key]]))

        depend_list.append(temp_depend)

    return depend_list
In [10]:
sentences = ['久々に新しい鉛筆を買った。', '白い猫が横切った。']
result_of_analysis = [dep_ana(sentence, alltoken=False)
                      for sentence in sentences]
extract_dep(result_of_analysis)
extract_dep(result_of_analysis)
Out[10]:
[['新しい...鉛筆'], ['白い...猫', '猫...横切る']]
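As a small follow-up, the pairs returned by extract_dep can be aggregated over many sentences, for instance with collections.Counter when you want the most frequent modifier-head pairs in a corpus. A sketch using the output above:

```python
from collections import Counter

# Dependency pairs returned by extract_dep for the two sample sentences above.
pairs_per_sentence = [['新しい...鉛筆'], ['白い...猫', '猫...横切る']]

# Flatten the per-sentence lists and count each pair.
pair_counts = Counter(pair for pairs in pairs_per_sentence for pair in pairs)
print(pair_counts.most_common())
```

On a real corpus this gives a quick picture of which dependency pairs occur repeatedly.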
