1. Fundamental usage¶
First of all, I'd like to write down the fundamental usage of CaboCha in Python. In the "CABOCHA_FORMAT_TREE" output, the position of each 'D' indicates which chunk depends on which other chunk.
import CaboCha
# Instantiate the CaboCha.Parser class
cap = CaboCha.Parser()
# Parse the target sentence
tree = cap.parse('久々に新しいmacを買った。')
# You can check the dependency in "CABOCHA_FORMAT_TREE".
print(tree.toString(CaboCha.CABOCHA_FORMAT_TREE))
# You can get the dependencies, chunks and tokens in XML format.
print(tree.toString(CaboCha.CABOCHA_FORMAT_XML))
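Since the XML output is just a string, you can also process it programmatically. Below is a minimal sketch, assuming the output uses the usual <chunk> and <tok> elements with an id and link attribute on each chunk; it prints, for every chunk, its surface form and the id of the chunk it depends on.
import xml.etree.ElementTree as ET

xml_str = tree.toString(CaboCha.CABOCHA_FORMAT_XML)
root = ET.fromstring(xml_str)
for chunk in root.iter('chunk'):
    # concatenate the surfaces of the tokens inside this chunk
    surface = ''.join(tok.text for tok in chunk.iter('tok'))
    # 'link' is the id of the chunk this chunk depends on
    print(chunk.get('id'), surface, '->', chunk.get('link'))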
The following are some of the useful attributes and methods CaboCha offers.
# the number of chunks
print('The number of chunks : ', tree.chunk_size())
# the number of tokens
print('The number of tokens : ', tree.token_size())
# the 3rd token in the sentence
print('3rd token in the sentence : ', tree.token(2).surface)
# the chunk that the 3rd chunk depends on
print('3rd chunk relates to ', tree.chunk(2).link)
# the number of tokens the 3rd chunk includes
print('The number of tokens the 3rd chunk includes is ',
      tree.chunk(2).token_size)
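Putting these attributes together, here is a small sketch that prints every chunk's surface form and the index of the chunk it depends on. It assumes chunk.token_pos gives the index of the chunk's first token, which is how the SWIG bindings I have used behave.
import CaboCha

cap = CaboCha.Parser()
tree = cap.parse('久々に新しいmacを買った。')
for i in range(tree.chunk_size()):
    chunk = tree.chunk(i)
    # join the surfaces of the tokens belonging to this chunk
    surface = ''.join(tree.token(chunk.token_pos + j).surface
                      for j in range(chunk.token_size))
    # chunk.link is -1 for the last chunk, which depends on nothing
    print(i, surface, '->', chunk.link)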
2. A useful function for retrieving the dependency structure¶
In practice, we don't want to look up these attributes one by one every time. I'd like to share an example of a function which extracts the dependency analysis from a sentence. It may look somewhat clumsy, but I would be glad if you get something out of it.
First of all, let's think about extracting the dependencies and chunks from a sentence.
def dep_ana(sentence, alltoken=True):
    """
    Return the result of the dependency analysis.
    """
    tokens, chunks = tok_chu_ana(sentence)
    depend_rel = {}
    chunk_list = []
    score_list = []
    num = 0
    for i in range(len(chunks)):
        # store which chunk the i-th chunk depends on, and its score
        depend_rel[i] = chunks[i].link
        score_list.append(chunks[i].score)
        temp_chunk = ''
        for _j in range(chunks[i].token_size):
            # token.chunk is not None only for the first token of a chunk
            if tokens[num].chunk is not None or alltoken:
                # the 7th field of the feature string is the base form
                temp_chunk += tokens[num].feature.split(',')[6]
            num += 1
        chunk_list.append(temp_chunk)
    return depend_rel, chunk_list, score_list
def tok_chu_ana(sentence):
    """
    Return the tokens and chunks that the sentence contains.
    """
    cap = CaboCha.Parser()
    tree = cap.parse(sentence)
    tokens = [tree.token(i) for i in range(tree.token_size())]
    chunks = [tree.chunk(i) for i in range(tree.chunk_size())]
    return tokens, chunks
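Note that tok_chu_ana builds a new CaboCha.Parser on every call. If you analyse many sentences, you may prefer to create the parser once and pass it in; the variant below is just a sketch of that idea (the function name and parameter are my own).
def tok_chu_ana_with(sentence, parser):
    """Same as tok_chu_ana, but reuses an existing CaboCha.Parser."""
    tree = parser.parse(sentence)
    tokens = [tree.token(i) for i in range(tree.token_size())]
    chunks = [tree.chunk(i) for i in range(tree.chunk_size())]
    return tokens, chunks

cap = CaboCha.Parser()
tokens, chunks = tok_chu_ana_with('白い猫が横切った。', cap)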
The following shows how to use dep_ana.
dependency, chunks, scores = dep_ana('久々に新しい鉛筆を買った。',
                                     alltoken=True)
print('dependency : ', dependency)
print('chunks : ', chunks)
print('score : ', scores)
You can also get only the first token of each chunk by setting 'alltoken' to False.
dep_ana('久々に新しい鉛筆を買った。', alltoken=False)
Now we'd like to extract the dependency pairs from the result of the analysis.
def extract_dep(result_of_analysis, threshold=0):
    """
    Extract the chunk pairs whose dependency score exceeds the threshold.
    """
    # list to contain the result of the dependency analysis
    depend_list = []
    for depends, chunks, scores in result_of_analysis:
        temp_depend = []
        for key, score in zip(depends.keys(), scores):
            if score > threshold:
                temp_depend.append('{}...{}'.format(chunks[key],
                                                    chunks[depends[key]]))
        depend_list.append(temp_depend)
    return depend_list
sentences = ['久々に新しい鉛筆を買った。', '白い猫が横切った。']
result_of_analysis = [dep_ana(sentence, alltoken=False)
                      for sentence in sentences]
extract_dep(result_of_analysis)
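If you only want the dependencies the parser is most confident about, you can raise the threshold. The scale of the scores depends on the model, so the value below is only an illustration.
extract_dep(result_of_analysis, threshold=1.0)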