Monday, April 23, 2018

How to add a user dictionary to MeCab

When analyzing Japanese text, one of the biggest differences from English is the need for morphological analysis. For me, "Natural Language Processing (NLP)", "Topic Model" and "MeCab" therefore go hand in hand. Almost every time, I need a user dictionary for MeCab.
For instance, the word "M&A" is split into "M", "&" and "A", as shown below.

$ echo 'M&A' | mecab
M 名詞,固有名詞,組織,*,*,*,*
& 名詞,サ変接続,*,*,*,*,*
A 名詞,固有名詞,組織,*,*,*,*
EOS
$ 

However, "M&A" should be treated as a single word. So, as a reminder to myself, I'm going to write down how to add a user dictionary to MeCab.

1. Create csv file

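First, you need to prepare a dictionary file in CSV format. The format is as follows.
surface form,left context ID,right context ID,cost,part of speech,POS subcategory 1,POS subcategory 2,POS subcategory 3,conjugation type,conjugation form,base form,reading,pronunciation
Source: MeCab: 単語の追加方法 (How to add words)
The surface form and the cost seem to be the minimum required fields. As for the cost, a smaller number means higher priority, so a word with a cost of 1 is given very high priority.
As a first attempt, I made the following user dictionary.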

$ cat ./dic_M\&A.csv 
M&A,,,1,名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
$ 
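Before building the dictionary, it can help to check that every entry really has the 13 columns listed in the format above. The following is only a minimal sketch: the scratch file is hypothetical, the entry is written without a trailing comma, and the check assumes that no field contains a quoted comma.

```shell
# Hypothetical scratch file holding one entry in the 13-column layout.
csv=$(mktemp)
printf '%s\n' 'M&A,,,1,名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー' > "$csv"

# Collect every line whose comma-separated field count is not 13.
bad=$(awk -F',' 'NF != 13 { print NR ": " NF " fields" }' "$csv")

if [ -z "$bad" ]; then
  echo "all entries have 13 fields"
else
  echo "$bad"
fi
```

An entry with a trailing comma, like the one in my file above, would be reported as having 14 fields by this check.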

2. Create dictionary

First, create a directory for the user dictionary.

$ ls /usr/local/lib/mecab/dic/
ipadic
$ mkdir /usr/local/lib/mecab/dic/userdic
$ ls /usr/local/lib/mecab/dic/
ipadic  userdic

Now you can build the dictionary with the "mecab-dict-index" command, as shown below.

$ /usr/local/Cellar/mecab/0.996/libexec/mecab/mecab-dict-index -d /usr/local/lib/mecab/dic/ipadic/ -u /usr/local/lib/mecab/dic/userdic/user.dic -f utf-8 -t utf-8 "dic_M&A.csv"
reading dic_M&A.csv ... 1
emitting double-array: 100% |###########################################| 

done!
$ ls /usr/local/lib/mecab/dic/userdic/user.dic 
/usr/local/lib/mecab/dic/userdic/user.dic
$ cat /usr/local/lib/mecab/dic/userdic/user.dic 
H?q?f$$?
Gutf-8))R????R&名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,$ 
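The path to mecab-dict-index above is specific to my Homebrew installation. As a rough sketch, the locations can instead be looked up with the mecab-config helper that is installed with MeCab. This assumes mecab-config is on your PATH and that the system dictionary is the ipadic directory under the reported dictionary directory; the sketch only prints the paths and does not build anything.

```shell
# Assumption: mecab-config is on PATH; otherwise fall back to a full path.
if command -v mecab-config >/dev/null 2>&1; then
  # Directory that contains mecab-dict-index.
  indexer="$(mecab-config --libexecdir)/mecab-dict-index"
  # Assumed location of the ipadic system dictionary.
  sysdic="$(mecab-config --dicdir)/ipadic"
  echo "indexer: $indexer"
  echo "system dictionary: $sysdic"
else
  echo "mecab-config not found; use the full path to mecab-dict-index instead"
fi
```

The two values can then replace the hard-coded paths passed to mecab-dict-index and its -d option.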

3. Register user dictionary to mecabrc

The user dictionary does not take effect until it is registered in mecabrc.

$ cp -p /usr/local/etc/mecabrc /usr/local/etc/mecabrc.20180423
$ vi /usr/local/etc/mecabrc
$ diff /usr/local/etc/mecabrc /usr/local/etc/mecabrc.20180423 
9d8
< userdic = /usr/local/lib/mecab/dic/userdic/user.dic
$ 
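If you prefer not to edit the file by hand, the same line can be appended from the shell. The sketch below works on a scratch copy rather than the real mecabrc, and the sample content of that copy is made up for illustration; point rc at your own mecabrc only after taking a backup as above.

```shell
# Scratch stand-in for mecabrc (hypothetical content).
rc=$(mktemp)
printf 'dicdir = /usr/local/lib/mecab/dic/ipadic\n' > "$rc"
userdic=/usr/local/lib/mecab/dic/userdic/user.dic

# Append the userdic line only if no active userdic entry exists yet.
if ! grep -q '^userdic[[:space:]]*=' "$rc"; then
  printf 'userdic = %s\n' "$userdic" >> "$rc"
fi

cat "$rc"
```

Running the snippet twice against the same file leaves a single userdic line, because the grep guard skips the append when an entry is already present.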

4. Check the behavior

Now it is time to check how MeCab behaves with the user dictionary applied.

$ echo 'M&A' | mecab
M&A 名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
EOS
$ 

Postscript

The user dictionary can also be specified with the -u option at run time.

$ echo 'M&A' | mecab -u /usr/local/lib/mecab/dic/userdic/user.dic
M&A 名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
EOS
$ 

You can specify multiple user dictionaries by joining them with a comma ",", as follows.

userdic = /usr/local/lib/mecab/dic/userdic/user.dic,/usr/local/lib/mecab/dic/userdic/user2.dic
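When there are many dictionaries, the comma-separated value can be generated rather than typed. This is just a sketch: it uses a temporary directory with two empty placeholder files instead of real dictionaries, and it assumes the paths contain no commas or newlines.

```shell
# Hypothetical directory with two placeholder dictionary files.
userdic_dir=$(mktemp -d)
touch "$userdic_dir/user.dic" "$userdic_dir/user2.dic"

# Join every *.dic path with commas to form the mecabrc line.
userdic_line="userdic = $(printf '%s\n' "$userdic_dir"/*.dic | paste -sd, -)"
echo "$userdic_line"
```

The order of the paths follows the shell's glob sorting, which depends on the locale.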
