For instance, the word "M&A" is split into "M", "&", and "A", as shown below.
$ echo 'M&A' | mecab
M 名詞,固有名詞,組織,*,*,*,*
& 名詞,サ変接続,*,*,*,*,*
A 名詞,固有名詞,組織,*,*,*,*
EOS
$
However, it should be treated as a single word.
So, as a reminder to myself, I'm writing down how to add a user dictionary to MeCab.
1. Create csv file
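First, you need to prepare the dictionary as a CSV file. The format is as follows.
表層形,左文脈ID,右文脈ID,コスト,品詞,品詞細分類1,品詞細分類2,品詞細分類3,活用型,活用形,原形,読み,発音
(surface form, left context ID, right context ID, cost, part of speech, POS subcategory 1, POS subcategory 2, POS subcategory 3, conjugation type, conjugation form, base form, reading, pronunciation)
source : MeCab: 単語の追加方法
"表層形" (surface form) and "コスト" (cost) seem to be required at minimum. As for "コスト", a smaller number gets higher priority, so a word whose cost is 1 is strongly preferred.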
For now, I made the user dictionary as follows.
$ cat ./dic_M\&A.csv
M&A,,,1,名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
$
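A row in this layout can also be produced by a small helper instead of being typed by hand. The following is only a sketch: make_row is a hypothetical function name, the subcategory is fixed to 一般, the reading is reused as the pronunciation, and surfaces containing a comma are not handled. It emits exactly the 13 columns of the format line, so it omits the trailing comma of the hand-written row above.

```shell
# make_row SURFACE COST POS READING
# Hypothetical helper: prints one user-dictionary row in the 13-column layout.
make_row() {
  local surface="$1" cost="$2" pos="$3" reading="$4"
  # Columns 2 and 3 (context IDs) stay empty, as in the hand-written row.
  # The reading is reused for the pronunciation column.
  printf '%s,,,%s,%s,一般,*,*,*,*,%s,%s,%s\n' \
    "$surface" "$cost" "$pos" "$surface" "$reading" "$reading"
}

make_row 'M&A' 1 名詞 エムアンドエー
# prints: M&A,,,1,名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー
```

Redirecting the output to a file (for example `make_row ... > 'dic_M&A.csv'`) gives a CSV that can be passed to the next step.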
2. Create dictionary
First of all, create a directory for the user dictionary.
$ ls /usr/local/lib/mecab/dic/
ipadic
$ mkdir /usr/local/lib/mecab/dic/userdic
$ ls /usr/local/lib/mecab/dic/
ipadic userdic
Now you can create the dictionary with the "mecab-dict-index" command as below.
$ /usr/local/Cellar/mecab/0.996/libexec/mecab/mecab-dict-index -d /usr/local/lib/mecab/dic/ipadic/ -u /usr/local/lib/mecab/dic/userdic/user.dic -f utf-8 -t utf-8 "dic_M&A.csv"
reading dic_M&A.csv ... 1
emitting double-array: 100% |###########################################|
done!
$ ls /usr/local/lib/mecab/dic/userdic/user.dic
/usr/local/lib/mecab/dic/userdic/user.dic
$ cat /usr/local/lib/mecab/dic/userdic/user.dic
H?q?f$$?
Gutf-8))R????R&名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
$
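The path to mecab-dict-index used above is specific to one Homebrew install. As a rough sketch, the paths can be looked up with mecab-config instead of being hard-coded. build_userdic is a hypothetical function name, and MECAB_LIBEXEC_DIR and MECAB_SYSDIC are made-up override variables, not MeCab settings. The sketch also assumes the system dictionary is ipadic under the directory reported by `mecab-config --dicdir`.

```shell
# build_userdic CSV OUTPUT_DIC
# Hypothetical wrapper around mecab-dict-index. MECAB_LIBEXEC_DIR and
# MECAB_SYSDIC are made-up override variables, not MeCab settings.
build_userdic() {
  local csv="$1" out="$2"
  # Ask mecab-config for the install paths unless they were overridden.
  local libexec="${MECAB_LIBEXEC_DIR:-$(mecab-config --libexecdir)}"
  local sysdic="${MECAB_SYSDIC:-$(mecab-config --dicdir)/ipadic}"
  "$libexec/mecab-dict-index" -d "$sysdic" -u "$out" -f utf-8 -t utf-8 "$csv"
}

# Example (needs MeCab and ipadic installed):
# build_userdic 'dic_M&A.csv' /usr/local/lib/mecab/dic/userdic/user.dic
```

Quoting the CSV name matters here, because an unquoted "&" would be interpreted by the shell.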
3. Register the user dictionary in mecabrc
The user dictionary is not used by default until it is registered in mecabrc.
$ cp -p /usr/local/etc/mecabrc /usr/local/etc/mecabrc.20180423
$ vi /usr/local/etc/mecabrc
$ diff /usr/local/etc/mecabrc /usr/local/etc/mecabrc.20180423
9d8
< userdic = /usr/local/lib/mecab/dic/userdic/user.dic
$
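The same edit can be scripted instead of done in vi. This is a minimal sketch, not the only way to do it: register_userdic is a hypothetical function name, and it appends the entry at the end of the file rather than at line 9 as in the diff above, on the assumption that the position of a key inside mecabrc does not matter.

```shell
# register_userdic MECABRC DIC_PATH
# Hypothetical helper: back up MECABRC, then append a userdic entry.
register_userdic() {
  local rc="$1" dic="$2"
  # Refuse to add a second active entry; commented-out lines (";") are ignored.
  if grep -q '^[[:space:]]*userdic[[:space:]]*=' "$rc"; then
    echo "userdic is already set in $rc" >&2
    return 1
  fi
  cp -p "$rc" "$rc.$(date +%Y%m%d)" || return 1  # dated backup, as above
  printf 'userdic = %s\n' "$dic" >> "$rc"
}

# Example:
# register_userdic /usr/local/etc/mecabrc /usr/local/lib/mecab/dic/userdic/user.dic
```

Refusing to touch a file that already has an active userdic entry keeps the helper from silently creating two conflicting lines.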
4. Check the behavior
Now it's time to check how MeCab behaves with the user dictionary applied.
$ echo 'M&A' | mecab
M&A 名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
EOS
$
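To make this check repeatable, the number of tokens can be counted rather than read off by eye. The sketch below assumes MeCab's default output format, one token per line followed by an EOS line, as in the transcripts here. count_tokens is a hypothetical function name.

```shell
# count_tokens: reads mecab's default output on stdin and prints the number
# of token lines, i.e. every line except the EOS marker.
count_tokens() {
  awk '$0 != "EOS" { n++ } END { print n + 0 }'
}

# Example (needs MeCab installed):
# echo 'M&A' | mecab | count_tokens
```

Going by the transcripts in this post, the example would print 3 before the user dictionary is registered and 1 afterwards.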
Postscript
A user dictionary can also be specified with the -u option at run time.
$ echo 'M&A' | mecab -u /usr/local/lib/mecab/dic/userdic/user.dic
M&A 名詞,一般,*,*,*,*,M&A,エムアンドエー,エムアンドエー,
EOS
$
You can specify multiple user dictionaries by joining them with commas ",", as follows.
userdic = /usr/local/lib/mecab/dic/userdic/user.dic,/usr/local/lib/mecab/dic/userdic/user2.dic
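When the list of dictionaries is built by a script, the comma-joined value can be generated rather than typed. A minimal sketch follows; join_userdics is a hypothetical function name and the paths in the example are placeholders.

```shell
# join_userdics PATH...
# Hypothetical helper: joins its arguments with commas.
join_userdics() {
  local IFS=,
  # With IFS set to ",", "$*" expands to the arguments joined by commas.
  printf '%s\n' "$*"
}

# Example:
# printf 'userdic = %s\n' "$(join_userdics /path/to/user.dic /path/to/user2.dic)"
```

The helper does not check that the files exist, so a mistyped path would only show up when MeCab is run.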