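KeyError in Word2vec
I wanted to try running word2vec, following the reference URL below.
The train.py and similars.py files shown here, as well as the procedure, are all taken from that site.
Procedure
I used MeCab to split an Aozora Bunko text file into space-separated words (wakati-gaki), then trained on it with the file below.
The generated model was saved as data22.model.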
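As a sanity check on the training input, the sketch below counts whitespace-separated tokens in a one-sentence-per-line file, which is the layout the tokenized corpus is expected to have. The helper name count_tokens and the two-line sample text are made up for illustration; with the real data the path would be data22.txt. If the count for a word is 0, MeCab may have segmented it only as part of a longer token.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile
from collections import Counter


def count_tokens(path, encoding="utf-8"):
    """Count whitespace-separated tokens in a one-sentence-per-line file."""
    counts = Counter()
    with io.open(path, encoding=encoding) as f:
        for line in f:
            counts.update(line.split())
    return counts


# Hypothetical two-line sample, not taken from the real corpus.
sample = u"私 は 本 を 読む\n本 は 面白い\n"
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(sample)

counts = count_tokens(sample_path)
shutil.rmtree(tmp_dir)  # remove the temporary sample file

print(counts[u"本"])  # number of times the token appears in the sample
```

Reading the file as text with an explicit encoding means the tokens are compared as text strings, not as raw bytes.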
train.py
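# -*- coding: utf-8 -*-
from gensim.models import word2vec
import logging
import sys

# Log training progress to the console
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# Read the corpus: one sentence per line, tokens separated by whitespace
sentences = word2vec.LineSentence(sys.argv[1])

# Skip-gram (sg=1) with hierarchical softmax (hs=1, negative=0)
model = word2vec.Word2Vec(sentences,
                          sg=1,
                          size=100,
                          min_count=1,
                          window=10,
                          hs=1,
                          negative=0)
model.save(sys.argv[2])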
I ran train.py with Python and saved the resulting model under the name data22.model.
$ python train.py data22.txt data22.model
2017-04-08 01:49:31,381 : INFO : collecting all words and their counts
2017-04-08 01:49:31,382 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-08 01:49:31,389 : INFO : collected 1684 word types from a corpus of 9554 raw words and 228 sentences
2017-04-08 01:49:31,389 : INFO : Loading a fresh vocabulary
2017-04-08 01:49:31,395 : INFO : min_count=1 retains 1684 unique words (100% of original 1684, drops 0)
2017-04-08 01:49:31,395 : INFO : min_count=1 leaves 9554 word corpus (100% of original 9554, drops 0)
2017-04-08 01:49:31,405 : INFO : deleting the raw counts dictionary of 1684 items
2017-04-08 01:49:31,406 : INFO : sample=0.001 downsamples 45 most-common words
2017-04-08 01:49:31,407 : INFO : downsampling leaves estimated 5687 word corpus (59.5% of prior 9554)
2017-04-08 01:49:31,407 : INFO : estimated required memory for 1684 words and 100 dimensions: 2526000 bytes
2017-04-08 01:49:31,410 : INFO : constructing a huffman tree from 1684 words
2017-04-08 01:49:31,496 : INFO : built huffman tree with maximum node depth 13
2017-04-08 01:49:31,496 : INFO : resetting layer weights
2017-04-08 01:49:31,544 : INFO : training model with 3 workers on 1684 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=0 window=10
2017-04-08 01:49:31,544 : INFO : expecting 228 sentences, matching count from corpus used for vocabulary survey
2017-04-08 01:49:31,708 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-04-08 01:49:31,766 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-04-08 01:49:31,767 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-04-08 01:49:31,767 : INFO : training on 47770 raw words (28489 effective words) took 0.2s, 128642 effective words/s
2017-04-08 01:49:31,767 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
2017-04-08 01:49:31,767 : INFO : saving Word2Vec object under data22.model, separately None
2017-04-08 01:49:31,767 : INFO : not storing attribute syn0norm
2017-04-08 01:49:31,767 : INFO : not storing attribute cum_table
2017-04-08 01:49:31,870 : INFO : saved data22.model
I prepared a script, similars.py, that lists the words most similar to a given word.
similars.py
# -*- coding: utf-8 -*-

from gensim.models import word2vec
import sys

model = word2vec.Word2Vec.load(sys.argv[1])
results = model.most_similar(positive=sys.argv[2], topn=10)
for result in results:
    print(result[0], '\t', result[1])
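I ran similars.py on the model file created above, passing the word 「本」 ("book") as the argument. This produces the error below. It looks as if the word 「本」 given as the argument is not being recognized, but I cannot work out why.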
$ python similars.py data22.model 本
Traceback (most recent call last):
  File "similars.py", line 7, in <module>
    results = model.most_similar(positive=sys.argv[2], topn=10)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1285, in most_similar
    return self.wv.most_similar(positive, negative, topn, restrict_vocab, indexer)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/keyedvectors.py", line 97, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word '\xe6\x9c\xac' not in vocabulary"
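One thing that may be worth checking: the bytes in the error message, \xe6\x9c\xac, are the UTF-8 encoding of 「本」. Under Python 2.7, which the traceback paths suggest, command-line arguments arrive as byte strings, so the lookup key may be a byte string while the vocabulary keys are text (unicode) strings. The sketch below illustrates that mismatch with a plain dict standing in for the model vocabulary. The dict contents and the helper name to_text are assumptions for illustration, not part of gensim.

```python
# -*- coding: utf-8 -*-


def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical stand-in for the model vocabulary (text-string keys).
vocab = {u"本": 0, u"読む": 1}

# The byte string shown in the KeyError message.
raw = b"\xe6\x9c\xac"

print(raw in vocab)           # the byte string does not match the text key
print(to_text(raw) in vocab)  # the decoded text string does
```

If this is indeed the cause, decoding the argument before the lookup, for example passing to_text(sys.argv[2]) to most_similar, might be enough. This assumes the terminal encodes arguments as UTF-8.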
I would be grateful for any hints toward a solution. Thank you in advance.