Introduction

This article shows how to use the gensim module to train, save, and load Word2Vec distributed representations (word embeddings).

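Training word embeddings

This section explains how to learn embeddings from raw data. Specifically, it uses the gensim.models.word2vec.Word2Vec class. The input is a list of word lists: each sentence is a list of tokens, and the corpus is a list of such sentences.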

from gensim.models import word2vec

sample_sents = [['this', 'is', 'a', 'first', 'sentence', '.'],
                ['this', 'is', 'a', 'second', 'sentence', '.']]
# size= is the gensim 3.x parameter name; gensim 4.0 renamed it to vector_size=
model = word2vec.Word2Vec(sentences=sample_sents, size=100, window=5, min_count=1)

After this runs, model holds the training result. It is an object of type <class 'gensim.models.word2vec.Word2Vec'>.
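As a minimal sketch of how raw text could be turned into the list-of-word-lists input described above: the `tokenize` helper and its regular expressions are illustrative assumptions, not part of gensim, and a real corpus would usually need a proper tokenizer.

```python
import re

def tokenize(text):
    """Split raw text into sentences, and each sentence into lowercase tokens.

    Hypothetical helper for illustration only (not a gensim function).
    """
    sentences = []
    # Naive sentence split: break at whitespace that follows '.', '!' or '?'.
    for raw in re.split(r'(?<=[.!?])\s+', text.strip()):
        # A token is a run of letters/digits/apostrophes, or one punctuation mark.
        tokens = re.findall(r"[a-z0-9']+|[^\sa-z0-9']", raw.lower())
        if tokens:
            sentences.append(tokens)
    return sentences

sample_sents = tokenize("This is a first sentence. This is a second sentence.")
print(sample_sents)
```

The resulting `sample_sents` has the same shape as the hand-written list above, so it could be passed as `sentences=`.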

Options

The commonly used options of word2vec.Word2Vec() are listed below.

Official reference: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec

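Option	Description	Default
sentences=	The source corpus, given as a list of word lists.
corpus_file=	Path to a corpus file, used when reading the corpus from disk. The file holds one sentence per line, with words separated by whitespace.
size=	Dimensionality of the embeddings. This is the gensim 3.x name; gensim 4.0 renamed it to vector_size, which is why the current reference lists vector_size. Use the name that matches the installed version.	100
window=	Width of the context window used during training.	5
min_count=	Minimum frequency a word needs to receive an embedding. With 1, every word gets one.	5
workers=	Number of threads used for training.	3
sg=	Training algorithm: 1 for skip-gram, 0 for CBOW.	0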
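The file layout expected by `corpus_file=` (one sentence per line, whitespace-separated words) can be sketched as follows. The file name `sample_corpus.txt` and the temporary directory are assumptions made only for this example.

```python
import os
import tempfile

sample_sents = [['this', 'is', 'a', 'first', 'sentence', '.'],
                ['this', 'is', 'a', 'second', 'sentence', '.']]

# Hypothetical file location, chosen only for this example.
corpus_path = os.path.join(tempfile.mkdtemp(), 'sample_corpus.txt')

# Write one sentence per line, with tokens separated by a single space.
with open(corpus_path, 'w', encoding='utf-8') as f:
    for tokens in sample_sents:
        f.write(' '.join(tokens) + '\n')

# Splitting each line on whitespace recovers the original token lists.
with open(corpus_path, encoding='utf-8') as f:
    restored = [line.split() for line in f]

print(restored == sample_sents)
```

A file written this way could then be passed as `corpus_file=corpus_path` in place of `sentences=`.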

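Working with trained embeddings

This section briefly covers what <class 'gensim.models.word2vec.Word2Vec'> can do. The examples use embeddings obtained with the code below. (In practice, size should be around 100 to 300, and the corpus should contain far more sentences.)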

from gensim.models import word2vec

sample_sents = [['this', 'is', 'a', 'first', 'sentence', '.'],
                ['this', 'is', 'a', 'second', 'sentence', '.']]
model = word2vec.Word2Vec(sentences=sample_sents, size=3, window=5, min_count=1)

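Get the embedding of a word.
.wv is a Word2VecKeyedVectors object (named KeyedVectors from gensim 4.0 onward) that can be used like a dictionary with words as keys and embeddings as values.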

print(model.wv['this'])
# [ 0.12677142 -0.07538117 -0.13080813]

Get the similarity (cosine similarity) between two words.

print(model.wv.similarity('first', 'second'))
# -0.7543343
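The similarity between two word vectors is their cosine similarity, which ranges from -1 to 1, so negative values like the one above are possible. A minimal pure-Python sketch of that computation follows; the `cosine_similarity` function and the toy vectors are assumptions for illustration, not gensim code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, not taken from a trained model.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```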

Get the top $topn$ words most similar to a given word. The return value is a list of (word, similarity) tuples.

n = 5
print(model.wv.most_similar('this', topn=n))
# [('is', 0.8868916034698486),
#  ('second', 0.8849490880966187),
#  ('sentence', 0.6720788478851318),
#  ('first', 0.5845127105712891),
#  ('.', 0.3697856068611145)]

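Adding and subtracting word vectors.
This is the familiar king - man + woman = queen kind of computation. Pass the words of the positive terms to positive= and the words of the negative terms to negative=. topn= returns the top $topn$ results.

print(model.wv.most_similar(positive=['this', 'first'], negative=['second'], topn=1))
# [('is', 0.15704883635044098)]
# i.e. this - second + first

# For the king example, this can be written as
# most_similar(positive=['king', 'woman'], negative=['man'])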
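To make the positive/negative arithmetic concrete, here is a rough pure-Python sketch. It assumes the query is built by adding the unit-normalized positive vectors, subtracting the unit-normalized negative ones, and ranking the remaining words by cosine similarity, with the input words excluded. The `analogy` helper and the two-dimensional toy vectors are hypothetical, and gensim's actual implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the input words
    scored = [(w, sum(a * b for a, b in zip(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy vectors: axis 0 is "royalty", axis 1 is "male (+) / female (-)".
toy = {
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.0],
}
result = analogy(toy, positive=['king', 'woman'], negative=['man'])
print(result)
```

With these toy vectors the top-ranked word is 'queen'; on a real model the result depends entirely on the training data.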

Saving embeddings

Trained embeddings can be saved with .wv.save_word2vec_format(output file path).

from gensim.models import word2vec
from gensim.models import KeyedVectors

sample_sents = [['this', 'is', 'a', 'first', 'sentence', '.'],
                ['this', 'is', 'a', 'second', 'sentence', '.']]
model = word2vec.Word2Vec(sample_sents, size=3, window=5, min_count=1)
model.wv.save_word2vec_format('sample_word2vec.txt')

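The saved file looks like the following. The first line holds the vocabulary size and the embedding dimension. Each line after that holds one word followed by its vector.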

7 3
this 0.12709168 -0.11746123 -0.1590661
is -0.10325706 0.14546975 -0.10878078
a 0.0123018725 0.104428194 -0.069693
sentence 0.16237356 -0.07644377 0.16515312
. 0.09359499 0.12543988 -0.01799449
first -0.019889886 -0.077862106 0.13868278
second 0.060134348 0.029044067 0.03352099
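Because the text format is this simple, it can also be read without gensim. The sketch below parses the file contents shown above into a dictionary; `parse_word2vec_text` is a hypothetical helper written for this example, and it assumes that words contain no spaces.

```python
def parse_word2vec_text(text):
    """Parse word2vec text format into a {word: [float, ...]} dictionary."""
    lines = text.strip().splitlines()
    # Header line: vocabulary size and embedding dimension.
    vocab_size, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        word, *values = line.rstrip().split(' ')
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}")
        vectors[word] = [float(x) for x in values]
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of entries")
    return vectors

sample = """7 3
this 0.12709168 -0.11746123 -0.1590661
is -0.10325706 0.14546975 -0.10878078
a 0.0123018725 0.104428194 -0.069693
sentence 0.16237356 -0.07644377 0.16515312
. 0.09359499 0.12543988 -0.01799449
first -0.019889886 -0.077862106 0.13868278
second 0.060134348 0.029044067 0.03352099
"""
vectors = parse_word2vec_text(sample)
print(len(vectors), vectors['this'])
```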

In general, embeddings are often saved and published as binary files with binary=True, because the files are smaller.

model.wv.save_word2vec_format('sample_word2vec.bin', binary=True)
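The size saving comes from storing each value as a 4-byte float instead of a decimal string. As a rough sketch of the layout, assuming a text header line followed by, for each word, the word, a space, and the raw little-endian float32 values: the two helpers below are hypothetical and written only for illustration, and for real files load_word2vec_format should be used.

```python
import io
import struct

def write_word2vec_binary(vectors, dim):
    """Serialize {word: [float, ...]} into the assumed binary layout."""
    buf = io.BytesIO()
    buf.write(f"{len(vectors)} {dim}\n".encode('utf-8'))  # text header
    for word, vec in vectors.items():
        buf.write(word.encode('utf-8') + b' ')
        buf.write(struct.pack(f'<{dim}f', *vec))  # dim float32 values
    return buf.getvalue()

def read_word2vec_binary(data):
    """Inverse of write_word2vec_binary."""
    buf = io.BytesIO(data)
    vocab_size, dim = (int(x) for x in buf.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = buf.read(1)
            if ch == b'':
                raise EOFError("unexpected end of data")
            if ch == b' ':
                break
            if ch != b'\n':  # tolerate newlines between entries
                word.extend(ch)
        values = struct.unpack(f'<{dim}f', buf.read(4 * dim))
        vectors[word.decode('utf-8')] = list(values)
    return vectors

# Toy values that float32 represents exactly, so the round trip is lossless.
toy = {'this': [0.5, -0.25, 1.0], 'is': [0.125, 2.0, -4.0]}
data = write_word2vec_binary(toy, dim=3)
print(len(data), read_word2vec_binary(data) == toy)
```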

Loading embeddings

Embeddings saved with .wv.save_word2vec_format() can be loaded with KeyedVectors.load_word2vec_format(file path).

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('sample_word2vec.txt')
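# load_word2vec_format() returns a KeyedVectors object, so index it directly (no .wv)
print(model['this'])
# prints the vector stored for 'this' in sample_word2vec.txt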

To load a binary file, specify binary=True, just as when saving.

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('sample_word2vec.bin', binary=True)

For example, GoogleNews-vectors-negative300, a widely used set of pretrained embeddings, is distributed as a binary file, so it is loaded with binary=True.

Conclusion

This article showed how to use the gensim module to train, save, and load Word2Vec word embeddings.

gotutiyan’s blog

[python] A careful walkthrough of training, saving, and loading word embeddings with the gensim module