The code for Improve word representation learning with sememes(ACL2017).
Using the following command to train word-sense-sememe embeddings.
cp SSA.c[SSA.c/MST.c/SAC.c/SAT.c] word2vec/word2vec.c
cd word2vec
make
./word2vec -train TrainFile -output vectors.bin -cbow 0 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 30 -binary 1 -iter 1 -read-vocab VocabFile -read-meaning SememeFile -read-sense Word_Sense_Sememe_File -min-count 1
TrainFile
is train data set. VocabFile
is the word vocabulary file, and SememeFile
is the sememe vocabulary file. Word_Sense_Sememe_File
is a file recording group information of word-sense-sememe.
Before training, you should replace word2vec/word2vec.c
with one of the four files SSA.c/MST.c/SAC.c/SAT.c
.
HowNet.txt
is an Chinese knowledge base with annotated word-sense-sememe information.
Sougo-T(sample).txt
is a sample dataset extracted from Sougo-T
.
wordsim-240.txt
and wordsim-297.txt
in this files are utilized to evaluate the quality of word representations.
analogy.txt
in this file is utilized to evaluate models' capability of word analogy inference.
The annotation information is for the four files SSA.c/MST.c/SAC.c/SAT.c
. Annotation of the common code is only included in file SSA.c
.
Because of the size of train data is too large, I do not upload it. In the future, some eclectic measure will be conducted.