dalinvip / cw2vec

cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

License: Apache License 2.0

CMake 0.70% Shell 0.46% C++ 98.33% C 0.51%
word2vec cw2vec embeddings fasttext stroke-information

Introduction

Paper Link: cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

Paper Detail Summary: cw2vec理论及其实现 (cw2vec: theory and implementation)

Requirements

cmake version 3.10.0-rc5
make GNU Make 4.1
gcc version 5.4.0

Run Demo

  • A prebuilt word2vec binary is included in cw2vec/word2vec/bin, and run.sh has been rewritten for a quick test; you can run run.sh directly.

  • To recompile and run the other models, follow Building cw2vec using cmake below, then see the Example use cases.

Building cw2vec using cmake

git clone git@github.com:bamtercelboo/cw2vec.git
cd cw2vec && cd word2vec && cd build
cmake ..
make
cd ../bin

This will create the word2vec binary and also all relevant libraries.

Example use cases

The repo implements not only cw2vec (named substoke) but also the skipgram and cbow models of word2vec; fastText's skipgram is implemented as well (named subword).

Replace train.txt and feature.txt with your own training corpus and feature file.

skipgram: ./word2vec skipgram -input train.txt -output skipgram_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -thread 8 -t 1e-4 -lrUpdateRate 100  

cbow:     ./word2vec cbow -input train.txt -output cbow_out -lr 0.05 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -thread 8 -t 1e-4 -lrUpdateRate 100

subword:  ./word2vec subword -input train.txt -output subword_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -minn 3 -maxn 6 -thread 8 -t 1e-4 -lrUpdateRate 100

substoke: ./word2vec substoke -input train.txt -infeature feature.txt -output substoke_out -lr 0.025 -dim 100 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -minn 3 -maxn 18 -thread 8 -t 1e-4 -lrUpdateRate 100
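As a quick sanity check after training, the output vectors can be inspected from Python. This is a minimal sketch that assumes the output follows the usual word2vec text format (a "<vocab_size> <dim>" header line, then one "<word> <v1> ... <vdim>" line per word); the exact file name and extension depend on what was passed to -output.

```python
# Hedged sketch: load embeddings, assuming the standard word2vec text format.
import numpy as np

def load_vectors(path):
    """Read a word2vec-style text file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip("\n").split(" ")
            # parts[0] is the word; the next `dim` fields are the components
            vectors[parts[0]] = np.asarray(parts[1:dim + 1], dtype=np.float32)
    return vectors
```

For example, load_vectors("skipgram_out.vec") would return a dict mapping each vocabulary word to a 100-dimensional array under the settings above.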

Get Chinese stroke features

The substoke model needs Chinese stroke features (-infeature). I have written a script, extract_zh_char_stoke, that extracts stroke information for Chinese characters from handian; see its readme for details.

I have also uploaded a stroke feature file for Simplified Chinese covering a total of 20901 Chinese characters; it is in the Simplified_Chinese_Feature folder. Alternatively, you can use the script above to generate it yourself.

The feature file (feature.txt) looks like this:

中 丨フ一丨
国 丨フ一一丨一丶一
庆 丶一ノ一ノ丶
假 ノ丨フ一丨一一フ一フ丶
期 一丨丨一一一ノ丶ノフ一一
香 ノ一丨ノ丶丨フ一一
江 丶丶一一丨一
将 丶一丨ノフ丶一丨丶
涌 丶丶一フ丶丨フ一一丨
入 ノ丶
人 ノ丶
潮 丶丶一一丨丨フ一一一丨ノフ一一
......

A sample feature file for testing is provided at sample/substoke_feature.txt.
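To illustrate what the substoke model consumes, here is a hedged Python sketch of the stroke n-gram idea from the paper: a word's character stroke strings are concatenated and sliced into n-grams between -minn and -maxn strokes long. The function names are illustrative, not part of this repo.

```python
# Illustrative sketch of stroke n-gram extraction (not the repo's C++ code).

def load_stroke_features(path):
    """Parse a feature file with lines of the form '<char> <strokes>'."""
    features = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                features[parts[0]] = parts[1]
    return features

def stroke_ngrams(word, features, minn=3, maxn=18):
    """Concatenate per-character strokes, then enumerate n-grams of length minn..maxn."""
    strokes = "".join(features.get(ch, "") for ch in word)
    return [strokes[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(strokes) - n + 1)]
```

For example, the character 中 with strokes 丨フ一丨 yields the 3-grams 丨フ一 and フ一丨 plus the 4-gram 丨フ一丨.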

Substoke model output embeddings

  • In the paper, the context word embeddings are used directly as the final word vectors. However, following the idea of fastText, I also average the stroke n-gram feature vectors of each word and offer that average as an alternative word vector.

  • The substoke model produces two outputs:

    • The output ending with vec contains the context word vectors.
    • The output ending with avg contains the averaged stroke n-gram feature vectors.

Word similarity evaluation

1. Evaluation script

I have written a Chinese word similarity evaluation script, Chinese-Word-Similarity-and-Word-Analogy; see its readme for details.
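Word similarity benchmarks score embeddings by correlating the cosine similarity of word pairs with human ratings. A minimal cosine-similarity helper (a sketch of the core computation, not the evaluation script itself):

```python
# Cosine similarity between two embedding vectors.
import numpy as np

def cosine(u, v):
    """Return cos(u, v) in [-1, 1]; assumes neither vector is all zeros."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```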

2. Parameter Settings

The parameters are set as follows:

dim  100
window sizes  5
negative  5
epoch  5
minCount  10
lr  skipgram(0.025),cbow(0.05),substoke(0.025)
n-gram  minn=3, maxn=18

3. Results

(Experimental results figure from the repository; not reproduced here.)

Full documentation

Invoke a command without arguments to list available arguments and their default values:

./word2vec 
usage: word2vec <command> <args>
The commands supported by word2vec are:

skipgram  ------ train word embedding by use skipgram model
cbow      ------ train word embedding by use cbow model
subword   ------ train word embedding by use subword(fasttext skipgram)  model
substoke  ------ train chinese character embedding by use substoke(cw2vec) model

./word2vec substoke -h
Train Embedding By Using [substoke] model
Here is the help information! Usage:

The Following arguments are mandatory:
	-input              training file path
	-infeature          substoke feature file path
	-output             output file path

The Following arguments are optional:
	-verbose            verbosity level[2]

The following arguments for the dictionary are optional:
	-minCount           minimal number of word occurences default:[10]
	-bucket             number of buckets default:[2000000]
	-minn               min length of char ngram default:[3]
	-maxn               max length of char ngram default:[6]
	-t                  sampling threshold default:[0.001]

The following arguments for training are optional:
	-lr                 learning rate default:[0.05]
	-lrUpdateRate       change the rate of updates for the learning rate default:[100]
	-dim                size of word vectors default:[100]
	-ws                 size of the context window default:[5]
	-epoch              number of epochs default:[5]
	-neg                number of negatives sampled default:[5]
	-loss               loss function {ns} default:[ns]
	-thread             number of threads default:[1]
	-pretrainedVectors  pretrained word vectors for supervised learning default:[]
	-saveOutput         whether output params should be saved default:[false]

References

[1] Cao, Shaosheng, et al. "cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information." (2018).
[2] Bojanowski, Piotr, et al. "Enriching word vectors with subword information." arXiv preprint arXiv:1607.04606 (2016).
[3] fastText-github
[4] cw2vec理论及其实现


cw2vec's Issues

install error

In file included from /algor/zhouc/cw2vec/word2vec/src/main.cpp:13:0:
/algor/zhouc/cw2vec/word2vec/src/include/args.h: In member function ‘void Args::parseArgs(const std::vector<std::basic_string >&)’:
/algor/zhouc/cw2vec/word2vec/src/include/args.h:168:10: error: expected type-specifier
catch (std::out_of_range){
^
/algor/zhouc/cw2vec/word2vec/src/include/args.h:168:27: error: expected unqualified-id before ‘)’ token
catch (std::out_of_range){
^
make[2]: *** [src/CMakeFiles/word2vec.dir/main.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/word2vec.dir/all] Error 2
make: *** [all] Error 2

Killed while initializing features

Hi, I'm running your code to train the substoke model on my 80G corpus, but the process was killed.
Here is a picture of the error. I modified run.sh like this:

path_input=/data2/private/huangjunjie/COS960
path_out=.

rm -rf ./bin
cp -rf ./word2vec/bin .

./bin/word2vec substoke -input ${path_input}/SogouT_all -infeature ./Simplified_Chinese_Feature/sin_chinese_feature.txt -output ${path_out}/cw2vec_vector -lr 0.025 -dim 300 -ws 5 -epoch 5 -minCount 10 -neg 5 -loss ns -minn 3 -maxn 18 -thread 20 -t 1e-4 -lrUpdateRate 100

Could you tell me what's wrong with it? Thanks.

How to train on a large-scale corpus

Hello. I previously used your code to build character vectors on a small corpus, but now I need to train word vectors on a corpus the size of Baidu Baike, and I want to use stroke features. How should I train on a corpus that spans multiple files? Also, it seems your code does not provide GPU acceleration, correct?

BOW EOW problem

I don't think they add '<' and '>' to each word in the paper (substroke model).

lines of training corpus

Hi,

This repo is really fantastic! 🥂
In the training sample, I notice that each line consists of a complete passage. Why don't you place one sentence per line? Is there any insight behind it?

Many thanks!

Time cost

hello,
I would like to know the time you spent to train per epoch and how many epochs should I train to get a good embedding model?

Thank you~

Training

How do I run training? I don't understand.

cw2vec training process

Hello, I have two questions:

  1. How are English letters and numbers handled in the training or test samples?
  2. These embeddings are for words; is the model also suitable for training character embeddings?

Thanks!

Input file format

What input file (training corpus) is used in your experiment, and what should its format be?
