
wordmaker's Introduction

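wordmaker: a word generation tool

Using the regularities by which words are formed, wordmaker automatically learns the words contained in a large text, with no additional information required.

Many word segmentation libraries depend on a dictionary, and specialized domains in particular need many domain-specific terms. Building a dictionary by hand takes a great deal of time, so the goal is a tool that learns words automatically from text. It can also be applied to analyzing the vocabulary habits of a particular group of people.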

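Important notes

  • I generally reply when people email me to discuss problems, but please submit questions by opening an issue in the project whenever possible. That way I see the problem sooner, and other people in the project benefit as well. What helps you will also help those who come later. If you have an open mindset, you will appreciate the benefit.
  • The project has moved under avplayer. If you have a problem, please open an issue in that project.
  • avboost is a very good C++ community. I am not acquainted with its moderators and have no time to post there, but I believe they are all very open and highly capable people, so moving the project under their name should make it better.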

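Implementation

My first attempt was a simple version, but it was single-threaded, ran slowly, and consumed a huge amount of memory. I have recently been learning C++11 and used a map/reduce-like approach, so I decided to use this project for practice, and I hope more people will exchange ideas. Details:

  • Memory-efficient trie structures (Double Array Trie & MARISA).
  • Multiple threads independently compute word information for each text block. The results are then merged in segments by word order, and for each segment the probability that characters form a word and the left and right entropy are computed to produce the output words. Because everything is processed in independent blocks, multithreading is easy to apply. This approach should work for most text-processing tasks. If you know a better one, I would be glad to hear it.
  • src/wordmaker.cpp does all computation in memory and depends only on cedar.h. It can process about 20M of text quickly.
  • src/hugemaker.cpp saves some intermediate files during computation and merging in order to save memory, so it can handle larger texts (50M+). marisa::Trie supports mmap-ing the trie file into memory, so with modifications it could handle much larger texts (at the cost of a longer running time, of course).
  • Four threads are enabled by default. You can change the macro in the code, or control it yourself with a shell command.
  • To keep the code simple, only GBK encoding is supported. (There is little validation, so input in other encodings may cause a segmentation fault. Help with improving this is welcome.)
  • It compiles successfully on Linux and Cygwin. Visual Studio should presumably work as well.
  • Because of my job I have spent far more time with C than with C++. If you see anything in the code that does not follow current C++ practice, please point it out so we can discuss it.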
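
The block-wise counting and merging described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: it uses std::map instead of a trie, takes the text as an already decoded std::u32string rather than GBK bytes, and merges everything into one map instead of merging in segments. The names Counts, count_chunk, and count_parallel are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Hypothetical type: candidate string -> occurrence count.
using Counts = std::map<std::u32string, std::size_t>;

// "Map" step: count every substring of length 1..max_len that starts
// in [begin, end). A substring may run past `end` into the next chunk,
// so candidates that straddle a chunk boundary are counted exactly once.
void count_chunk(const std::u32string& text, std::size_t begin,
                 std::size_t end, std::size_t max_len, Counts& out) {
    for (std::size_t i = begin; i < end; ++i) {
        for (std::size_t len = 1;
             len <= max_len && i + len <= text.size(); ++len) {
            ++out[text.substr(i, len)];
        }
    }
}

// Run the map step on n_threads chunks in parallel, then "reduce" by
// summing the per-thread maps into a single result.
Counts count_parallel(const std::u32string& text, std::size_t max_len,
                      std::size_t n_threads) {
    n_threads = std::max<std::size_t>(1, n_threads);
    std::vector<Counts> partial(n_threads);  // one private map per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (text.size() + n_threads - 1) / n_threads;
    for (std::size_t t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(text.size(), t * chunk);
        const std::size_t end = std::min(text.size(), begin + chunk);
        workers.emplace_back(count_chunk, std::cref(text), begin, end,
                             max_len, std::ref(partial[t]));
    }
    for (auto& w : workers) w.join();

    Counts total;
    for (const auto& p : partial)
        for (const auto& kv : p) total[kv.first] += kv.second;
    return total;
}
```

Each thread writes only to its own map, so no locking is needed until the merge. With GCC this typically needs to be compiled with -std=c++11 -pthread.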

Build and usage

mkdir build

cd build

cmake28 ..

make

./bin/wordmaker input.txt output.txt or

./bin/hugemaker input.txt output.txt

On Linux you can use:

iconv -f "gbk" -t "utf-8//IGNORE" < infile > outfile

to convert the encoding. On Windows you can of course use Notepad++ and convert to ANSI.

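Test corpus

The Sogou news corpus, with all texts merged into one file of 50M in total: http://pan.baidu.com/s/1mgoIPxY. It also contains the collected works of Mo Yan as an input corpus, to make it easy for everyone to test the code.

Here are the results from my runs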

TODO

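  • Algorithm optimization: the results occasionally contain wrong words such as “鼓舞了”, “默认了”, and “鲜红的”. The algorithm already handles this case fairly well, but there should be a better way. For example, very high-frequency characters such as “的”, “了”, and “在” could be given a lower word-formation weight while the weight of words in other cases is raised, which might improve the output.

  • Find more corpora and, together with this tool, implement ways to extract more interesting information from them, such as internet buzzwords or the specialized terms of a particular industry.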

wordmaker's Issues

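Compile warnings, affecting new-word discovery on large texts

cmake version 2.8.12.2. After building, hugemaker works without problems on texts under 200MB, but on texts above that size it reports the following error: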

hugemaker: /home/supermicro/programs/wordmaker-master/src/hugemaker.cpp:808: void WordMaker::reduce_step1(): Assertion `-1 != open_status' failed.
Aborted (core dumped)

I wonder whether problems during the wordmaker build are affecting later use. The build output is as follows:
Linking CXX static library ../lib/libmarisa.a
[ 81%] Built target marisa
Scanning dependencies of target hugemaker
[ 90%] Building CXX object src/CMakeFiles/hugemaker.dir/hugemaker.cpp.o
src/hugemaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
src/hugemaker.cpp:412:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
src/hugemaker.cpp:412:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
^
src/hugemaker.cpp:429:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(left_f <= LEAST_FREQ)
^
src/hugemaker.cpp:450:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(right_f <= LEAST_FREQ)
^
src/hugemaker.cpp: In member function ‘void WordMaker::remove_buck_files()’:
src/hugemaker.cpp:518:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i = 0; i < total_bucket; i++) {
^
src/hugemaker.cpp: In member function ‘void WordMaker::reduce_step1_run()’:
src/hugemaker.cpp:750:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i = 0; i < total_bucket; i++) {
^
src/hugemaker.cpp:769:75: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
fprintf(stderr, "pos:%d done saved keys:%d\n", pos, seq_trie.num_keys());
^
src/hugemaker.cpp: In member function ‘void WordMaker::reduce_step1()’:
src/hugemaker.cpp:806:79: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
fprintf(stderr, "open %s ok keys:%d\n", trie_file.c_str(), trie.num_keys());
^
src/hugemaker.cpp:813:75: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘std::size_t {aka long unsigned int}’ [-Wformat=]
fprintf(stderr, "build marisa trie. real total words:%d \n", kset.size());
^
src/hugemaker.cpp:828:32: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i = 0; i < kset.size(); i++) {
^
In file included from src/hugemaker.cpp:42:0:
src/cedar.h: In member function ‘int cedar::da<value_type, NO_VALUE, NO_PATH, ORDERED, MAX_TRIAL, NUM_TRACKING_NODES>::open(const char*, const char*, size_t, size_t) [with value_type = unsigned int; int NO_VALUE = -1; int NO_PATH = -2; bool ORDERED = true; int MAX_TRIAL = 1; long unsigned int NUM_TRACKING_NODES = 0ul; size_t = long unsigned int]’:
src/cedar.h:309:7: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
std::fread (&bheadF, sizeof (int), 1, fp);
^
src/cedar.h:310:7: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
std::fread (&bheadC, sizeof (int), 1, fp);
^
src/cedar.h:311:7: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
std::fread (&bheadO, sizeof (int), 1, fp);
^
Linking CXX executable ../bin/hugemaker
[ 90%] Built target hugemaker
Scanning dependencies of target wordmaker
[100%] Building CXX object src/CMakeFiles/wordmaker.dir/wordmaker.cpp.o
src/wordmaker.cpp: In constructor ‘WordMaker::trie_combine_t::trie_combine_t(ptrie_t, WordMaker*)’:
src/wordmaker.cpp:196:14: warning: ‘WordMaker::trie_combine_t::pmaker’ will be initialized after [-Wreorder]
WordMaker* pmaker;
^
src/wordmaker.cpp:194:10: warning: ‘size_t WordMaker::trie_combine_t::word_len’ [-Wreorder]
size_t word_len;
^
src/wordmaker.cpp:176:3: warning: when initialized here [-Wreorder]
trie_combine_t(ptrie_t pt, WordMaker* maker):ptrie(pt)
^
src/wordmaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
src/wordmaker.cpp:261:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
^
src/wordmaker.cpp:261:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
^
src/wordmaker.cpp:281:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(left_f <= LEAST_FREQ)
^
src/wordmaker.cpp:288:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(right_f <= LEAST_FREQ)
^
src/wordmaker.cpp: In constructor ‘WordMaker::WordMaker(const char*, int)’:
src/wordmaker.cpp:547:15: warning: ‘WordMaker::step1_done’ will be initialized after [-Wreorder]
volatile int step1_done;
^
src/wordmaker.cpp:545:9: warning: ‘int WordMaker::thread_n’ [-Wreorder]
int thread_n;
^
src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
WordMaker(const char* filename
^
src/wordmaker.cpp:545:9: warning: ‘WordMaker::thread_n’ will be initialized after [-Wreorder]
int thread_n;
^
src/wordmaker.cpp:538:15: warning: ‘pstring_list WordMaker::phz_str’ [-Wreorder]
pstring_list phz_str;
^
src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
WordMaker(const char* filename
^
Linking CXX executable ../bin/wordmaker
[100%] Built target wordmaker
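
The warnings in this log fall into four GCC categories: -Wsign-compare, -Wformat, -Wunused-result, and -Wreorder. The sketch below shows the usual way each category is silenced. The struct and function names are hypothetical and do not come from the project's sources, and whether any of these warnings is related to the assertion failure is not established here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// -Wreorder: members are initialized in declaration order, whatever
// order the initializer list uses, so write the list in that order.
// Hypothetical struct, loosely modelled on the pmaker/word_len case.
struct Combiner {
    std::size_t word_len;  // declared first, so initialized first
    const void* owner;     // declared second, so initialized second
    Combiner(const void* o, std::size_t len) : word_len(len), owner(o) {}
};

// -Wsign-compare: use an unsigned index when the bound is unsigned.
// -Wformat: print a size_t with %zu instead of %d.
std::size_t report(const std::vector<int>& keys) {
    std::size_t visited = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) ++visited;
    std::fprintf(stderr, "saved keys:%zu\n", keys.size());
    return visited;
}

// -Wunused-result: check how many items fread actually delivered.
bool read_int(std::FILE* fp, int& out) {
    return std::fread(&out, sizeof(int), 1, fp) == 1;
}
```

Of the four, the format mismatch is the one with formally undefined behavior, since %d expects an int while size_t is a wider unsigned type on this platform.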

Duplicate words discovered

term freq left entropy right entropy

湿隔离 257 12.850000 2.502627

保湿隔离 255 12.750000 2.498990

These two could really be merged into one. It would be good to add support for this in the code.
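
One possible way to merge such pairs is a post-processing pass that drops a term when a longer term contains it and occurs almost as often. The sketch below is only an illustration of that idea and not part of wordmaker. Term, drop_nested, and the ratio threshold are hypothetical, and terms are assumed to be decoded into std::u32string so that substring matching works on whole characters rather than GBK bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one output row: the term and its frequency.
struct Term {
    std::u32string text;
    std::size_t freq;
};

// Keep a term unless some strictly longer term contains it and occurs
// at least `ratio` times as often, meaning the shorter term almost
// never appears outside the longer one.
std::vector<Term> drop_nested(const std::vector<Term>& terms, double ratio) {
    std::vector<Term> kept;
    for (const Term& t : terms) {
        bool absorbed = false;
        for (const Term& longer : terms) {
            if (longer.text.size() > t.text.size() &&
                longer.text.find(t.text) != std::u32string::npos &&
                static_cast<double>(longer.freq) >=
                    ratio * static_cast<double>(t.freq)) {
                absorbed = true;
                break;
            }
        }
        if (!absorbed) kept.push_back(t);
    }
    return kept;
}
```

With the two rows above and a ratio of 0.95, 湿隔离 (257) would be dropped in favor of 保湿隔离 (255), because 255 is more than 95% of 257. The pass is quadratic in the number of terms, which may be too slow for a very large output list.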

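Non-ISO extended-ASCII encoding problem

Why does the file produced by the command ./bin/wordmaker input.txt output.txt display correctly in more but appear garbled in vi, and why can it not be converted? file output.txt reports Non-ISO extended-ASCII. Please help. The system is Ubuntu.

Build problem

Running cmake with cmake 2.8.7 reports errors, and make reports errors as well.
Output from cmake: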
wordmaker-master/build$ cmake ..
CMake Warning (dev) in CMakeLists.txt:
No cmake_minimum_required command is present. A line of code such as

cmake_minimum_required(VERSION 2.8)

should be added at the top of the file. The version specified may be lower
if you wish to support older CMake versions for this project. For more
information run "cmake --help-policy CMP0000".
This warning is for project developers. Use -Wno-dev to suppress it.

-- Configuring done
CMake Warning (dev) at src/CMakeLists.txt:16 (ADD_EXECUTABLE):
Policy CMP0003 should be set before this line. Add code such as

if(COMMAND cmake_policy)
  cmake_policy(SET CMP0003 NEW)
endif(COMMAND cmake_policy)

as early as possible but after the most recent call to
cmake_minimum_required or cmake_policy(VERSION). This warning appears
because target "hugemaker" links to some libraries for which the linker
must search:

pthread

and other libraries with known full path:

/home/bigdata/wordmaker-master/build/lib/libmarisa.a

CMake is adding directories in the second list to the linker search path in
case they are needed to find libraries from the first list (for backwards
compatibility with CMake 2.4). Set policy CMP0003 to OLD or NEW to enable
or disable this behavior explicitly. Run "cmake --help-policy CMP0003" for
more information.
This warning is for project developers. Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /home/bigdata/wordmaker-master/build

#################### separator

Errors from make:
wordmaker-master/build$ make
[ 81%] Built target marisa
[ 90%] Built target hugemaker
[100%] Building CXX object src/CMakeFiles/wordmaker.dir/wordmaker.o
/home/bigdata/wordmaker-master/src/wordmaker.cpp:325:26: error: ‘constexpr’ needed for in-class initialization of static data member ?.?.of non-integral type
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::trie_combine_t::trie_combine_t(ptrie_t, WordMaker*)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:196:14: warning: ‘WordMaker::trie_combine_t::pmaker’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:194:10: warning: ‘size_t WordMaker::trie_combine_t::word_len’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:176:3: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:268:45: error: ?.?.was not declared in this scope
/home/bigdata/wordmaker-master/src/wordmaker.cpp:280:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:287:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::WordMaker(const char*, int)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:547:15: warning: ‘WordMaker::step1_done’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘int WordMaker::thread_n’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘WordMaker::thread_n’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:538:15: warning: ‘pstring_list WordMaker::phz_str’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
make[2]: *** [src/CMakeFiles/wordmaker.dir/wordmaker.o] Error 1
make[1]: *** [src/CMakeFiles/wordmaker.dir/all] Error 2
make: *** [all] Error 2

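hugemaker fails with an error at the end of a run

I previously used hugemaker for new-word discovery on a file of about 100M and it ran correctly without problems. Today, however, the following error appeared. Could the maintainer please advise:
hugemaker: /home/supermicro/programs/wordmaker-master/src/hugemaker.cpp:808: void WordMaker::reduce_step1(): Assertion `-1 != open_status' failed.
Aborted (core dumped)
The word_freq.txt_seq files all look normal.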

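hugemaker core dump

Hi,

hugemaker fails when processing a document of about 400M. It dies at line 808 of src/hugemaker.cpp, while reading the final seq file:
assert(-1 != open_status);
A 20M document is processed without problems, but the result is garbled, and converting it from GBK to UTF-8 does not help either. The input document encoding is:
file ~/work_corpus_head
/home/ubuntu/work_corpus_head: UTF-8 Unicode text, with very long lines
The output document encoding is:
file work_word_new
work_word_new: Non-ISO extended-ASCII text, with LF, NEL line terminators
Are there any requirements on the input document format?
Thanks!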

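Which statistics does the code use?

Question

Hi jannson,
In the description of another project you said that the algorithm used here comes from the Matrix67 article. After reading your code, however, I found that the statistics you use are mainly the left and right adjacency entropy, and I did not see cohesion being used. I therefore added a cohesion statistic on top of it, which filters out some more "pseudo new words", but on some corpora the tool's ability to discover new words is still not very good.

So, which statistics does your code mainly use? Based on your experience, which directions would you suggest for further improvement? Thanks!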

K
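
For readers unfamiliar with the two statistics named in this question, they can be sketched as follows. This is a generic formulation under stated assumptions and is not taken from wordmaker's code: adjacency entropy is taken as the Shannon entropy of the characters seen next to a candidate, and cohesion as the smallest ratio P(word) / (P(left) * P(right)) over all two-way splits. The names Counts, entropy, and cohesion, and the use of std::u32string, are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <map>
#include <string>

// Hypothetical type: candidate string -> occurrence count.
using Counts = std::map<std::u32string, std::size_t>;

// Shannon entropy (natural log) of a neighbor-character histogram.
// Pass the left neighbors for left entropy, the right ones for right.
double entropy(const std::map<char32_t, std::size_t>& neighbors) {
    std::size_t total = 0;
    for (const auto& kv : neighbors) total += kv.second;
    if (total == 0) return 0.0;
    double h = 0.0;
    for (const auto& kv : neighbors) {
        const double p = static_cast<double>(kv.second) /
                         static_cast<double>(total);
        h -= p * std::log(p);
    }
    return h;
}

// Cohesion: the minimum of P(word) / (P(left) * P(right)) over every
// split of `word` into two non-empty parts, with P(x) = count(x) / total.
// Returns infinity when no split can be scored (one-character words).
double cohesion(const std::u32string& word, const Counts& counts,
                std::size_t total) {
    if (total == 0) return 0.0;
    const auto freq = [&](const std::u32string& s) {
        const auto it = counts.find(s);
        return it == counts.end() ? 0.0 : static_cast<double>(it->second);
    };
    const double n = static_cast<double>(total);
    const double p_word = freq(word) / n;
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t cut = 1; cut < word.size(); ++cut) {
        const double p_l = freq(word.substr(0, cut)) / n;
        const double p_r = freq(word.substr(cut)) / n;
        if (p_l == 0.0 || p_r == 0.0) continue;  // part missing from table
        best = std::min(best, p_word / (p_l * p_r));
    }
    return best;
}
```

A candidate is usually kept only when both entropies and the cohesion exceed chosen thresholds. The thresholds depend on the corpus and are not specified here.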

Build problem (follow-up to the first issue)

I downloaded the new version of the code. cmake no longer has problems, but make still reports errors:
[100%] Building CXX object src/CMakeFiles/wordmaker.dir/wordmaker.cpp.o
/home/bigdata/wordmaker-master/src/wordmaker.cpp:325:26: error: ‘constexpr’ needed for in-class initialization of static data member ?.?.of non-integral type
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::trie_combine_t::trie_combine_t(ptrie_t, WordMaker*)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:196:14: warning: ‘WordMaker::trie_combine_t::pmaker’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:194:10: warning: ‘size_t WordMaker::trie_combine_t::word_len’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:176:3: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:268:45: error: ?.?.was not declared in this scope
/home/bigdata/wordmaker-master/src/wordmaker.cpp:280:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:287:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::WordMaker(const char*, int)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:547:15: warning: ‘WordMaker::step1_done’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘int WordMaker::thread_n’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘WordMaker::thread_n’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:538:15: warning: ‘pstring_list WordMaker::phz_str’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
make[2]: *** [src/CMakeFiles/wordmaker.dir/wordmaker.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/wordmaker.dir/all] Error 2
make: *** [all] Error 2
