jannson / wordmaker
Automatically generates Chinese words from huge text.
With cmake version 2.8.12.2 the build completes. hugemaker works without problems on texts under 200 MB, but on texts larger than that it fails with the following error:
hugemaker: /home/supermicro/programs/wordmaker-master/src/hugemaker.cpp:808: void WordMaker::reduce_step1(): Assertion `-1 != open_status' failed.
Aborted (core dumped)
Could this be because problems reported while compiling wordmaker affect later use? The build output is as follows:
Linking CXX static library ../lib/libmarisa.a
[ 81%] Built target marisa
Scanning dependencies of target hugemaker
[ 90%] Building CXX object src/CMakeFiles/hugemaker.dir/hugemaker.cpp.o
src/hugemaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
src/hugemaker.cpp:412:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
src/hugemaker.cpp:412:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
^
src/hugemaker.cpp:429:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(left_f <= LEAST_FREQ)
^
src/hugemaker.cpp:450:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(right_f <= LEAST_FREQ)
^
src/hugemaker.cpp: In member function ‘void WordMaker::remove_buck_files()’:
src/hugemaker.cpp:518:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i = 0; i < total_bucket; i++) {
^
src/hugemaker.cpp: In member function ‘void WordMaker::reduce_step1_run()’:
src/hugemaker.cpp:750:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i = 0; i < total_bucket; i++) {
^
src/hugemaker.cpp:769:75: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
fprintf(stderr, "pos:%d done saved keys:%d\n", pos, seq_trie.num_keys());
^
src/hugemaker.cpp: In member function ‘void WordMaker::reduce_step1()’:
src/hugemaker.cpp:806:79: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
fprintf(stderr, "open %s ok keys:%d\n", trie_file.c_str(), trie.num_keys());
^
src/hugemaker.cpp:813:75: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘std::size_t {aka long unsigned int}’ [-Wformat=]
fprintf(stderr, "build marisa trie. real total words:%d \n", kset.size());
^
src/hugemaker.cpp:828:32: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i = 0; i < kset.size(); i++) {
^
In file included from src/hugemaker.cpp:42:0:
src/cedar.h: In member function ‘int cedar::da<value_type, NO_VALUE, NO_PATH, ORDERED, MAX_TRIAL, NUM_TRACKING_NODES>::open(const char*, const char*, size_t, size_t) [with value_type = unsigned int; int NO_VALUE = -1; int NO_PATH = -2; bool ORDERED = true; int MAX_TRIAL = 1; long unsigned int NUM_TRACKING_NODES = 0ul; size_t = long unsigned int]’:
src/cedar.h:309:7: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
std::fread (&bheadF, sizeof (int), 1, fp);
^
src/cedar.h:310:7: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
std::fread (&bheadC, sizeof (int), 1, fp);
^
src/cedar.h:311:7: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
std::fread (&bheadO, sizeof (int), 1, fp);
^
Linking CXX executable ../bin/hugemaker
[ 90%] Built target hugemaker
Scanning dependencies of target wordmaker
[100%] Building CXX object src/CMakeFiles/wordmaker.dir/wordmaker.cpp.o
src/wordmaker.cpp: In constructor ‘WordMaker::trie_combine_t::trie_combine_t(ptrie_t, WordMaker*)’:
src/wordmaker.cpp:196:14: warning: ‘WordMaker::trie_combine_t::pmaker’ will be initialized after [-Wreorder]
WordMaker* pmaker;
^
src/wordmaker.cpp:194:10: warning: ‘size_t WordMaker::trie_combine_t::word_len’ [-Wreorder]
size_t word_len;
^
src/wordmaker.cpp:176:3: warning: when initialized here [-Wreorder]
trie_combine_t(ptrie_t pt, WordMaker* maker):ptrie(pt)
^
src/wordmaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
src/wordmaker.cpp:261:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
^
src/wordmaker.cpp:261:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
^
src/wordmaker.cpp:281:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(left_f <= LEAST_FREQ)
^
src/wordmaker.cpp:288:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(right_f <= LEAST_FREQ)
^
src/wordmaker.cpp: In constructor ‘WordMaker::WordMaker(const char*, int)’:
src/wordmaker.cpp:547:15: warning: ‘WordMaker::step1_done’ will be initialized after [-Wreorder]
volatile int step1_done;
^
src/wordmaker.cpp:545:9: warning: ‘int WordMaker::thread_n’ [-Wreorder]
int thread_n;
^
src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
WordMaker(const char* filename
^
src/wordmaker.cpp:545:9: warning: ‘WordMaker::thread_n’ will be initialized after [-Wreorder]
int thread_n;
^
src/wordmaker.cpp:538:15: warning: ‘pstring_list WordMaker::phz_str’ [-Wreorder]
pstring_list phz_str;
^
src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
WordMaker(const char* filename
^
Linking CXX executable ../bin/wordmaker
[100%] Built target wordmaker
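Most of the warnings in the log above fall into three groups: -Wsign-compare, -Wformat= (a size_t printed with %d), and -Wreorder. The log still ends with "Built target" for both binaries, so they did not stop the build. The following is a minimal sketch of how warnings of these kinds are usually silenced. The names count_nonempty, format_key_count and Counter are hypothetical and are not taken from wordmaker.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: a size_t index keeps both sides of the comparison
// with size() unsigned (avoids -Wsign-compare).
std::size_t count_nonempty(const std::vector<std::string>& keys) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (!keys[i].empty()) ++n;
    }
    return n;
}

// Hypothetical helper: %zu is the printf conversion for size_t
// (avoids -Wformat= when the argument is not an int).
std::string format_key_count(std::size_t num_keys) {
    char buf[64];
    const int written = std::snprintf(buf, sizeof buf, "saved keys:%zu", num_keys);
    assert(written > 0 && static_cast<std::size_t>(written) < sizeof buf);
    return std::string(buf);
}

// Hypothetical class: members are initialized in declaration order, so the
// initializer list is written in that same order (avoids -Wreorder).
struct Counter {
    int thread_n;
    int step_done;
    explicit Counter(int threads) : thread_n(threads), step_done(0) {}
};
```

Whether any of these warnings is related to the assertion failure is not established in this thread.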
term freq left entropy right entropy
湿隔离 257 12.850000 2.502627
保湿隔离 255 12.750000 2.498990
These two can in fact be merged into one entry; it would be good to add support for this in the code.
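In the two rows above, the shorter term is contained in the longer one and their frequencies are almost equal (257 against 255), so the shorter term almost never occurs on its own. One possible post-processing step, sketched here under assumptions, is to drop a candidate when a longer candidate contains it and accounts for most of its occurrences. Candidate, drop_absorbed and the 0.9 threshold used in the note below are illustrative and not part of wordmaker.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one output row.
struct Candidate {
    std::string term;  // raw bytes of the term
    unsigned freq;     // corpus frequency
};

// Drop a candidate when some longer candidate contains it and has at least
// `ratio` times its frequency. `ratio` is an illustrative threshold.
std::vector<Candidate> drop_absorbed(const std::vector<Candidate>& in,
                                     double ratio) {
    assert(ratio > 0.0 && ratio <= 1.0);
    std::vector<Candidate> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool absorbed = false;
        for (std::size_t j = 0; j < in.size() && !absorbed; ++j) {
            if (i == j || in[j].term.size() <= in[i].term.size()) continue;
            const bool contains =
                in[j].term.find(in[i].term) != std::string::npos;
            absorbed = contains && in[j].freq >= ratio * in[i].freq;
        }
        if (!absorbed) out.push_back(in[i]);
    }
    return out;
}
```

With a ratio of 0.9, the row with frequency 257 would be dropped in favour of the longer row with frequency 255. The substring test works on bytes, which is safe for UTF-8 but can match across character boundaries in double-byte encodings such as GBK. The double loop is quadratic, so this is only suitable for a modest number of candidates.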
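Why does the file produced by ./bin/wordmaker input.txt output.txt display correctly in more but appear garbled in vi, and why can it not be converted to another encoding? file output.txt reports "Non-ISO extended-ASCII". This is an encoding problem; any help is appreciated. The system is Ubuntu.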
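A result of "Non-ISO extended-ASCII" from file means the output is neither plain ASCII nor well-formed UTF-8. To narrow such a problem down, it helps to check programmatically whether a given file is valid UTF-8 before and after processing. The following is a minimal sketch of such a check; is_valid_utf8 and self_check are hypothetical helpers, and whether wordmaker expects a particular input encoding is not stated in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true when `bytes` is well-formed UTF-8 (rejects overlong forms,
// surrogates, truncated sequences and code points above U+10FFFF).
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;               // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
        if (c <= 0x7F) { extra = 0; }
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; }
        else if (c == 0xE0) { extra = 2; lo = 0xA0; }
        else if (c == 0xED) { extra = 2; hi = 0x9F; }
        else if (c >= 0xE1 && c <= 0xEF) { extra = 2; }
        else if (c == 0xF0) { extra = 3; lo = 0x90; }
        else if (c >= 0xF1 && c <= 0xF3) { extra = 3; }
        else if (c == 0xF4) { extra = 3; hi = 0x8F; }
        else { return false; }               // 0x80..0xC1 or 0xF5..0xFF
        if (extra > n - i - 1) return false; // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            const unsigned char min_b = (k == 1) ? lo : 0x80;
            const unsigned char max_b = (k == 1) ? hi : 0xBF;
            if (b < min_b || b > max_b) return false;
        }
        i += extra + 1;
    }
    return true;
}

// Small self-check: the same character in UTF-8 and in GBK.
void self_check() {
    assert(is_valid_utf8("\xE4\xB8\xAD"));  // U+4E2D encoded as UTF-8
    assert(!is_valid_utf8("\xD6\xD0"));     // the GBK bytes for U+4E2D
}
```

If the input passes this check and the output does not, the bytes were altered during processing, for example because multi-byte characters were split at the wrong boundaries.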
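Hi jannson,
In the description of another of your projects you say that the algorithm used here comes from the Matrix67 article. After reading your code, however, I found that the statistics you use are mainly the left and right adjacency entropies, and I did not see cohesion being used. I therefore added a cohesion statistic on top of it, which filters out some more "pseudo new words". Even so, on some corpora the tool's ability to discover new words is still not very good.
So, which statistics does your code mainly use? Based on your experience, which directions would need work to improve it further? Thanks!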
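For readers unfamiliar with the two statistics mentioned in this question, here is a minimal sketch of both. It is not wordmaker's implementation. It assumes natural-log entropy over the counts of neighbouring characters, and cohesion defined as the minimum, over all two-way splits of a term, of P(term) / (P(left) * P(right)). FreqMap, branching_entropy and cohesion are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, unsigned> FreqMap;  // substring -> corpus count

// Shannon entropy (natural log) of the counts of each distinct character
// seen directly to the left (or right) of a term.
double branching_entropy(const std::vector<unsigned>& neighbor_counts) {
    double total = 0.0;
    for (std::size_t i = 0; i < neighbor_counts.size(); ++i) {
        total += neighbor_counts[i];
    }
    if (total == 0.0) return 0.0;
    double h = 0.0;
    for (std::size_t i = 0; i < neighbor_counts.size(); ++i) {
        if (neighbor_counts[i] == 0) continue;
        const double p = neighbor_counts[i] / total;
        h -= p * std::log(p);
    }
    return h;
}

// Cohesion of a term given as a list of characters (one string per
// character, so the function is independent of the byte encoding).
// `total` is the corpus size used to turn counts into probabilities.
// Returns 0 when the term or every split is missing from `freq`.
double cohesion(const std::vector<std::string>& chars,
                const FreqMap& freq, double total) {
    assert(chars.size() >= 2 && total > 0.0);
    std::string whole;
    for (std::size_t i = 0; i < chars.size(); ++i) whole += chars[i];
    FreqMap::const_iterator w = freq.find(whole);
    if (w == freq.end()) return 0.0;
    const double p_whole = w->second / total;
    double best = -1.0;  // -1 means "no usable split seen yet"
    std::string left;
    for (std::size_t cut = 1; cut < chars.size(); ++cut) {
        left += chars[cut - 1];
        const std::string right = whole.substr(left.size());
        FreqMap::const_iterator l = freq.find(left);
        FreqMap::const_iterator r = freq.find(right);
        if (l == freq.end() || r == freq.end()) continue;
        if (l->second == 0 || r->second == 0) continue;
        const double ratio =
            p_whole / ((l->second / total) * (r->second / total));
        if (best < 0.0 || ratio < best) best = ratio;
    }
    return best < 0.0 ? 0.0 : best;
}
```

A candidate would typically be kept only when both its cohesion and the smaller of its left and right entropies exceed chosen thresholds; suitable thresholds depend on the corpus.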
I downloaded the new version of the code. cmake now runs without problems, but make still fails:
[100%] Building CXX object src/CMakeFiles/wordmaker.dir/wordmaker.cpp.o
/home/bigdata/wordmaker-master/src/wordmaker.cpp:325:26: error: ‘constexpr’ needed for in-class initialization of static data member ?.?.of non-integral type
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::trie_combine_t::trie_combine_t(ptrie_t, WordMaker*)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:196:14: warning: ‘WordMaker::trie_combine_t::pmaker’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:194:10: warning: ‘size_t WordMaker::trie_combine_t::word_len’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:176:3: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:268:45: error: ?.?.was not declared in this scope
/home/bigdata/wordmaker-master/src/wordmaker.cpp:280:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:287:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::WordMaker(const char*, int)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:547:15: warning: ‘WordMaker::step1_done’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘int WordMaker::thread_n’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘WordMaker::thread_n’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:538:15: warning: ‘pstring_list WordMaker::phz_str’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
make[2]: *** [src/CMakeFiles/wordmaker.dir/wordmaker.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/wordmaker.dir/all] Error 2
make: *** [all] Error 2
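The first error in this log (wordmaker.cpp:325:26) is the one a C++11 compiler reports when a static const data member of non-integral type, such as a double, is given an initializer inside the class body. The member's name is not legible in the log, so the sketch below uses a hypothetical class Scorer with a hypothetical member threshold to show the portable form: declare the member in the class and define it outside.

```cpp
#include <cassert>

// Hypothetical class; `threshold` stands in for the unnamed member in the log.
struct Scorer {
    static const int kMaxLen = 8;   // integral type: in-class initializer is fine
    static const double threshold;  // non-integral: declare here, define below

    bool passes(double score) const {
        assert(score == score);     // reject NaN input
        return score >= threshold;
    }
};

// Out-of-class definition, accepted by both C++98 and C++11 compilers.
const double Scorer::threshold = 0.5;
```

An alternative on a C++11 compiler is to declare the member static constexpr double and keep the initializer inside the class.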
With cmake 2.8.7, running cmake reports errors, and make fails as well.
Output from cmake:
wordmaker-master/build$ cmake ..
CMake Warning (dev) in CMakeLists.txt:
No cmake_minimum_required command is present. A line of code such as
cmake_minimum_required(VERSION 2.8)
should be added at the top of the file. The version specified may be lower
if you wish to support older CMake versions for this project. For more
information run "cmake --help-policy CMP0000".
This warning is for project developers. Use -Wno-dev to suppress it.
-- Configuring done
CMake Warning (dev) at src/CMakeLists.txt:16 (ADD_EXECUTABLE):
Policy CMP0003 should be set before this line. Add code such as
if(COMMAND cmake_policy)
cmake_policy(SET CMP0003 NEW)
endif(COMMAND cmake_policy)
as early as possible but after the most recent call to
cmake_minimum_required or cmake_policy(VERSION). This warning appears
because target "hugemaker" links to some libraries for which the linker
must search:
pthread
and other libraries with known full path:
/home/bigdata/wordmaker-master/build/lib/libmarisa.a
CMake is adding directories in the second list to the linker search path in
case they are needed to find libraries from the first list (for backwards
compatibility with CMake 2.4). Set policy CMP0003 to OLD or NEW to enable
or disable this behavior explicitly. Run "cmake --help-policy CMP0003" for
more information.
This warning is for project developers. Use -Wno-dev to suppress it.
-- Generating done
-- Build files have been written to: /home/bigdata/wordmaker-master/build
Errors from make:
wordmaker-master/build$ make
[ 81%] Built target marisa
[ 90%] Built target hugemaker
[100%] Building CXX object src/CMakeFiles/wordmaker.dir/wordmaker.o
/home/bigdata/wordmaker-master/src/wordmaker.cpp:325:26: error: ‘constexpr’ needed for in-class initialization of static data member ?.?.of non-integral type
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::trie_combine_t::trie_combine_t(ptrie_t, WordMaker*)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:196:14: warning: ‘WordMaker::trie_combine_t::pmaker’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:194:10: warning: ‘size_t WordMaker::trie_combine_t::word_len’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:176:3: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:268:45: error: ?.?.was not declared in this scope
/home/bigdata/wordmaker-master/src/wordmaker.cpp:280:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:287:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::WordMaker(const char*, int)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:547:15: warning: ‘WordMaker::step1_done’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘int WordMaker::thread_n’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘WordMaker::thread_n’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:538:15: warning: ‘pstring_list WordMaker::phz_str’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
make[2]: *** [src/CMakeFiles/wordmaker.dir/wordmaker.o] Error 1
make[1]: *** [src/CMakeFiles/wordmaker.dir/all] Error 2
make: *** [all] Error 2
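Previously I used hugemaker for new-word discovery on files of about 100 MB and it ran correctly without problems. Today, however, the following error appeared; could the maintainer please advise:
hugemaker: /home/supermicro/programs/wordmaker-master/src/hugemaker.cpp:808: void WordMaker::reduce_step1(): Assertion `-1 != open_status' failed.
Aborted (core dumped)
The word_freq.txt_seq files all look normal.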
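The assertion message only says that some open call returned -1; it does not say which file was involved or why the call failed. A hedged sketch of how that information could be surfaced is to probe the file with the C standard library and report the errno text before aborting. describe_open_failure is a hypothetical helper and is not part of hugemaker; the cause of the failure reported here is not established in this thread.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: returns an empty string when `path` can be opened
// for reading, otherwise "<path>: <system error text>".
std::string describe_open_failure(const std::string& path) {
    assert(!path.empty());
    errno = 0;
    std::FILE* fp = std::fopen(path.c_str(), "rb");
    if (fp == NULL) {
        const int err = errno;  // save before any other library call
        return path + ": " + std::strerror(err);
    }
    std::fclose(fp);
    return std::string();
}
```

Printing the returned message to stderr just before the assert would distinguish, for example, a missing file from a permission problem or an exhausted file-descriptor limit.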
Hi,
Processing a document of about 400 MB with hugemaker fails at line 808 of src/hugemaker.cpp; it dies while reading the last seq file:
assert(-1 != open_status);
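Processing a 20 MB document works, but the result is garbled, and converting it from GBK to UTF-8 does not help either. The encoding of the input document is: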
file ~/work_corpus_head
/home/ubuntu/work_corpus_head: UTF-8 Unicode text, with very long lines
The encoding of the output document is:
file work_word_new
work_word_new: Non-ISO extended-ASCII text, with LF, NEL line terminators
Are there any requirements on the format of the input document?
Thanks!