wordmaker's Issues

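Compile warnings that may affect new-word discovery on large texts

With cmake version 2.8.12.2 the build completes, and hugemaker works without problems on texts under 200 MB. On texts larger than that, hugemaker fails with the following error: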

hugemaker: /home/supermicro/programs/wordmaker-master/src/hugemaker.cpp:808: void WordMaker::reduce_step1(): Assertion `-1 != open_status' failed.
Aborted (core dumped)

Could the warnings emitted while building wordmaker be affecting later use? The build output is as follows:
Linking CXX static library ../lib/libmarisa.a
[ 81%] Built target marisa
Scanning dependencies of target hugemaker
[ 90%] Building CXX object src/CMakeFiles/hugemaker.dir/hugemaker.cpp.o
src/hugemaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
src/hugemaker.cpp:412:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
src/hugemaker.cpp:412:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
^
src/hugemaker.cpp:429:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(left_f <= LEAST_FREQ)
^
src/hugemaker.cpp:450:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(right_f <= LEAST_FREQ)
^
src/hugemaker.cpp: In member function ‘void WordMaker::remove_buck_files()’:
src/hugemaker.cpp:518:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i = 0; i < total_bucket; i++) {
^
src/hugemaker.cpp: In member function ‘void WordMaker::reduce_step1_run()’:
src/hugemaker.cpp:750:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i = 0; i < total_bucket; i++) {
^
src/hugemaker.cpp:769:75: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
fprintf(stderr, "pos:%d done saved keys:%d\n", pos, seq_trie.num_keys());
^
src/hugemaker.cpp: In member function ‘void WordMaker::reduce_step1()’:
src/hugemaker.cpp:806:79: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
fprintf(stderr, "open %s ok keys:%d\n", trie_file.c_str(), trie.num_keys());
^
src/hugemaker.cpp:813:75: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘std::size_t {aka long unsigned int}’ [-Wformat=]
fprintf(stderr, "build marisa trie. real total words:%d \n", kset.size());
^
src/hugemaker.cpp:828:32: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i = 0; i < kset.size(); i++) {
^
In file included from src/hugemaker.cpp:42:0:
src/cedar.h: In member function ‘int cedar::da<value_type, NO_VALUE, NO_PATH, ORDERED, MAX_TRIAL, NUM_TRACKING_NODES>::open(const char*, const char*, size_t, size_t) [with value_type = unsigned int; int NO_VALUE = -1; int NO_PATH = -2; bool ORDERED = true; int MAX_TRIAL = 1; long unsigned int NUM_TRACKING_NODES = 0ul; size_t = long unsigned int]’:
src/cedar.h:309:7: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
std::fread (&bheadF, sizeof (int), 1, fp);
^
src/cedar.h:310:7: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
std::fread (&bheadC, sizeof (int), 1, fp);
^
src/cedar.h:311:7: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
std::fread (&bheadO, sizeof (int), 1, fp);
^
Linking CXX executable ../bin/hugemaker
[ 90%] Built target hugemaker
Scanning dependencies of target wordmaker
[100%] Building CXX object src/CMakeFiles/wordmaker.dir/wordmaker.cpp.o
src/wordmaker.cpp: In constructor ‘WordMaker::trie_combine_t::trie_combine_t(ptrie_t, WordMaker*)’:
src/wordmaker.cpp:196:14: warning: ‘WordMaker::trie_combine_t::pmaker’ will be initialized after [-Wreorder]
WordMaker* pmaker;
^
src/wordmaker.cpp:194:10: warning: ‘size_t WordMaker::trie_combine_t::word_len’ [-Wreorder]
size_t word_len;
^
src/wordmaker.cpp:176:3: warning: when initialized here [-Wreorder]
trie_combine_t(ptrie_t pt, WordMaker* maker):ptrie(pt)
^
src/wordmaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
src/wordmaker.cpp:261:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
^
src/wordmaker.cpp:261:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (( tmp_len <= SHORTEST_WORD_LEN) || (tmp_len >= WORD_LEN)) {
^
src/wordmaker.cpp:281:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(left_f <= LEAST_FREQ)
^
src/wordmaker.cpp:288:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(right_f <= LEAST_FREQ)
^
src/wordmaker.cpp: In constructor ‘WordMaker::WordMaker(const char*, int)’:
src/wordmaker.cpp:547:15: warning: ‘WordMaker::step1_done’ will be initialized after [-Wreorder]
volatile int step1_done;
^
src/wordmaker.cpp:545:9: warning: ‘int WordMaker::thread_n’ [-Wreorder]
int thread_n;
^
src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
WordMaker(const char* filename
^
src/wordmaker.cpp:545:9: warning: ‘WordMaker::thread_n’ will be initialized after [-Wreorder]
int thread_n;
^
src/wordmaker.cpp:538:15: warning: ‘pstring_list WordMaker::phz_str’ [-Wreorder]
pstring_list phz_str;
^
src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
WordMaker(const char* filename
^
Linking CXX executable ../bin/wordmaker
[100%] Built target wordmaker
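
The warnings in this log fall into four groups: -Wsign-compare, -Wformat, -Wunused-result and -Wreorder. The log alone does not show whether any of them is related to the assertion failure. As a minimal sketch of how each kind of warning is usually addressed, the snippet below uses hypothetical stand-in functions and types (count_nonempty, format_key_count, read_int, Combiner); none of these names come from the wordmaker source.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// -Wsign-compare: use an unsigned index type that matches size().
std::size_t count_nonempty(const std::vector<std::string>& keys) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (!keys[i].empty()) ++n;
    }
    return n;
}

// -Wformat: a size_t argument is printed with %zu rather than %d.
int format_key_count(char* buf, std::size_t buf_len, std::size_t num_keys) {
    return std::snprintf(buf, buf_len, "saved keys:%zu", num_keys);
}

// -Wunused-result: check how many items fread actually delivered.
bool read_int(std::FILE* fp, int* out) {
    return std::fread(out, sizeof(int), 1, fp) == 1;
}

// -Wreorder: write the initializer list in member declaration order.
struct Combiner {
    std::size_t word_len;  // declared first, so it is initialized first
    const char* name;
    Combiner(const char* n, std::size_t len) : word_len(len), name(n) {}
};
```

Checking the fread result matters most for diagnosis: a short read of a truncated file would otherwise go unnoticed.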

Duplicate words discovered

term freq left entropy right entropy

湿隔离 257 12.850000 2.502627

保湿隔离 255 12.750000 2.498990

These two could really be merged into one entry; it would be good to add support for this in the code.
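
One possible way to merge such entries is to drop a candidate when a longer candidate contains it and occurs almost as often. The sketch below is only an illustration of that idea: Candidate, drop_nested and the min_ratio parameter are hypothetical names, not part of wordmaker, and the byte-wise substring test assumes UTF-8 (in GBK it could match across character boundaries).

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Candidate {
    std::string term;
    unsigned freq;
};

// Keeps a candidate unless some longer candidate contains it and has a
// frequency of at least min_ratio times the shorter candidate's frequency.
std::vector<Candidate> drop_nested(const std::vector<Candidate>& in,
                                   double min_ratio) {
    std::vector<Candidate> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool covered = false;
        for (std::size_t j = 0; j < in.size() && !covered; ++j) {
            if (i == j || in[j].term.size() <= in[i].term.size()) continue;
            bool contains = in[j].term.find(in[i].term) != std::string::npos;
            bool close = in[j].freq >= min_ratio * in[i].freq;
            covered = contains && close;
        }
        if (!covered) out.push_back(in[i]);
    }
    return out;
}
```

With the two rows above and min_ratio set to 0.95, 湿隔离 (257) is dropped because 保湿隔离 (255) contains it and 255 is more than 95% of 257.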

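Non-ISO extended-ASCII encoding problem

Why does the file produced by the command ./bin/wordmaker input.txt output.txt display correctly under more but appear garbled in vi, and why can it not be converted to another encoding? Running file output.txt reports Non-ISO extended-ASCII. Any help would be appreciated. The system is Ubuntu.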
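
One way to narrow this down is to test whether the output is well-formed UTF-8 at all. The function below is a standalone diagnostic sketch (is_valid_utf8 is a hypothetical helper, not part of wordmaker); it says nothing about which encoding wordmaker expects, only whether a byte string decodes cleanly as UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns true when s is well-formed UTF-8. Rejects truncated sequences,
// stray continuation bytes, overlong forms, surrogates and values > U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80) { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                      // invalid lead byte
        if (i + len > n) return false;          // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;    // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;
        i += len;
    }
    return true;
}
```

If the output fails this check while the input passes it, the file is either in a different encoding or contains multi-byte characters that were split somewhere in the pipeline.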

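Which statistics does the code use?

Question

hi jannson
In the README of another project you mentioned that this project's algorithm comes from the Matrix67 article. After reading your code, however, I found that the statistics you use are mainly the left and right adjacency entropy, and I did not see cohesion (凝固度) being used. I therefore added a cohesion statistic on top of your code, which filters out some more "pseudo new words". Even so, on some corpora the tool's ability to discover new words is still not very good.

So, which statistics does your code mainly use? Based on your experience, which directions would you suggest for further improvement? Thanks!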

K
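
For readers unfamiliar with the two statistics mentioned in this question, here is a minimal sketch of how they are commonly defined: adjacency entropy as the Shannon entropy of the characters seen next to a candidate, and cohesion as the smallest ratio P(word) / (P(left) * P(right)) over all split points. The names neighbor_entropy, Split and cohesion are hypothetical, and this is not taken from the wordmaker source.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Shannon entropy (natural log) of the counts of distinct neighbor characters.
double neighbor_entropy(const std::vector<unsigned>& counts) {
    double total = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) total += counts[i];
    if (total == 0.0) return 0.0;
    double h = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (counts[i] == 0) continue;
        double p = counts[i] / total;
        h -= p * std::log(p);
    }
    return h;
}

// Probabilities of the two parts produced by one way of splitting a word.
struct Split {
    double p_left;
    double p_right;
};

// Cohesion: the minimum over all splits of P(word) / (P(left) * P(right)).
// Returns 0 when there is no split (a single-character candidate).
double cohesion(double p_word, const std::vector<Split>& splits) {
    if (splits.empty()) return 0.0;
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < splits.size(); ++i) {
        double ratio = p_word / (splits[i].p_left * splits[i].p_right);
        if (ratio < best) best = ratio;
    }
    return best;
}
```

A candidate is typically kept only when both its cohesion and the smaller of its left and right entropies exceed chosen thresholds.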

Compilation problem (follow-up to the first issue)

I downloaded the new version of the code. cmake now runs without problems, but make still fails:
[100%] Building CXX object src/CMakeFiles/wordmaker.dir/wordmaker.cpp.o
/home/bigdata/wordmaker-master/src/wordmaker.cpp:325:26: error: ‘constexpr’ needed for in-class initialization of static data member ?.?.of non-integral type
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::trie_combine_t::trie_combine_t(ptrie_t, WordMaker*)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:196:14: warning: ‘WordMaker::trie_combine_t::pmaker’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:194:10: warning: ‘size_t WordMaker::trie_combine_t::word_len’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:176:3: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:268:45: error: ?.?.was not declared in this scope
/home/bigdata/wordmaker-master/src/wordmaker.cpp:280:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:287:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::WordMaker(const char*, int)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:547:15: warning: ‘WordMaker::step1_done’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘int WordMaker::thread_n’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘WordMaker::thread_n’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:538:15: warning: ‘pstring_list WordMaker::phz_str’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
make[2]: *** [src/CMakeFiles/wordmaker.dir/wordmaker.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/wordmaker.dir/all] Error 2
make: *** [all] Error 2
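
The first error in this log is the one GCC typically reports when a static data member of a non-integral type (for example float or double) is given an initializer inside the class body without constexpr. The member's name is not legible in the log, so the sketch below uses a hypothetical class Config with a hypothetical member THRESHOLD to show a form that compiles in both C++98 and C++11 modes.

```cpp
// Hypothetical example; Config and THRESHOLD are not names from wordmaker.
struct Config {
    // Integral type: an in-class initializer is allowed.
    static const int WORD_LEN = 6;

    // Non-integral type: only declare it here.
    // (In C++11 and later, "static constexpr double THRESHOLD = 0.5;"
    // inside the class would also be accepted.)
    static const double THRESHOLD;
};

// Define and initialize the non-integral member exactly once, out of class.
const double Config::THRESHOLD = 0.5;
```

The later "was not declared in this scope" error at line 268 may be a consequence of the first one, but the garbled identifier makes that impossible to confirm from the log alone.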

Compilation problem

Running cmake with cmake 2.8.7 reports warnings, and make fails with errors.
cmake output:
wordmaker-master/build$ cmake ..
CMake Warning (dev) in CMakeLists.txt:
No cmake_minimum_required command is present. A line of code such as

cmake_minimum_required(VERSION 2.8)

should be added at the top of the file. The version specified may be lower
if you wish to support older CMake versions for this project. For more
information run "cmake --help-policy CMP0000".
This warning is for project developers. Use -Wno-dev to suppress it.

-- Configuring done
CMake Warning (dev) at src/CMakeLists.txt:16 (ADD_EXECUTABLE):
Policy CMP0003 should be set before this line. Add code such as

if(COMMAND cmake_policy)
  cmake_policy(SET CMP0003 NEW)
endif(COMMAND cmake_policy)

as early as possible but after the most recent call to
cmake_minimum_required or cmake_policy(VERSION). This warning appears
because target "hugemaker" links to some libraries for which the linker
must search:

pthread

and other libraries with known full path:

/home/bigdata/wordmaker-master/build/lib/libmarisa.a

CMake is adding directories in the second list to the linker search path in
case they are needed to find libraries from the first list (for backwards
compatibility with CMake 2.4). Set policy CMP0003 to OLD or NEW to enable
or disable this behavior explicitly. Run "cmake --help-policy CMP0003" for
more information.
This warning is for project developers. Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /home/bigdata/wordmaker-master/build

#################### separator

make errors:
wordmaker-master/build$ make
[ 81%] Built target marisa
[ 90%] Built target hugemaker
[100%] Building CXX object src/CMakeFiles/wordmaker.dir/wordmaker.o
/home/bigdata/wordmaker-master/src/wordmaker.cpp:325:26: error: ‘constexpr’ needed for in-class initialization of static data member ?.?.of non-integral type
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::trie_combine_t::trie_combine_t(ptrie_t, WordMaker*)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:196:14: warning: ‘WordMaker::trie_combine_t::pmaker’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:194:10: warning: ‘size_t WordMaker::trie_combine_t::word_len’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:176:3: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In member function ‘virtual void WordMaker::cad_gen_t::operator()(trie_result_t&)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:261:55: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:268:45: error: ?.?.was not declared in this scope
/home/bigdata/wordmaker-master/src/wordmaker.cpp:280:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:287:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
/home/bigdata/wordmaker-master/src/wordmaker.cpp: In constructor ‘WordMaker::WordMaker(const char*, int)’:
/home/bigdata/wordmaker-master/src/wordmaker.cpp:547:15: warning: ‘WordMaker::step1_done’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘int WordMaker::thread_n’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:545:9: warning: ‘WordMaker::thread_n’ will be initialized after [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:538:15: warning: ‘pstring_list WordMaker::phz_str’ [-Wreorder]
/home/bigdata/wordmaker-master/src/wordmaker.cpp:329:2: warning: when initialized here [-Wreorder]
make[2]: *** [src/CMakeFiles/wordmaker.dir/wordmaker.o] Error 1
make[1]: *** [src/CMakeFiles/wordmaker.dir/all] Error 2
make: *** [all] Error 2

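hugemaker fails at the end of the run

Previously I used hugemaker for new-word discovery on files of around 100 MB and it ran correctly. Today, however, the following error appeared. Could the maintainer please advise:
hugemaker: /home/supermicro/programs/wordmaker-master/src/hugemaker.cpp:808: void WordMaker::reduce_step1(): Assertion `-1 != open_status' failed.
Aborted (core dumped)
The word_freq.txt_seq files all look normal.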
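
The assertion message only says that open_status was -1; it does not say which file failed to open or why. As a diagnostic sketch, a check like the one below could be run on the file path before the open call to surface the operating-system error. describe_open_failure is a hypothetical helper and not part of wordmaker.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Returns an empty string when path can be opened for reading, otherwise
// a message containing the path and the system error text.
std::string describe_open_failure(const std::string& path) {
    errno = 0;
    std::FILE* fp = std::fopen(path.c_str(), "rb");
    if (fp) {
        std::fclose(fp);
        return std::string();
    }
    return "cannot open " + path + ": " + std::strerror(errno);
}
```

Printing this message to stderr before the assert would distinguish a missing intermediate file from, for example, a permissions problem or a limit on open files.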

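hugemaker core dump

Hi,

Processing a document of about 400 MB with hugemaker fails at line 808 of src/hugemaker.cpp, while reading the final seq file:
assert(-1 != open_status);
A 20 MB document is processed without problems, but the result is garbled, and trying to convert it from GBK to UTF-8 does not work either. The encoding of the input document is:
file ~/work_corpus_head
/home/ubuntu/work_corpus_head: UTF-8 Unicode text, with very long lines
The encoding of the output document is:
file work_word_new
work_word_new: Non-ISO extended-ASCII text, with LF, NEL line terminators
Are there any requirements on the format of the input document?
Thanks!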
