A lightweight, high-performance Chinese word segmentation project

License: BSD 2-Clause "Simplified" License

CMake 4.41% C++ 94.53% C 1.06%
chinese frequency-dictionary hidden-markov-model nlp-chinese word-break word-segment word-segmentation word-segmenter wordbreak wordseg wordsegmentation

fastcws

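Animated demo

As the title says, fastcws is extremely fast. The animated demo shows that a cold start takes only 0.125 s to load, and a cold start plus segmenting 180,000 characters takes only 0.35 s. By a rough estimate, that is already on the order of a million characters per second on a single core!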
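The rough estimate can be reproduced with a little arithmetic. The sketch below assumes the load time can simply be subtracted from the total run time to isolate the segmentation step; `segmentation_throughput` is a name made up for this illustration and is not part of fastcws.

```cpp
// Characters per second for the segmentation step alone.
// Assumption: total_s covers load + segmentation, load_s covers load only.
constexpr double segmentation_throughput(double chars, double total_s, double load_s) {
    return chars / (total_s - load_s);
}
```

With the figures from the demo, segmentation_throughput(180000, 0.35, 0.125) works out to about 800,000 characters per second, which is the order of magnitude the estimate refers to.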

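Command-line tool

The fastcws command-line tool (located at src/tools/fastcws when built from source) reads from stdin, segments the input sentence by sentence, and writes the result to stdout: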

$ fastcws
在春风吹拂的季节翩翩起舞
在/春风/吹拂/的/季节/翩翩起舞/

A pipe makes it easy to segment a file and save the result to another file:

$ cat input.txt | fastcws > output.txt

It also supports custom delimiters, loading dictionaries and HMM models from files, and more. See fastcws --help for details.

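Notes for Windows

On Windows the default encoding is UTF-16, but this project currently uses UTF-8 as its only encoding.

When typing input directly at the command line you do not need to worry about this, because the tool uses nowide to convert automatically: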

$ fastcws
在春风吹拂的季节翩翩起舞
在/春风/吹拂/的/季节/翩翩起舞/

When segmenting a file through a pipe, make sure the file is saved as UTF-8 without a BOM. Otherwise segmentation may misbehave or produce errors:

$ type input.txt | fastcws.exe > output.txt

input.txt must be saved as UTF-8.
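A quick pre-flight check for the BOM requirement can be sketched in C++ as follows. The helper names `has_utf8_bom` and `strip_utf8_bom` are made up for this illustration and are not part of fastcws. The check only looks for the three BOM bytes EF BB BF; it does not validate that the rest of the data is well-formed UTF-8.

```cpp
#include <string_view>

// Returns true if the buffer starts with the UTF-8 byte order mark EF BB BF.
constexpr bool has_utf8_bom(std::string_view data) {
    return data.size() >= 3 &&
           static_cast<unsigned char>(data[0]) == 0xEF &&
           static_cast<unsigned char>(data[1]) == 0xBB &&
           static_cast<unsigned char>(data[2]) == 0xBF;
}

// Returns a view of the same bytes with a leading BOM skipped, if present.
// No data is copied; the view still refers to the caller's buffer.
constexpr std::string_view strip_utf8_bom(std::string_view data) {
    return has_utf8_bom(data) ? data.substr(3) : data;
}
```

Running a file's contents through strip_utf8_bom before writing them to the pipe is one way to satisfy the "no BOM" condition without re-saving the file.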

C library

This project is written in C++17, but the shared library produced by the build exposes a stable C API for calling the segmenter:

// #include "libfastcws.h"

fastcws_init();
fastcws_result* result = fastcws_alloc_result();

// Segment a UTF-8 string; a nonzero return value signals an error.
int err = fastcws_word_break("在春风吹拂的季节翩翩起舞", result);
if (err) {
	...
}
// Walk the result: each word is reported as a (pointer, length) pair.
const char *word_begin;
size_t word_len;
while(fastcws_result_next(result, &word_begin, &word_len) == 0) {
	...
}
fastcws_result_free(result);

As you can see, segmentation is zero-copy, which is why it performs so well.
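To make the zero-copy point concrete, here is a small C++ sketch of how a caller might consume (pointer, length) pairs. It does not call the fastcws API: `WordSpan` and `join_words` are hypothetical names, and the assumption is that each reported word points into a buffer the caller keeps alive.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// A word as a zero-copy API might report it: a pointer into the caller's
// input buffer plus a byte length. Hypothetical type, not part of fastcws.
struct WordSpan {
    const char* begin;
    std::size_t len;
};

// Joins words with a separator, mimicking the CLI output format
// (including the trailing separator). Bytes are copied only here,
// when the output string is built.
std::string join_words(const std::vector<WordSpan>& words, char sep) {
    std::string out;
    for (const WordSpan& w : words) {
        std::string_view view(w.begin, w.len);  // no copy: pointer + length only
        out.append(view);
        out.push_back(sep);
    }
    return out;
}
```

Because the spans are not NUL-terminated copies, they should be handled with an explicit length (as std::string_view does) and must not outlive the buffer they point into.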

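The C API likewise supports loading dictionaries and HMM models from files. More samples are available in the examples directory.

Note again that the data passed in must be UTF-8 encoded.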

Build and install

As with most CMake projects:

git submodule update --init --recursive
cmake -S . -B build
cmake --build build
cmake --build build --target install

Contributors

omegacoleman

fastcws's Issues

Infinite loop on macOS

The problem is that in_avail always returns 0 on macOS. Here is a workaround, in include/fastcws/sentence_split/tokenizer.hpp:

size_t sz = std::min<size_t>(buffer.size(), is_.rdbuf()->in_avail());
#ifdef __APPLE__
if (sz==0) sz=1;
#endif

Error when running cmake --build build

cmake --build build fails with the following error:
[root@idtcentos7 fastcws]# cmake --build build
[ 1%] Building C object external/zlib/CMakeFiles/zlib.dir/adler32.o
[ 2%] Building C object external/zlib/CMakeFiles/zlib.dir/compress.o
[ 3%] Building C object external/zlib/CMakeFiles/zlib.dir/crc32.o
[ 4%] Building C object external/zlib/CMakeFiles/zlib.dir/deflate.o
[ 5%] Building C object external/zlib/CMakeFiles/zlib.dir/gzclose.o
[ 6%] Building C object external/zlib/CMakeFiles/zlib.dir/gzlib.o
[ 7%] Building C object external/zlib/CMakeFiles/zlib.dir/gzread.o
[ 8%] Building C object external/zlib/CMakeFiles/zlib.dir/gzwrite.o
[ 9%] Building C object external/zlib/CMakeFiles/zlib.dir/inflate.o
[ 10%] Building C object external/zlib/CMakeFiles/zlib.dir/infback.o
[ 11%] Building C object external/zlib/CMakeFiles/zlib.dir/inftrees.o
[ 12%] Building C object external/zlib/CMakeFiles/zlib.dir/inffast.o
[ 13%] Building C object external/zlib/CMakeFiles/zlib.dir/trees.o
[ 14%] Building C object external/zlib/CMakeFiles/zlib.dir/uncompr.o
[ 15%] Building C object external/zlib/CMakeFiles/zlib.dir/zutil.o
[ 17%] Linking C shared library libz.so
[ 17%] Built target zlib
[ 18%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/adler32.o
[ 19%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/compress.o
[ 20%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/crc32.o
[ 21%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/deflate.o
[ 22%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/gzclose.o
[ 23%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/gzlib.o
[ 24%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/gzread.o
[ 25%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/gzwrite.o
[ 26%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/inflate.o
[ 27%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/infback.o
[ 28%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/inftrees.o
[ 29%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/inffast.o
[ 30%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/trees.o
[ 31%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/uncompr.o
[ 32%] Building C object external/zlib/CMakeFiles/zlibstatic.dir/zutil.o
[ 34%] Linking C static library libz.a
[ 34%] Built target zlibstatic
[ 35%] Building C object external/zlib/CMakeFiles/example.dir/test/example.o
[ 36%] Linking C executable example
[ 36%] Built target example
[ 37%] Building C object external/zlib/CMakeFiles/minigzip.dir/test/minigzip.o
/root/fastcws/external/zlib/test/minigzip.c: In function ‘file_uncompress’:
/root/fastcws/external/zlib/test/minigzip.c:503:5: error: unknown type name ‘z_size_t’
z_size_t len = strlen(file);
^
gmake[2]: *** [external/zlib/CMakeFiles/minigzip.dir/test/minigzip.o] Error 1
gmake[1]: *** [external/zlib/CMakeFiles/minigzip.dir/all] Error 2
gmake: *** [all] Error 2
