thunlp / thulac

An Efficient Lexical Analyzer for Chinese

License: MIT License

Makefile 0.92% C++ 98.58% CMake 0.33% C 0.17%

chinese-nlp

thulac's Introduction

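THULAC: An Efficient Lexical Analyzer for Chinese

Project Overview

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University. It provides Chinese word segmentation and part-of-speech (POS) tagging. THULAC has the following features:

Strong capability. The model is trained on a manually segmented and POS-tagged Chinese corpus that we assembled (about 58 million characters), currently the largest of its kind in the world, which gives it strong annotation ability.
High accuracy. On the standard Chinese Treebank (CTB5) dataset, the toolkit reaches an F1 score of 97.3% for word segmentation and 92.9% for POS tagging, on par with the best methods on that dataset.
Fast. Joint segmentation and POS tagging runs at 300 KB/s, about 150,000 characters per second. Segmentation alone reaches 1.3 MB/s.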

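Compilation and Installation

C++ version

  Run the following in the current directory:
  make
  This produces thulac and train_c in the current directory.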

Usage

1. Segmentation and POS tagging program

1.1. Command format

C++ version
- ./thulac [-t2s] [-seg_only] [-deli delimiter] [-user userword.txt] reads from and writes to the command line
- ./thulac [-t2s] [-seg_only] [-deli delimiter] [-user userword.txt] [-input inputfile] [-output outputfile] reads from and writes to text files (both must be UTF-8 text)

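1.2. General parameters

-t2s			    convert the sentence from Traditional to Simplified Chinese
-seg_only		    perform segmentation only, without POS tagging
-deli delimiter		set the delimiter between a word and its POS tag; the default is the underscore _
-filter				use a filter to remove some words that carry little meaning, such as "可以"
-user userword.txt	set the user dictionary; words from the user dictionary are tagged uw. One word per line, UTF-8 encoded (not yet available in the Python version)
-model_dir dir		set the directory containing the model files; the default is models/
-input inputfile	set the input file path
-output outputfile	set the output file path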

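1.3. API usage example

The new version of THULAC provides an API for segmentation and POS tagging. Copy the include folder into your own project's include directory and add #include "thulac.h" to your program to call the functions THULAC provides.

For details on usage, see src/thulac.cc.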

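1.4. API parameters

First instantiate the THULAC class; the following interfaces can then be called:

int init(const char* model_path = NULL, const char* user_path = NULL, int just_seg = 0, int t2s = 0, int ufilter = 0, char separator = '_'); Initializes the instance with custom settings.

  user_path           path to the user dictionary; words from the user dictionary are tagged uw. One word per line, UTF-8 encoded
  t2s                 default 0 (false); whether to convert the sentence from Traditional to Simplified Chinese
  just_seg            default 0 (false); whether to perform segmentation only, without POS tagging
  ufilter             default 0 (false); whether to use a filter to remove some words that carry little meaning, such as "可以"
  model_path          directory containing the model files; the default is models/
  separator           default '_'; the delimiter between a word and its POS tag

int cut(const std::string&, THULAC_result& result); Takes a string to be segmented and POS-tagged and a THULAC_result variable; the result is stored in result.

THULAC_result is a typedef of std::vector<std::pair<std::string, std::string> >. That is, cut produces a std::vector<std::pair<word, tag> >. In segmentation-only mode the tag is '' (an empty string).
THULAC_result& multiTreadCut(const std::string &in, THULAC& lac, int thread); Takes a string to be segmented and POS-tagged, a THULAC instance, and a thread count, and returns a THULAC_result.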
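
As a rough illustration of the result type described above, the sketch below declares the same typedef locally (instead of including thulac.h) and joins each word/tag pair with a separator. format_result is a hypothetical helper introduced here for illustration; it is not part of THULAC.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Local stand-in for the typedef described above; in a real project it
// would come from thulac.h instead.
typedef std::vector<std::pair<std::string, std::string> > THULAC_result;

// Hypothetical helper (not part of THULAC): joins words and tags into one
// space-separated line. A word whose tag is empty (segmentation-only mode)
// is emitted without a separator.
std::string format_result(const THULAC_result& result, char separator = '_') {
    std::string out;
    for (std::size_t i = 0; i < result.size(); ++i) {
        if (i > 0) out += ' ';
        out += result[i].first;
        if (!result[i].second.empty()) {
            out += separator;
            out += result[i].second;
        }
    }
    return out;
}
```

In a program linked against THULAC, the THULAC_result filled in by cut could presumably be passed to such a helper unchanged, since it is the same vector-of-pairs type.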

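1.5. Using the segmentation and POS tagging models

THULAC requires segmentation and POS tagging models. Users can download them from thulac.thunlp.org after filling in their personal information and place them in THULAC's root directory, or use the -model_dir dir parameter to specify where the models are.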

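2. Model training program

train_c is the training program for THULAC's segmentation model. Users can run train_c to train their own THULAC segmentation model.

2.1. Command format

	./train_c [-s separator] [-b bigram_threshold] [-i iteration] training_filename model_filename   
	Uses training_filename as the training set; the trained model is named model_filename

2.2. Parameters

	-s 				set the delimiter between a word and its POS tag; the default is the slash /
	-b				set the bigram threshold; the default is 1
	-i				set the number of training iterations; the default is 15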
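2.3. Training set format

Using the default delimiter (slash /) as an example, the training set should consist of POS-tagged sentences such as

	我/r 爱/vm 北京/ns ***/ns

To train a segmentation-only model, again using the default delimiter (slash /), the training set should consist of sentences such as

	我/ 爱/ 北京/ ***/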
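
To make the format concrete, here is a minimal sketch of how one training line could be split into (word, tag) pairs. parse_training_line is a hypothetical helper written for this illustration; it is not the parser train_c uses, and it assumes tokens are separated by ASCII whitespace.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not THULAC's own parser): splits one training line
// into (word, tag) pairs. The tag is empty for segmentation-only data,
// where each token ends with the separator.
std::vector<std::pair<std::string, std::string> >
parse_training_line(const std::string& line, char separator = '/') {
    std::vector<std::pair<std::string, std::string> > pairs;
    std::istringstream tokens(line);
    std::string token;
    while (tokens >> token) {
        // Split at the last separator so the word itself may contain one.
        std::string::size_type pos = token.rfind(separator);
        if (pos == std::string::npos) {
            pairs.push_back(std::make_pair(token, std::string()));
        } else {
            pairs.push_back(std::make_pair(token.substr(0, pos),
                                           token.substr(pos + 1)));
        }
    }
    return pairs;
}
```

Splitting at the last separator is a design choice made for this sketch, so that a token such as a date containing a slash still yields a sensible word; whether train_c behaves the same way is not stated in this document.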

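2.4. Using the trained model

Overwrite the corresponding model in models with the trained model; the segmentation program will then use the trained model.

3. Obtaining models

To obtain pre-trained THULAC models, download them from thulac.thunlp.org after filling in your personal information.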

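Performance Comparison with Representative Segmentation Tools

We compared THULAC with representative Chinese word segmentation tools such as LTP, ICTCLAS, and jieba. Using Windows as the test environment, we measured the speed and accuracy of each tool following the evaluation standard released by the Second International Chinese Word Segmentation Bakeoff.

The Second International Chinese Word Segmentation Bakeoff provided test corpora from four institutions (Academia Sinica, City University, Peking University, and Microsoft Research). The bakeoff resource icwb2-data contains each institution's training set (training) and test set (testing), as well as the gold-standard answers for each test set according to that institution's segmentation standard (icwb2-data/scripts/gold). The icwb2-data/scripts directory contains score, a Perl script that scores segmentation output automatically.

We tested several popular segmentation tools and THULAC in the same environment, each using the model shipped with it. THULAC used Model_1, the simple model provided with the software. The test machine had an Intel Core i5 2.4 GHz CPU. The results are as follows: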

msr_test (560 KB)

Algorithm	Time	Precision	Recall
LTP-3.2.0	3.21s	0.867	0.896
ICTCLAS (2015 version)	0.55s	0.869	0.914
jieba	0.26s	0.814	0.809
THULAC	0.62s	0.877	0.899

pku_test (510 KB)

Algorithm	Time	Precision	Recall
LTP-3.2.0	3.83s	0.960	0.947
ICTCLAS (2015 version)	0.53s	0.939	0.944
jieba	0.23s	0.850	0.784
THULAC	0.51s	0.944	0.908

In addition to the evaluation on the standard test sets above, we also measured the speed of each tool on a large data set. The results are as follows:

CNKI_journal.txt (51 MB)

Algorithm	Time	Speed
LTP-3.2.0	348.624s	149.80KB/s
ICTCLAS (2015 version)	106.461s	490.59KB/s
jieba	22.5583s	2314.89KB/s
THULAC	42.625s	1221.05KB/s

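POS Tag Reference

n/noun, np/person name, ns/place name, ni/organization name, nz/other proper noun
m/numeral, q/measure word, mq/numeral-measure compound, t/time word, f/locality word, s/location word
v/verb, vm/modal verb, vd/directional verb, a/adjective, d/adverb
h/prefix, k/suffix, i/idiom, j/abbreviation
r/pronoun, c/conjunction, p/preposition, u/auxiliary, y/modal particle
e/interjection, o/onomatopoeia, g/morpheme, w/punctuation, x/other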
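
When post-processing tagged output, it can be handy to turn a tag back into a readable label. The sketch below does this with a lookup table built from the list above plus the uw tag that the parameter section mentions for user-dictionary words. describe_tag is a hypothetical helper, not something THULAC provides, and it assumes a C++11 compiler.

```cpp
#include <map>
#include <string>

// Hypothetical helper (not part of THULAC): returns an English gloss for a
// POS tag, or "unknown" when the tag is not in the table.
std::string describe_tag(const std::string& tag) {
    static const std::map<std::string, std::string> kTags = {
        {"n", "noun"}, {"np", "person name"}, {"ns", "place name"},
        {"ni", "organization name"}, {"nz", "other proper noun"},
        {"m", "numeral"}, {"q", "measure word"},
        {"mq", "numeral-measure compound"}, {"t", "time word"},
        {"f", "locality word"}, {"s", "location word"},
        {"v", "verb"}, {"vm", "modal verb"}, {"vd", "directional verb"},
        {"a", "adjective"}, {"d", "adverb"},
        {"h", "prefix"}, {"k", "suffix"}, {"i", "idiom"},
        {"j", "abbreviation"},
        {"r", "pronoun"}, {"c", "conjunction"}, {"p", "preposition"},
        {"u", "auxiliary"}, {"y", "modal particle"},
        {"e", "interjection"}, {"o", "onomatopoeia"}, {"g", "morpheme"},
        {"w", "punctuation"}, {"x", "other"},
        {"uw", "user dictionary word"}
    };
    std::map<std::string, std::string>::const_iterator it = kTags.find(tag);
    return it == kTags.end() ? std::string("unknown") : it->second;
}
```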

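THULAC Models

Model_1 is a simple segmentation model shipped with the THULAC source code. It supports segmentation only and was trained on the People's Daily segmentation corpus.
Model_2 is a joint segmentation and POS tagging model shipped with the THULAC source code. It supports simultaneous segmentation and POS tagging and was trained on the People's Daily segmentation and POS tagging corpus.
We also provide Model_3, a more complex, complete, and accurate joint segmentation and POS tagging model, together with a segmentation word list. This model was trained jointly on multiple corpora (including annotated text from multiple genres and annotated People's Daily text). Because the model is large, institutions or individuals who need it should fill in "doc/资源申请表.doc" (the resource application form) and send it to [email protected]. After review, we will send the resources to the contact person.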

Notes

This tool currently handles only UTF-8 encoded Chinese text. Support for other encodings will be added gradually.

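Implementations in Other Languages

History

Date	Change
2016-09-29	Added the .so version of the THULAC segmenter.
2016-03-31	Added the Python version of the THULAC segmenter.
2016-01-20	Added the Java version of the THULAC segmenter.
2016-01-10	Open-sourced the C++ version of the THULAC segmenter.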

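License

THULAC's source code is freely available for research purposes to universities, research institutes, enterprises, and individuals in China and abroad.
Institutions or individuals who intend to use THULAC for commercial purposes should email [email protected] to discuss a technology license agreement.
Comments and suggestions on the toolkit are welcome. Please email [email protected].
If you publish a paper or obtain research results based on THULAC, please state that you "used Tsinghua University's THULAC" when publishing the paper or reporting the results, and cite it in the following format:
- Chinese: 孙茂松, 陈新雄, 张开旭, 郭志芃, 刘知远. THULAC：一个高效的中文词法分析工具包. 2016.
- English: Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016.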

Authors

Maosong Sun (孙茂松, advisor), Xinxiong Chen (陈新雄, PhD student), Kaixu Zhang (张开旭, master's student), Zhipeng Guo (郭志芃, undergraduate student), Zhiyuan Liu (刘知远, assistant professor).


thulac's Issues

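An error in the documentation and in the program's parameter help message

#1.1. Command format
C++ version
./thulac [-t2s] [-seg_only] [-deli delimeter] [-user userword.txt] reads from and writes to the command line
./thulac [-t2s] [-seg_only] [-deli delimeter] [-user userword.txt] [-intput inputfile] [-output outputfile] reads from and writes to text files (both must be UTF-8 text)

I found an error in the C++ version's README.md and in the program's help message: the input file parameter should be "-input", not "intput".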

How to train on a private corpus?

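The authors' C++ engineering urgently needs improvement

Enabling compiler warnings fills the screen with red text.
Running with the defaults also core dumps.

train_c core dumps when training on a large training file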

Program received signal SIGSEGV, Segmentation fault.
_int_realloc (av=av@entry=0x7ffff7830b20 <main_arena>, oldp=oldp@entry=0x1aa9d330, oldsize=oldsize@entry=1451880, nb=nb@entry=1048592) at malloc.c:4262
4262 malloc.c: No such file or directory.
(gdb) bt
#0 _int_realloc (av=av@entry=0x7ffff7830b20 <main_arena>, oldp=oldp@entry=0x1aa9d330, oldsize=oldsize@entry=1451880, nb=nb@entry=1048592) at malloc.c:4262
#1 0x00007ffff74f0839 in __GI___libc_realloc (oldmem=0x1aa9d340, bytes=1048576) at malloc.c:3045
#2 0x000000000041a447 in thulac::DATMaker::extends() ()
#3 0x000000000041a626 in thulac::DATMaker::alloc(std::vector<int, std::allocator<int> >&) ()
#4 0x000000000041a902 in thulac::DATMaker::assign(int, std::vector<int, std::allocator<int> >&, int) ()
#5 0x000000000041aceb in thulac::DATMaker::make_dat(std::vector<thulac::DATMaker::KeyValue, std::allocator<thulac::DATMaker::KeyValue> >&, int) ()
#6 0x00000000004189cd in thulac::TaggingLearner::train(char const*, char const*, char const*, char const*) ()
#7 0x0000000000419970 in main ()

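Model training

Hi,
I followed your procedure to train our own model, but after more than 5 hours of training there is still no result. The corpus size is about 200,000. I am not sure whether something has gone wrong. Have you compared this statistical segmenter with CRF? How do the results compare?
./train_c weibo_data_process model
Everything uses the defaults.
Output log:
separator: [/]
training file "weibo_data_process" scanned

After that it just stays like this. Is it stuck? Is there any other log output, or does it always look like this? Does training take a long time with a corpus of around 200,000?

Thanks.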

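Hadoop streaming cannot read the model

After compiling locally, the program runs successfully. When I run segmentation through Hadoop streaming, however, it fails with an error saying the model data files cannot be found, even though I have configured Hadoop to distribute the files.

Incorrect segmentation of 北京市政府

Strings such as 北京市政府 are segmented as 北京市政府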

Does thulac stop working once a string exceeds a certain length?

Hello, I am using thulac for Chinese word segmentation:
thu1= thulac.thulac("-seg_only")
s=thu1.cut(t)
t is my string, but when t is fairly long the program fails with:
File "tran.py", line 10, in seperate
  s=thu1.cut(t)
File "/Users/liuchong/anaconda/lib/python2.7/thulac/__init__.py", line 100, in cut
  tmp, tagged = self.cws_tagging_decoder.segmentTag(raw, poc_cands)
File "/Users/liuchong/anaconda/lib/python2.7/thulac/character/CBTaggingDecoder.py", line 165, in segmentTag
  self.allowedLabelLists[i] = self.pocsToTags[pocs]
IndexError: list assignment index out of range
Is this because thulac cannot work once the string length exceeds a certain limit?
Thank you.
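
One possible workaround, assuming the failure really is tied to input length as the reporter suspects, is to split the text into shorter pieces and segment each piece separately. The sketch below shows the splitting step in C++ for UTF-8 input; split_utf8 is a hypothetical helper, not a THULAC function, and the right chunk size is not stated anywhere in this document.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if c is a UTF-8 continuation byte (bit pattern 10xxxxxx).
static bool is_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Hypothetical helper: splits UTF-8 text into chunks of at most max_bytes
// bytes without cutting through a multi-byte character.
std::vector<std::string> split_utf8(const std::string& text,
                                    std::size_t max_bytes) {
    std::vector<std::string> chunks;
    if (max_bytes < 4) max_bytes = 4;  // room for the longest UTF-8 sequence
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t end = start + max_bytes;
        if (end >= text.size()) {
            end = text.size();
        } else {
            // Back up until the next chunk would begin on a character start.
            while (end > start &&
                   is_continuation(static_cast<unsigned char>(text[end]))) {
                --end;
            }
            if (end == start) end = start + max_bytes;  // malformed input
        }
        chunks.push_back(text.substr(start, end - start));
        start = end;
    }
    return chunks;
}
```

Cutting at arbitrary character boundaries can still split a word in two, so splitting at newlines or sentence-final punctuation would likely give better segmentation; the byte-based version is shown only because it is the simplest to verify.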

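Segmentation fault when segmenting

The environment is RHEL with g++ 4.8.5. The command is ./thulac -t2s -seg_only -model_dir models/ < ./socorpus.xx.tx. It runs for a while and fails only when it reaches a particular line; the error is segmentation fault (core dumped).

The original corpus is the <content> extracted from sogouCA, converted to UTF-8 text with iconv.
I extracted the text around the problematic sentence and tried it on its own, and it also seems to cause a segmentation fault, so I have attached it below for your reference:
socorpus.xx.txt

Bad link in the README

The link given for icwb2-data, http://sighan.org/bakeoff2005/, points to a pornographic site.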

How were the model data files for regions, time, and so on trained, and can they be modified?

neg.dat
ns.dat
singlepun.dat
t2s.dat
time.dat
xu.dat

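Which model does the algorithm use, and is it the same across versions?

Hello, which model does the algorithm use, and is it the same across versions?
The Python version seems to use CRF; what about the C++ version?
Some material I found online says it uses a structured perceptron, so I am rather confused. Any guidance would be appreciated.

Compilation problem with thulac pro on mac

Running make directly gives Undefined symbols for architecture x86_64: "_libiconv". The help document says to "modify the makefile and add -liconv to the compile command", but where should -liconv be added? I changed $(cxx) $(src_dir)/thulac.cc -o $(dst_dir)/thulac to $(cxx) $(src_dir)/thulac.cc -o $(dst_dir)/thulac -static -liconv and recompiled, which produced a new error: "ld: library not found for -lcrt0.o"

Multithreading support

Would it be possible to support multithreaded segmentation? That is, the model is loaded only once, and multiple threads each process different texts.

After a successful make, running thulac reports Segmentation fault (core dumped)

Similarly, after building THULAC.so, using it from Python also reports Segmentation fault (core dumped)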

Buffer overflow occurred during training process

When I run train_c with the command line:

./train_c train_file outfile

AddressSanitizer reports a memory error (a heap-use-after-free, per the log below):

=================================================================
==11181==ERROR: AddressSanitizer: heap-use-after-free on address 0x7f28a77a201c at pc 0x000000415d9d bp 0x7ffc313dd090 sp 0x7ffc313dd080
READ of size 4 at 0x7f28a77a201c thread T0
    #0 0x415d9c in thulac::NGramFeature::find_bases(int, int, int, int&, int&) include/cb_ngram_feature.h:248
    #1 0x415d9c in thulac::NGramFeature::put_values(int*, int) include/cb_ngram_feature.h:118
    #2 0x415d9c in thulac::TaggingDecoder::put_values() include/cb_tagging_decoder.h:387
    #3 0x42c888 in thulac::TaggingLearner::train(char const*, char const*, char const*, char const*) include/cb_tagging_learner.h:305
    #4 0x404239 in main src/train_c.cc:62
    #5 0x7f28aa28d82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #6 0x404c98 in _start (/home/mfc_fuzz/newprogram/THULAC/train_c+0x404c98)

0x7f28a77a201c is located 202780 bytes inside of 524288-byte region [0x7f28a7770800,0x7f28a77f0800)
freed by thread T0 here:
    #0 0x7f28aac67961 in realloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98961)
    #1 0x426a03 in thulac::DATMaker::shrink() include/dat.h:221
    #2 0x426a03 in thulac::TaggingLearner::train(char const*, char const*, char const*, char const*) include/cb_tagging_learner.h:205
    #3 0xfff8627bb59  (<unknown module>)

previously allocated by thread T0 here:
    #0 0x7f28aac67961 in realloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98961)
    #1 0x43261e in thulac::DATMaker::extends() include/dat.h:207
    #2 0x43261e in thulac::DATMaker::alloc(std::vector<int, std::allocator<int> >&) include/dat.h:235
    #3 0x43261e in thulac::DATMaker::assign(int, std::vector<int, std::allocator<int> >&, int) include/dat.h:270

SUMMARY: AddressSanitizer: heap-use-after-free include/cb_ngram_feature.h:248 thulac::NGramFeature::find_bases(int, int, int, int&, int&)
Shadow bytes around the buggy address:
  0x0fe594eec3b0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0fe594eec3c0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0fe594eec3d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0fe594eec3e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0fe594eec3f0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x0fe594eec400: fd fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd fd
  0x0fe594eec410: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0fe594eec420: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0fe594eec430: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0fe594eec440: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0fe594eec450: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
==11181==ABORTING

The input file I tried to give to the program train_c contains only

我/r 爱/vm 北京/ns ***/ns

as you suggested in your document.
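
The log above shows memory that was released by a realloc call inside DATMaker::shrink() being read afterwards. The general hazard, independent of what THULAC's code actually does, is that realloc may move a block, so any address taken before the call is stale afterwards. The sketch below illustrates one way to stay safe by keeping indices instead of pointers; IntBuffer and read_after_resize are hypothetical names, not THULAC code, and a C++11 compiler is assumed.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>

// Hypothetical growable int buffer built on realloc (not THULAC's DATMaker).
class IntBuffer {
public:
    IntBuffer() : data_(NULL), size_(0) {}
    ~IntBuffer() { std::free(data_); }

    IntBuffer(const IntBuffer&) = delete;
    IntBuffer& operator=(const IntBuffer&) = delete;

    // Grows or shrinks the buffer. realloc may move the block, so callers
    // must not keep raw pointers or references into it across this call.
    void resize(std::size_t n) {
        if (n == 0) {
            std::free(data_);
            data_ = NULL;
            size_ = 0;
            return;
        }
        void* p = std::realloc(data_, n * sizeof(int));
        if (p == NULL) throw std::bad_alloc();
        data_ = static_cast<int*>(p);
        for (std::size_t i = size_; i < n; ++i) data_[i] = 0;  // zero new slots
        size_ = n;
    }

    // Bounds-checked access through the current base pointer.
    int& at(std::size_t i) {
        if (i >= size_) throw std::out_of_range("IntBuffer::at");
        return data_[i];
    }

    std::size_t size() const { return size_; }

private:
    int* data_;
    std::size_t size_;
};

// Safe pattern: remember an index, not an address, across a resize.
int read_after_resize() {
    IntBuffer buf;
    buf.resize(4);
    std::size_t idx = 2;   // an index stays meaningful after realloc
    buf.at(idx) = 42;
    buf.resize(1 << 20);   // may move the underlying block
    return buf.at(idx);    // looked up through the new base pointer
}
```

Using std::vector<int> with indices would give the same guarantee with less code; the realloc version is shown because that is the allocation call that appears in the trace.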

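Punctuation is segmented incorrectly in English text

hello word. ---> hello_x word._x

How can this be used on Windows?

After importing it into my project as described in the documentation, the header files produce many errors like error: multiple definition of `thulac::operator>>(std::istream&, int&)'.

Hello, where else can I obtain training data?

For study purposes.

Why are segmentation and tagging so slow after import thulac?

After import thulac:
thu1 = thulac.thulac("-seg_only") # segmentation-only mode
thu2 = thulac.thulac("-deli /") # tagging mode

thu1.cut()
500 KB of text takes 4.63 seconds
thu2.cut()
Tagging mode also barely runs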

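With a self-trained model, words containing digits are segmented incorrectly

I used THULAC's trainer to build a model from my own training data. When I segment and tag text with the new model, places containing digits are segmented incorrectly. For example:

A sample of the training data:
北京市/city 昌平区/district 北六环/road 59号/house_number 小汤山桥/name
Segmenting "北京市昌平区北六环59号小汤山桥" gives this result:
北京市_city 昌平区_district 北六环_road 59_name 号小汤山桥_name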

Compile errors with g++ on mac: one syntax error, one async error, and one private-constructor error

Hello, my environment is macOS:

configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1 Apple LLVM version 10.0.0 (clang-1000.11.45.5) Target: x86_64-apple-darwin18.7.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Error messages:
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/thulac.h:347:42: error: a space is required between consecutive right angle brackets (use '> >') std::vector<std::future<THULAC_result>> t; ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/thulac.h:352:26: error: no member named 'async' in namespace 'std' t.push_back(std::async(&cut, splited[i], lac)); ~~~~~^ In file included from segtest.cpp:1: In file included from /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/thulac.h:3: In file included from /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/preprocess.h:2: In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/vector:266: In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/__bit_reference:15: In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:643: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:1574:36: error: calling a private constructor of class 'std::__1::future<std::__1::vector<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> > > > >' ::new ((void*)__p) _Tp(__a0); ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/vector:1593:25: note: in instantiation of function template specialization 'std::__1::allocator_traits<std::__1::allocator<std::__1::future<std::__1::vector<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> >, 
std::__1::allocator<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> > > > > > >::construct<std::__1::future<std::__1::vector<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> > > > >, std::__1::future<std::__1::vector<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> > > > > >' requested here __alloc_traits::construct(this->__alloc(), ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/thulac.h:352:11: note: in instantiation of member function 'std::__1::vector<std::__1::future<std::__1::vector<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> > > > >, std::__1::allocator<std::__1::future<std::__1::vector<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char>, std::__1::basic_string<char> > > > > > >::push_back' requested here t.push_back(std::async(&cut, splited[i], lac)); ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/future:1121:5: note: declared private here future(const future&); ^ 3 errors generated.

Why do I get a syntax error here?
What is the cause? Is it due to the C++ version?

segfault during training

➜ THULAC git:(master) ✗ ./train_c pku_test.utf8.txt model_test 
separator: [/]
training file "pku_test.utf8.txt" scanned
DAT (double array TRIE) file "model_test_dat.bin" created
number of labels: 
number of features: 
235588
model file "model_test_model.bin" created
label file "model_test_label.txt" created
[1]    9169 segmentation fault (core dumped)  ./train_c pku_test.utf8.txt model_test

pku_test.utf8.txt file is here.

thanks

Core dump when training_file is large; how to deal with it?

When the training_file is large, the training process cannot move on.

The terminal shows:
train_c: malloc.c:2369: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.

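cws_label.txt file not found, Segmentation fault: 11

When testing with thulac I get an error about cws_label.txt. What is this file, and why is it not explained in the documentation?

Mixed Chinese-English segmentation

Hi, is mixed Chinese-English segmentation currently supported? In my test the text was not segmented correctly.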
In [4]: for t, f in seg.cut('this is a test sentence. 这个是计算广告的数据啊'):
...: print('%s %s' % (t, f))
...:
this x
v
i g
s g
g
a g
g
test np
v
sentence x
. w
j
这个 r
是 v
计算 v
广告 n
的 u
数据 n
啊 u

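The program crashes when the string passed to cut is too long

Memory usage is too high and the models need compression

Memory usage is too high

The model files are not compressed and are too large.
Memory usage is too high: a very large amount of memory is requested all at once.

Consider compressing the model files (after compression my cws_dat is only 18 MB, down from about 60 MB) and decompressing on the fly during use. This would greatly reduce both the model file size and memory usage.

make always fails on macOS

Running make to build the C++ project produces the following errors: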

In file included from src/thulac.cc:7:
src/thulac.h:18:5: warning: control reaches end of non-void function
[-Wreturn-type]
}
^
25 warnings generated.
Undefined symbols for architecture x86_64:
"_iconv", referenced from:
_main in thulac-444612.o
"_iconv_close", referenced from:
thulac::Chinese_Charset_Conv::~Chinese_Charset_Conv() in thulac-444612.o
"_iconv_open", referenced from:
thulac::Chinese_Charset_Conv::Chinese_Charset_Conv() in thulac-444612.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [thulac] Error 1

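I have already aliased g++ and gcc to GNU gcc in .bash_profile, but clang errors are still reported. I have been struggling with this for two full days; any pointers would be appreciated. Thanks!

Build instructions are unclear

The project contains no .sln solution file, so importing the source code does not produce a solution, and many call relationships are missing.

Whether Python or C++, lite or pro, I still have not managed to install any of them...

How to resolve the "longer than max" message during training?

The data is only 500 KB, and the message still appears after changing -b.

The user-defined dictionary sometimes has no effect

Hello, while using this tool I found that some user-defined words have no effect. Take this sentence: "最近，勇士老将伊戈达拉道出了实情！". The segmentation result is "勇士/老将/伊戈达拉道/出了/实情". I defined "道出了" as a custom word, but it does not seem to take effect, and "伊戈达拉道" is still treated as one word. What is the reason? Thanks.

What is the principle behind segmentation and POS tagging?

Hello, I read the paper you referenced, Punctuation as Implicit Annotations for Chinese Word Segmentation (Computational Linguistics), as well as the C++ training code, and the code seems to differ from what the paper describes. I do not understand the principle behind segmentation and POS tagging in your code. Could you explain it? @gzp9595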

Alloc-dealloc-mismatch

When I give an empty file to the program train_c, I found this issue:

=================================================================
==19071==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete) on 0x7f51e6a9b800
    #0 0x7f51e5ac4b2a in operator delete(void*) (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x99b2a)
    #1 0x404318 in thulac::TaggingLearner::~TaggingLearner() include/cb_tagging_learner.h:48
    #2 0x404318 in main src/train_c.cc:65
    #3 0x7f51e50e982f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #4 0x404c98 in _start (/home/mfc_fuzz/newprogram/THULAC/train_c+0x404c98)

0x7f51e6a9b800 is located 0 bytes inside of 200000-byte region [0x7f51e6a9b800,0x7f51e6acc540)
allocated by thread T0 here:
    #0 0x7f51e5ac46b2 in operator new[](unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x996b2)
    #1 0x403ff8 in thulac::TaggingLearner::TaggingLearner(int, int, int) include/cb_tagging_learner.h:42
    #2 0x403ff8 in main src/train_c.cc:57
    #3 0x464fdf  (/home/mfc_fuzz/newprogram/THULAC/train_c+0x464fdf)

SUMMARY: AddressSanitizer: alloc-dealloc-mismatch ??:0 operator delete(void*)
==19071==HINT: if you don't care about these warnings you may set ASAN_OPTIONS=alloc_dealloc_mismatch=0
==19071==ABORTING
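
The report above flags an array allocated with operator new[] in the TaggingLearner constructor and released with plain operator delete in its destructor. As a general illustration of the matching rule (a sketch with hypothetical names, not a patch to THULAC, and assuming a C++11 compiler), either pair new[] with delete[] or let std::vector own the storage:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical owner of a raw array (not THULAC's TaggingLearner).
// Memory obtained with new[] must be released with delete[], not delete.
class RawIntArray {
public:
    explicit RawIntArray(std::size_t n) : data_(new int[n]()), size_(n) {}
    ~RawIntArray() { delete[] data_; }  // array form matches new[]

    RawIntArray(const RawIntArray&) = delete;
    RawIntArray& operator=(const RawIntArray&) = delete;

    int& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }

private:
    int* data_;
    std::size_t size_;
};

// Alternative that avoids manual deallocation entirely: std::vector
// releases its storage correctly in its own destructor.
std::vector<int> make_zeroed(std::size_t n) {
    return std::vector<int>(n, 0);
}
```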

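On Windows

Besides the errors triggered by applying a sign to unsigned types, there are many baffling "undefined" errors (for things that are in fact defined).

Segmentation and POS tagging speed

The official site states that POS tagging reaches 300 KB/s; how many tags does that figure correspond to? The number of tags strongly affects tagging speed, and although preprocessing can reduce the decoding space, the gain is limited. In my tests with 342 tags, throughput was only 27 KB/s. The gap is too large. Does anyone know how to solve this?

Memory usage is over 3 GB; is that normal?

After import thulac, memory usage jumps to more than 3 GB. Is this normal? Is there a way to reduce memory usage?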

Where're Model_1, Model_2 and Model_3 ?

You mention these in "THULAC模型介绍" (THULAC Models), but I could not find these three models. Thanks a lot!

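Segmenting a 1 GB file with 10 million lines yields only about 3.8 million lines of output

I segmented a 1 GB file with 10 million lines in total, but the output contains only about 3.8 million lines. Is this a bug in the software?

Are there plans for a Ruby version?

My experience with the Python version has been quite good. Compared with stanza it is well localized, and I am happy with the results, hence the question.

Pinyin segmentation

It would be great if pinyin segmentation were supported. For example: 北京gugong
Currently gugong cannot be segmented.

Compilation is not supported on Tiger Lake Intel CPUs

make reports the error below. It looks like the Intel CPU architectures are enumerated, and my 11th-generation Intel CPU does not seem to be supported.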

jyy@YUYIJIANG-NB0:/mnt/c/Users/jiangyuyi/THULAC$ sudo make g++ -std=c++11 -O3 -march=native -I include src/thulac.cc -o ./thulac cc1plus: error: bad value (‘tigerlake’) for ‘-march=’ switch cc1plus: note: valid arguments to ‘-march=’ switch are: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm x86-64 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 btver1 btver2 native cc1plus: error: bad value (‘tigerlake’) for ‘-mtune=’ switch cc1plus: note: valid arguments to ‘-mtune=’ switch are: nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm intel x86-64 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 btver1 btver2 generic native make: *** [Makefile:12: thulac] Error 1

SEGV signal occurred when running program thulac

When I run the thulac and thulac_test programs, I get this:

ASAN:SIGSEGV
=================================================================
==12976==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fbf3a4841ba bp 0x000000000000 sp 0x7ffc53739440 T0)
    #0 0x7fbf3a4841b9 in _IO_fread (/lib/x86_64-linux-gnu/libc.so.6+0x6e1b9)
    #1 0x442c38 in fread /usr/include/x86_64-linux-gnu/bits/stdio2.h:295
    #2 0x442c38 in permm::BasicModel<int>::BasicModel(char const*) include/cb_model.h:89
    #3 0x436b5b in THULAC::init(char const*, char const*, int, int, int, char) include/thulac.h:157
    #4 0x404962 in main src/thulac.cc:80
    #5 0x7fbf3a43682f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #6 0x4054e8 in _start (/home/mfc_fuzz/newprogram/THULAC/thulac+0x4054e8)
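
The trace shows fread dereferencing address 0 inside the BasicModel constructor, which would be consistent with a file handle that failed to open (for example, a missing model file), although the report does not confirm the cause. A defensive loading pattern, sketched with a hypothetical helper that is not THULAC's actual loader, checks the result of fopen before reading:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not THULAC's loader): reads a whole file into `out`.
// Returns false instead of crashing when the file cannot be opened or read.
bool read_file(const std::string& path, std::vector<char>& out) {
    out.clear();
    std::FILE* fp = std::fopen(path.c_str(), "rb");
    if (fp == NULL) return false;  // e.g. the model file is missing
    char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof(buf), fp)) > 0) {
        out.insert(out.end(), buf, buf + n);
    }
    bool ok = std::ferror(fp) == 0;
    std::fclose(fp);
    return ok;
}
```

A caller can then report which path failed and exit cleanly, which also makes problems such as a wrong -model_dir value easier to diagnose.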

Header file errors when compiling test_case.cpp

I could not upload a screenshot, so I am pasting the log directly.
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(37): error C2039: “dat”: 不是“thulac::NGramFeature”的成员
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(14): note: 参见“thulac::NGramFeature”的声明
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(62): warning C4018: “<”: 有符号/无符号不匹配
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(66): warning C4018: “<”: 有符号/无符号不匹配
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(67): warning C4018: “<”: 有符号/无符号不匹配
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(128): error C3861: “add_values”: 找不到标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(132): error C3861: “add_values”: 找不到标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(134): error C3861: “add_values”: 找不到标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(136): error C3861: “add_values”: 找不到标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(138): error C3861: “add_values”: 找不到标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(140): error C3861: “add_values”: 找不到标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(142): error C3861: “add_values”: 找不到标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(177): error C2065: “dat”: 未声明的标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(177): error C2228: “.base”的左边必须有类/结构/联合
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(178): error C2065: “dat”: 未声明的标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(178): error C2228: “.check”的左边必须有类/结构/联合
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(179): error C2065: “dat”: 未声明的标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(179): error C2228: “.base”的左边必须有类/结构/联合
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(246): error C2065: “dat”: 未声明的标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(246): error C2228: “.check”的左边必须有类/结构/联合
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(249): error C2065: “dat”: 未声明的标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(249): error C2228: “.base”的左边必须有类/结构/联合
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(250): error C2065: “dat”: 未声明的标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(250): error C2228: “.base”的左边必须有类/结构/联合
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(251): error C2065: “dat”: 未声明的标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(251): error C2228: “.check”的左边必须有类/结构/联合
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(254): error C2065: “dat”: 未声明的标识符
d:\programing\ml\thulac\thulac-master\include\cb_ngram_feature.h(254): error C2228: “.base”的左边必须有类/结构/联合
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h : warning C4819: 该文件包含不能在当前代码页(936)中表示的字符。请将该文件保存为 Unicode 格式以防止数据丢失
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(31): error C2061: 语法错误: 标识符“Node”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(37): error C2143: 语法错误: 缺少“)”(在“;”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(41): error C4430: 缺少类型说明符 - 假定为 int。注意: C++ 不支持默认 int
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(41): error C2146: 语法错误: 缺少“;”(在标识符“best”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(41): error C3927: "->": 非函数声明符后不允许尾随返回类型
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(41): error C3484: 语法错误: 返回类型前应为“->”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(41): error C3613: “->”后缺少返回类型(假定为“int”)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(41): error C2146: 语法错误: 缺少“;”(在标识符“node_id”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(42): error C2143: 语法错误: 缺少“;”(在“*”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(42): error C4430: 缺少类型说明符 - 假定为 int。注意: C++ 不支持默认 int
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(42): error C2086: “int Alpha_Beta”: 重定义
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(41): note: 参见“Alpha_Beta”的声明
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(45): error C2059: 语法错误:“for”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(45): error C2143: 语法错误: 缺少“)”(在“;”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(45): error C2143: 语法错误: 缺少“;”(在“<”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(45): error C4430: 缺少类型说明符 - 假定为 int。注意: C++ 不支持默认 int
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(45): error C2143: 语法错误: 缺少“;”(在“++”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(45): error C2086: “int i”: 重定义
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(45): note: 参见“i”的声明
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(45): error C2059: 语法错误:“)”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(46): error C2059: 语法错误:“for”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(46): error C2143: 语法错误: 缺少“)”(在“;”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(46): error C2143: 语法错误: 缺少“;”(在“<”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(46): error C4430: 缺少类型说明符 - 假定为 int。注意: C++ 不支持默认 int
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(46): error C2086: “int i”: 重定义
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(45): note: 参见“i”的声明
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(46): error C2143: 语法错误: 缺少“;”(在“++”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(46): error C2059: 语法错误:“)”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(46): error C2143: 语法错误: 缺少“;”(在“{”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(46): error C2447: “{”: 缺少函数标题(是否是老式的形式表?)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(95): error C2059: 语法错误:“}”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(95): error C2143: 语法错误: 缺少“;”(在“}”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(98): error C2059: 语法错误:“while”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(98): error C2143: 语法错误: 缺少“;”(在“{”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(98): error C2447: “{”: 缺少函数标题(是否是老式的形式表?)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(117): error C2059: 语法错误:“return”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(118): error C2059: 语法错误:“}”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(118): error C2143: 语法错误: 缺少“;”(在“}”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(121): error C2059: 语法错误:“}”
d:\programing\ml\thulac\thulac-master\include\cb_decoder.h(121): error C2143: 语法错误: 缺少“;”(在“}”的前面)
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(13): error C2143: syntax error: missing ';' before '{'
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(13): error C2447: '{': missing function header (old-style formal list?)
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(94): error C2653: 'TaggingDecoder': is not a class or namespace name
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(94): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(95): error C2355: 'this': can only be referenced inside non-static member functions or non-static data member initializers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(95): error C2227: left of '->separator' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(96): error C2355: 'this': can only be referenced inside non-static member functions or non-static data member initializers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(96): error C2227: left of '->max_length' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(97): error C2355: 'this': can only be referenced inside non-static member functions or non-static data member initializers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(97): error C2227: left of '->sequence' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(97): error C2227: left of '->max_length' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(97): error C3078: array size must be specified in new expressions
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(98): error C2355: 'this': can only be referenced inside non-static member functions or non-static data member initializers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(98): error C2227: left of '->allowed_label_lists' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(98): error C2227: left of '->max_length' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(98): error C3078: array size must be specified in new expressions
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(99): error C2065: 'pocs_to_tags': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(101): error C2065: 'ngram_feature': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(105): error C2065: 'dat': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(106): error C2065: 'is_old_type_dat': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(108): error C2065: 'nodes': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(108): error C2355: 'this': can only be referenced inside non-static member functions or non-static data member initializers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(108): error C2227: left of '->max_length' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(108): error C3078: array size must be specified in new expressions
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(110): error C2355: 'this': can only be referenced inside non-static member functions or non-static data member initializers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(110): error C2227: left of '->label_trans' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(111): error C2065: 'label_trans_pre': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(112): error C2065: 'label_trans_post': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(115): error C2355: 'this': can only be referenced inside non-static member functions or non-static data member initializers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(115): error C2227: left of '->tag_size' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(118): error C2355: 'this': can only be referenced inside non-static member functions or non-static data member initializers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(118): error C2227: left of '->model' must point to class/struct/union/generic type
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(120): error C2065: 'alphas': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(121): error C2065: 'betas': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(123): warning C4508: 'TaggingDecoder': function should return a value; 'void' return type assumed
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(124): error C2653: 'TaggingDecoder': is not a class or namespace name
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(124): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(124): error C2084: function 'int TaggingDecoder(void)' already has a body
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(94): note: see previous definition of 'TaggingDecoder'
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(125): error C2065: 'sequence': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(125): error C2541: 'delete': cannot delete objects that are not pointers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(126): error C2065: 'allowed_label_lists': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(126): error C2541: 'delete': cannot delete objects that are not pointers
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(130): error C2065: 'max_length': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(131): error C2065: 'nodes': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(131): error C2228: left of '.predecessors' must have class/struct/union
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(132): error C2065: 'nodes': undeclared identifier
d:\programing\ml\thulac\thulac-master\include\cb_tagging_decoder.h(132): error C2228: left of '.successors' must have class/struct/union