huichen / sego

Chinese word segmentation in Go

License: Other

Go 93.67% HTML 6.33%

sego's Issues

Is there a C++ implementation that uses an algorithm similar to sego's?

sego adds many extra command-line flags to packages that import it

Hi huichen,
I noticed by accident that importing sego adds many extra command-line flags to the importing package. These flags come from the testing package.
A code example:

//test.go
package main

import (
    "flag"
    "fmt"

    _ "github.com/huichen/sego"
)

var logfile string

func usage() {
    flag.PrintDefaults()
}

func init() {
    flag.StringVar(&logfile, "logfile", "./log/log.txt", "the log file path")
}
func main() {
    flag.Usage = usage
    flag.Parse()
    fmt.Println("app run ok")
}

Run the app:

$go run test.go -help
  -logfile string
        the log file path (default "./log/log.txt")
  -test.bench string
        regular expression per path component to select benchmarks to run
  -test.benchmem
        print memory allocations for benchmarks
  -test.benchtime duration
        approximate run time for each benchmark (default 1s)
  -test.blockprofile string
        write a goroutine blocking profile to the named file after execution
  -test.blockprofilerate int
        if >= 0, calls runtime.SetBlockProfileRate() (default 1)
  -test.count n
        run tests and benchmarks n times (default 1)
  -test.coverprofile string
        write a coverage profile to the named file after execution
  -test.cpu string
        comma-separated list of number of CPUs to use for each test
  -test.cpuprofile string
        write a cpu profile to the named file during execution
  -test.memprofile string
        write a memory profile to the named file after execution
  -test.memprofilerate int
        if >=0, sets runtime.MemProfileRate
  -test.outputdir string
        directory in which to write profiles
  -test.parallel int
        maximum test parallelism (default 4)
  -test.run string
        regular expression to select tests and examples to run
  -test.short
        run smaller test suite to save time
  -test.timeout duration
        if positive, sets an aggregate time limit for all tests
  -test.trace string
        write an execution trace to the named file after execution
  -test.v
        verbose: print additional output
exit status 2

After some debugging, I found the cause: test_utils.go in sego imports "testing". This is not a good experience for sego users.

A simple fix is to rename test_utils.go to utils_test.go. The compiler then ignores that file when building the package outside of tests.
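Until such a rename lands, an importing program can sidestep the problem on its own side. The following is a minimal sketch, not part of sego: it parses arguments with a private flag.FlagSet instead of the global flag.CommandLine, so flags registered by other packages do not show up in the usage text. The function name parseArgs and the FlagSet name "app" are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseArgs parses args with a private FlagSet. Flags that imported packages
// register on the global flag.CommandLine (such as -test.*) are not part of
// this set, so they are neither accepted nor printed by -help.
func parseArgs(args []string) (logfile string, err error) {
	fs := flag.NewFlagSet("app", flag.ContinueOnError)
	fs.StringVar(&logfile, "logfile", "./log/log.txt", "the log file path")
	err = fs.Parse(args)
	return logfile, err
}

func main() {
	logfile, err := parseArgs(os.Args[1:])
	if err != nil {
		// fs.Parse has already printed the error and usage to stderr.
		os.Exit(2)
	}
	fmt.Println("app run ok, logfile:", logfile)
}
```

With this approach, running the program with -help lists only -logfile, regardless of what the imported packages register globally.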

The example seems wrong: "中华人民共和国**人民政府" is segmented as a single word

中华人民共和国**人民政府 is segmented as a single word:

中华人民共和国**人民政府/nt

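Bug: when the dictionary contains only one keyword and that keyword is at the start of the sentence, it is not segmented as a word

Dictionary file contents:

张三 3 n

Program: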

package main

import (
	"fmt"
	"strings"

	"github.com/huichen/sego"
)

// segment returns the text segmented as "word/pos" pairs joined by spaces.
func segment(sgr *sego.Segmenter, text string) string {
	var words []string
	for _, sg := range sgr.Segment([]byte(text)) {
		token := sg.Token()
		words = append(words, fmt.Sprintf("%s/%s", token.Text(), token.Pos()))
	}
	return strings.Join(words, " ")
}

func main() {
	var sgr sego.Segmenter
	sgr.LoadDictionary("main.dic")

	// Keyword at the start of the sentence: not recognized.
	fmt.Println(segment(&sgr, "张三，你好啊"))
	// Observed: 张/x 三/x ，/x 你/x 好/x 啊/x

	// Keyword at the end of the sentence: recognized.
	fmt.Println(segment(&sgr, "你好啊，张三"))
	// Observed: 你/x 好/x 啊/x ，/x 张三/n
}

log.Printf("载入sego词典 %s", file)

Can this logging line be removed?
It prints to the console on every startup and gets in the way of normal debugging.

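Is there no way to get only the matched words?

For example, for 中华人民共和国 I only want 中华 segmented out, returning [中华] rather than [中华人民共和国].

Is sego's dictionary generated by some tool? I would like to extend the current dictionary.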

How can logging be disabled?

How can I suppress log output like the following?

2017/07/08 19:45:34 载入sego词典 dictionary.txt
2017/07/08 19:45:37 sego词典载入完毕

In segmenter.go the following calls are hard-coded. Could an option be provided to disable logging?

log.Printf("载入sego词典 %s", file)
log.Fatalf("无法载入字典文件 \"%s\" \n", file)
log.Println("sego词典载入完毕")

As for "无法载入字典文件" (unable to load the dictionary file): could LoadDictionary() return an error instead?

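Is pinyin segmentation supported?

Hello, does this package support pinyin segmentation? For example, can wo shi zhong guo ren be segmented as wo/shi/zhongguo/ren? What would be needed to support this? Thanks!

Can the custom toLower function be replaced with a standard library function?

segmenter.go lines 285-295 define a custom toLower function. Could it be replaced with strings.ToLower()?

Is there a way to load the dictionary from a filesystem abstraction such as packr or pkger?

That would make it easy to bundle the data into the binary. If this is acceptable, I can also submit a PR.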

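Newly added words are missing after reloading the dictionary

When Segmenter.LoadDictionary(filename) is called again, the Segmenter's address changes but the dictionary contents do not. After restarting the project, the dictionary loads normally and the new words are included. Does anyone know what the problem is?

Does this package support a user-defined dictionary?

Does this package support a user-defined dictionary?
I have a large dictionary of my own. Would it be OK to replace sego's default dictionary with it?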

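Special characters such as \x01 in the text cause incorrect segmentation

If the input is "测试\x01\x01分词输入",
the segmentation result is 测试/vn 分词/n 分词/n 输入/v.
The extra output appears to be related to the number of \x01 characters.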
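A caller-side workaround, which does not fix the underlying bug, is to sanitize the input before segmenting. The sketch below replaces control characters with spaces so that neighboring words stay separated. stripControl is a hypothetical helper, and whether the cleaned text then segments correctly has not been verified against sego.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripControl replaces control characters (such as \x01) with a space,
// leaving ordinary whitespace like newlines, tabs, and carriage returns alone.
func stripControl(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) && r != '\n' && r != '\t' && r != '\r' {
			return ' '
		}
		return r
	}, s)
}

func main() {
	cleaned := stripControl("测试\x01\x01分词输入")
	fmt.Printf("%q\n", cleaned)
}
```

The cleaned string can then be passed to the segmenter as []byte(cleaned).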

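The original text should be included in the segmentation result

For example, the phrase from the demo: **互联网历史上最大的一笔并购案
Result after segmentation: **/ns 互联/v 互联网/n 历史/n 上/f 最大/a 的/uj 一笔/m 并购/v 并购案/n
The original phrase should be appended to the result: **/ns 互联/v 互联网/n 历史/n 上/f 最大/a 的/uj 一笔/m 并购/v 并购案/n **互联网历史上最大的一笔并购案

Multilingual support in splitTextToWords

The splitTextToWords function in segmenter.go currently breaks all non-English text down into its smallest units.

Apart from CJK (Chinese, Japanese, Korean) and other East Asian languages, most languages are alphabetic like English and can be handled easily with the IsLetter and IsNumber functions from the unicode package. I therefore suggest changing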
_, size := utf8.DecodeRune(text[current:])
if size == 1 &&
(text[current] >= 'a' && text[current] <= 'z') ||
(text[current] >= 'A' && text[current] <= 'Z') ||
(text[current] >= '0' && text[current] <= '9') {

to
r, size := utf8.DecodeRune(text[current:])
if unicode.IsLetter(r) || unicode.IsNumber(r) {

With this change, sego could be used for essentially all languages.
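One detail matters when trying this out: unicode.IsLetter also returns true for Han characters, so a condition based on IsLetter alone would merge runs of Chinese text into a single unit. The following is a minimal sketch of the proposal with CJK scripts excluded. splitWords and isCJK are hypothetical names, and the function is a simplified stand-in, not sego's actual splitTextToWords.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isCJK reports whether r belongs to a script that is split per character
// in this sketch.
func isCJK(r rune) bool {
	return unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul)
}

// splitWords groups runs of non-CJK letters and digits into one unit and
// emits every other rune (CJK characters, spaces, punctuation) on its own.
func splitWords(text []byte) [][]byte {
	var out [][]byte
	start := -1 // start of the current letter/digit run, or -1 if none
	for i := 0; i < len(text); {
		r, size := utf8.DecodeRune(text[i:])
		if (unicode.IsLetter(r) || unicode.IsNumber(r)) && !isCJK(r) {
			if start < 0 {
				start = i
			}
		} else {
			if start >= 0 {
				out = append(out, text[start:i])
				start = -1
			}
			out = append(out, text[i:i+size])
		}
		i += size
	}
	if start >= 0 {
		out = append(out, text[start:])
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitWords([]byte("привет мир 中文 go1")))
}
```

In this sketch a Cyrillic word such as привет stays whole, while 中 and 文 are still emitted as separate units for the dictionary lookup to combine.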
