sego's People

Contributors

adamzy, dtynn, huichen, kangjiehong, lowstz, xiaomm

sego's Issues

sego adds many extra command-line flags to packages that import sego

Hi huichen,
I noticed by accident that importing the sego package adds many extra command-line flags to the importing program. These flags come from the testing package.
A code example:

//test.go
package main

import (
    "flag"
    "fmt"

    _ "github.com/huichen/sego"
)

var logfile string

func usage() {
    flag.PrintDefaults()
}

func init() {
    flag.StringVar(&logfile, "logfile", "./log/log.txt", "the log file path")
}
func main() {
    flag.Usage = usage
    flag.Parse()
    fmt.Println("app run ok")
}

Run the app:

$ go run test.go -help
  -logfile string
        the log file path (default "./log/log.txt")
  -test.bench string
        regular expression per path component to select benchmarks to run
  -test.benchmem
        print memory allocations for benchmarks
  -test.benchtime duration
        approximate run time for each benchmark (default 1s)
  -test.blockprofile string
        write a goroutine blocking profile to the named file after execution
  -test.blockprofilerate int
        if >= 0, calls runtime.SetBlockProfileRate() (default 1)
  -test.count n
        run tests and benchmarks n times (default 1)
  -test.coverprofile string
        write a coverage profile to the named file after execution
  -test.cpu string
        comma-separated list of number of CPUs to use for each test
  -test.cpuprofile string
        write a cpu profile to the named file during execution
  -test.memprofile string
        write a memory profile to the named file after execution
  -test.memprofilerate int
        if >=0, sets runtime.MemProfileRate
  -test.outputdir string
        directory in which to write profiles
  -test.parallel int
        maximum test parallelism (default 4)
  -test.run string
        regular expression to select tests and examples to run
  -test.short
        run smaller test suite to save time
  -test.timeout duration
        if positive, sets an aggregate time limit for all tests
  -test.trace string
        write an execution trace to the named file after execution
  -test.v
        verbose: print additional output
exit status 2

After some debugging, I found the cause: test_utils.go in sego imports "testing". Because the file name does not end in _test.go, it is compiled into the regular package. This is not a good experience for sego users.

A simple fix is to rename test_utils.go to utils_test.go. The compiler then ignores the file when building the non-test package.

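Bug: when the dictionary contains a single keyword and that keyword starts the sentence, it is not segmented as a word

Dictionary file contents:

张三 3 n

Program: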

	var sgr sego.Segmenter
	sgr.LoadDictionary("main.dic")
	var words []string
	for _, sg := range sgr.Segment([]byte("张三,你好啊")) {
		token := sg.Token()
		words = append(words, fmt.Sprintf("%s/%s", token.Text(), token.Pos()))
	}
	fmt.Println(strings.Join(words, " "))
	// Prints: 张/x 三/x ,/x 你/x 好/x 啊/x
	
	words = words[:0]
	for _, sg := range sgr.Segment([]byte("你好啊,张三")) {
		token := sg.Token()
		words = append(words, fmt.Sprintf("%s/%s", token.Text(), token.Pos()))
	}
	fmt.Println(strings.Join(words, " "))
	// Prints: 你/x 好/x 啊/x ,/x 张三/n

How can logging be turned off?

How can I suppress log output like the following?

2017/07/08 19:45:34 载入sego词典 dictionary.txt
2017/07/08 19:45:37 sego词典载入完毕

In segmenter.go the following calls are hard-coded. Could there be an option to disable logging?

log.Printf("载入sego词典 %s", file)
log.Fatalf("无法载入字典文件 \"%s\" \n", file)
log.Println("sego词典载入完毕")

As for "无法载入字典文件" ("cannot load dictionary file"): could LoadDictionary() return an error instead?

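Is pinyin segmentation supported?

Hello, does this package support pinyin segmentation? For example, can wo shi zhong guo ren be segmented as wo/shi/zhongguo/ren? If not, what would it take to support this? Thanks!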

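An issue with mixed Chinese and English segmentation

For example, the dictionary contains

media视频

The text to segment is mediamedia视频

The segmentation result is

[mediamedia 视 频] // the ideal result would be [media media视频]

Is this behavior expected?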

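Dictionary text file

Is it possible to customize the segmentation? Also, in the dictionary file, what do the three fields mean?

How do I define a custom dictionary? Are there any rules?

I would like to define a custom dictionary. Looking at the dictionary contents, what do the second and third columns mean?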

AA制 3 n
AB型 3 n
AT&T 3 nz
...
二里頭 277 nrt
肾上腺 278 l
...
純天然 28 b
纯天然 28 b
挨个儿 28 d
...

How should I define a custom dictionary?
For example, in 多少本书 I would like "多少本" to be treated as a single word.

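About concurrency

Can a single global sego.Segmenter struct be shared across a service for segmentation?

segmenter.go, line 185

    // When the current character has no matching token, add a pseudo-token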
    if numTokens == 0 || len(tokens[0].text) > 1 {
        updateJumper(&jumpers[current], baseDistance,
            &Token{text: []Text{text[current]}, frequency: 1, distance: 32, pos: "x"})
    }

The comment and the code seem inconsistent?

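Segmentation should add the original text to the result

For example, the text from the demo: **互联网历史上最大的一笔并购案
Result after segmentation: **/ns 互联/v 互联网/n 历史/n 上/f 最大/a 的/uj 一笔/m 并购/v 并购案/n
The original text should be added to the result: **/ns 互联/v 互联网/n 历史/n 上/f 最大/a 的/uj 一笔/m 并购/v 并购案/n **互联网历史上最大的一笔并购案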

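Multilingual support in splitTextToWords

The splitTextToWords function in segmenter.go currently splits all non-English text into its smallest units.

Apart from CJK (Chinese, Japanese, Korean) and other East Asian languages, most languages are alphabetic like English and can be handled easily with the IsLetter and IsNumber functions from the unicode package. I therefore suggest changing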
_, size := utf8.DecodeRune(text[current:])
if size == 1 &&
(text[current] >= 'a' && text[current] <= 'z') ||
(text[current] >= 'A' && text[current] <= 'Z') ||
(text[current] >= '0' && text[current] <= '9') {

to
r, size := utf8.DecodeRune(text[current:])
if unicode.IsLetter(r) || unicode.IsNumber(r) {

With this change sego could be used for essentially all languages.

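Person-name recognition

For example:
左/m 珊/n 凤/nr
王军/nr 虎/n
Names like these are broken into single characters. Could this be improved?

After all, person names cannot be handled by the dictionary alone; the algorithm needs to account for them.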

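High memory usage

Memory usage is fairly high. Are there plans to optimize it, for example with a more compact data structure? Currently a few hundred MB are typically consumed right after startup.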

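There is a novel in testdata

I meant to study word segmentation properly, but testdata contains a novel, so I spent the whole morning reading it instead.
Reading gets in the way of studying...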

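Fine-grained segmentation in search mode

Example:
In search mode, "13亿美元" is segmented as "13/x 亿/m 亿美元/m", while the expected result is "13/x 美元/m 亿美元/m".

Looking at the code, the cause may be line 120 of segmenter.go:
token.segments[iSegmentsToAdd] = &segments[iSegmentsToAdd]
Should it be changed to
token.segments[iSegmentsToAdd] = &segments[iToken] ?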

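The dictionary does not support entries containing spaces

Dictionary loading in LoadDictionary seems to have a few problems:
1 It accepts entries without a part-of-speech tag. (Why is that needed? Requiring every entry to follow the "text frequency part-of-speech" format would avoid many problems.)
2 Entries whose text contains spaces are not supported.
For example, the dictionary entry my darling 10 n fails to parse.
3 log.Fatal can terminate the calling application outright when it is reached. A third-party package should avoid functions like Fatal; returning the error to the caller would probably be better.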
