Comments (9)

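xuwenbin commented on July 16, 2024

I just tried cppjieba's own online demo page. Entering “美国纽约” (New York, USA) produces a single token, “美国纽约”; entering “美国芝加哥” (Chicago, USA) produces two tokens, “美国” and “芝加哥”. So it looks like the problem is in cppjieba itself...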

from nodejieba.

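yanyiwu commented on July 16, 2024

That is the expected behavior; there may be a small misunderstanding here.
The dictionary contains the entry 【美国纽约】, so 美国纽约 is segmented as a single word.
There is no entry for 【美国芝加哥】, but there are entries for 【美国】 and 【芝加哥】.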

xuwenbin commented on July 16, 2024

Why is “美国纽约” a standalone dictionary entry while “美国芝加哥” is not?

A side question: where does the content of nodejieba's bundled dictionary come from, and how is it updated?

yanyiwu commented on July 16, 2024

The default dictionary files are defined at https://github.com/yanyiwu/nodejieba/blob/master/index.js#L2

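xuwenbin commented on July 16, 2024

Thanks!

I looked through the dictionary and found that both “美国纽约” and “纽约” are present, and that “纽约” has a higher weight than “美国纽约”. In that case, it would be best if jieba's segmentation algorithm compared the two possibilities and picked the path with the higher weight.

I have not read through the jieba code, so the view above may be wrong or incomplete. My apologies to the jieba author if so!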

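yanyiwu commented on July 16, 2024

There are two candidate segmentations: 【美国纽约】 as one word, or 【美国】【纽约】 as two words.
Because the combined weight of 【美国】【纽约】 is lower than the weight of 【美国纽约】, 【美国纽约】 is chosen.
For details of the segmentation algorithm, see the related documentation, for example: http://www.thinkface.cn/thread-1303-1-1.html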
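
To make the comparison of the two candidate segmentations concrete, here is a minimal Python sketch of a dictionary-based "best path" segmenter. It assumes each word's weight is the log of its relative frequency and that a path's weight is the sum of its word weights; the thread does not spell out cppjieba's exact scoring, so treat this as an illustration only. The frequencies, `TOTAL`, and the `segment` function are hypothetical and are not taken from the real dictionary or from the nodejieba API.

```python
import math

# Hypothetical toy frequencies; NOT taken from the real jieba dictionary.
FREQ = {"美国": 5000, "纽约": 1000, "美国纽约": 50, "芝加哥": 500}
TOTAL = 1_000_000  # assumed corpus size used to normalize frequencies


def segment(text, freq=FREQ, total=TOTAL):
    """Return the segmentation whose summed log-probability is highest."""
    n = len(text)
    # best[i] = (score, end_index_of_first_word) for the suffix text[i:]
    best = [(-math.inf, n)] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Unknown single characters get frequency 1 so a path always exists;
            # unknown multi-character strings are not allowed as words.
            f = freq.get(word, 1 if j == i + 1 else 0)
            if f == 0:
                continue
            score = math.log(f / total) + best[j][0]
            if score > best[i][0]:
                best[i] = (score, j)
    # Walk the recorded split points to rebuild the best path.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


print(segment("美国纽约"))    # ['美国纽约']
print(segment("美国芝加哥"))  # ['美国', '芝加哥']
```

With these toy numbers, removing the 【美国纽约】 entry from the dictionary makes the same function return 【美国】【纽约】, which matches the behavior described for 【美国芝加哥】.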

xuwenbin commented on July 16, 2024

How are the weights determined?

In everyday usage, 【纽约】 is more common than 【美国纽约】.

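yanyiwu commented on July 16, 2024

The more words a segmentation combines, the more its weight decays, which is why a long word tends to be chosen over a sequence of shorter ones. If you went purely by how common each piece is, 【打】 and 【的】 are also more common than 【打的】.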
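
The decay can be illustrated with a little arithmetic. The sketch below assumes a path's weight is the sum of the log relative frequencies of its words (equivalently, the product of their probabilities). Every term is negative, so each extra word lowers the total. The frequencies, `TOTAL`, and `path_score` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy frequencies; NOT taken from the real jieba dictionary.
FREQ = {"美国": 5000, "纽约": 1000, "美国纽约": 50}
TOTAL = 1_000_000  # assumed corpus size


def path_score(words, freq=FREQ, total=TOTAL):
    """Sum of log relative frequencies for one candidate segmentation."""
    return sum(math.log(freq[w] / total) for w in words)


one_word = path_score(["美国纽约"])
two_words = path_score(["美国", "纽约"])

# 纽约 alone is far more frequent than 美国纽约 ...
print(FREQ["纽约"] > FREQ["美国纽约"])
# ... yet the two-word path pays for both 美国 and 纽约, so it scores lower.
print(one_word > two_words)
```

Under these assumptions, the relevant comparison is never 【纽约】 against 【美国纽约】 alone, but the whole path 【美国】【纽约】 against 【美国纽约】. With different frequencies the two-word path could win, so a long word is preferred only when its own frequency is not too low.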

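xuwenbin commented on July 16, 2024

The frequency of single characters and the frequency of common phrases should really be treated differently. 【打】 and 【的】 do occur very often, but as standalone characters they carry no meaning. 【纽约】 and 【美国纽约】, by contrast, are both meaningful terms, so deciding purely by word frequency seems a bit mechanical here. Of course, I know this is a hard problem. So I would like to know: does jieba plan (or already have a plan) to periodically adjust the dictionary based on how often common words and entries appear online?

PS: for now, my only workaround is to delete 【美国纽约】 from the dictionary.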
