
Request for finer-grained segmentation results about nodejieba HOT 19 OPEN

hotoo commented on August 15, 2024
I would like finer-grained segmentation results.

from nodejieba.

Comments (19)

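yanyiwu commented on August 15, 2024

I understand what you need.
There is a segmentation mode (search-engine mode: QUERY) that comes close to it:
nodejieba.cut("南京市长江大桥", "QUERY");

It works like this: when a word is longer than a threshold, that word is segmented further into smaller pieces.
For example, with the threshold set to 3 (the project's current default is 4),
"南京市长江大桥" is segmented into ["南京市","长江","长江大桥","大桥"].

I am not sure whether this mode meets your needs.

PS: Documentation for the various segmentation modes will follow.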
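
The threshold rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not nodejieba's actual implementation: the tiny dictionary, the pre-segmented input, and the names `fullScan` and `querySplit` are all hypothetical.

```javascript
// Hypothetical toy dictionary; a real segmenter loads a far larger one.
const DICT = new Set(["南京市", "长江", "大桥", "长江大桥"]);

// Collect every dictionary word found inside `word`,
// ordered by start position and then by length.
function fullScan(word, dict) {
  const chars = Array.from(word);
  const found = [];
  for (let start = 0; start < chars.length; start++) {
    for (let end = start + 2; end <= chars.length; end++) {
      const piece = chars.slice(start, end).join("");
      if (dict.has(piece)) found.push(piece);
    }
  }
  return found;
}

// Words longer than `threshold` are expanded into their dictionary
// sub-words; shorter words pass through unchanged.
function querySplit(coarseWords, threshold, dict = DICT) {
  const out = [];
  for (const word of coarseWords) {
    if (Array.from(word).length > threshold) {
      const parts = fullScan(word, dict);
      out.push(...(parts.length > 0 ? parts : [word]));
    } else {
      out.push(word);
    }
  }
  return out;
}

// Assumes a coarse segmentation has already produced these two words.
console.log(querySplit(["南京市", "长江大桥"], 3));
// [ '南京市', '长江', '长江大桥', '大桥' ]
console.log(querySplit(["南京市", "长江大桥"], 4));
// [ '南京市', '长江大桥' ]
```

Because the long word is emitted together with its sub-words, the output overlaps by design, which suits search indexing but not a one-pass conversion of the text.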


hotoo avatar hotoo commented on August 15, 2024
  1. 结果中有重复出现词语,这个不是期望的,也不好处理。
  2. 期望的结果是以成语、词语、短语这个顺序为优先级进行切分。
    1. 是成语的以整个成语为一个词(成语一般 4 个字);
    2. 是词语的以整个词语为一个词(词语一般 2 个字);
    3. 是短语的以整个短语为一个词(短语一般是地名、人名、物名等,但是最好是可以选择长模式还是短模式,对于拼音来说,应该期望是短模式)。
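
One way to read the priority order in the list above is a greedy left-to-right matcher that tries an idiom dictionary first, then a word dictionary, then a phrase dictionary, and falls back to a single character. The sketch below is only an interpretation of the request: the three toy dictionaries and the names `matchAt` and `tieredCut` are hypothetical, and nodejieba does not necessarily work this way.

```javascript
// Hypothetical toy dictionaries, listed in priority order.
const TIERS = [
  new Set(["破釜沉舟", "叶公好龙"]), // idioms
  new Set(["南京", "长江", "大桥"]),  // words
  new Set(["南京市", "长江大桥"]),    // phrases
];
const MAX_LEN = 4; // longest entry considered in this sketch

// Longest entry of `dict` that starts at `pos`, or null if none matches.
function matchAt(chars, pos, dict) {
  for (let len = Math.min(MAX_LEN, chars.length - pos); len >= 2; len--) {
    const piece = chars.slice(pos, pos + len).join("");
    if (dict.has(piece)) return piece;
  }
  return null;
}

// Greedy cut: at each position the first tier that matches wins.
// Every character is consumed exactly once, so no word is repeated.
function tieredCut(text, tiers = TIERS) {
  const chars = Array.from(text);
  const out = [];
  let pos = 0;
  while (pos < chars.length) {
    let hit = null;
    for (const dict of tiers) {
      hit = matchAt(chars, pos, dict);
      if (hit !== null) break;
    }
    if (hit === null) hit = chars[pos]; // unknown character stands alone
    out.push(hit);
    pos += Array.from(hit).length;
  }
  return out;
}

console.log(tieredCut("破釜沉舟"));       // [ '破釜沉舟' ]
console.log(tieredCut("南京市长江大桥")); // [ '南京', '市', '长江', '大桥' ]
```

Putting words ahead of phrases corresponds to the "short mode" mentioned for pinyin; swapping the last two tiers would give a "long mode".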


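Honghe commented on August 15, 2024

Something like the demo below: using search segmentation, NLPIR splits 香港特别行政区 (Hong Kong Special Administrative Region) precisely into 3 words.
How can I achieve this result?

[image: nlpir]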


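yanyiwu commented on August 15, 2024

@Honghe This fine-grained segmentation feature is under development. In fact, I have planned to add it ever since this issue was labeled enhancement. I have just been busy with work lately and have not finished it yet.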


Honghe commented on August 15, 2024

Nice.
What algorithm will you use to implement it?


yanyiwu commented on August 15, 2024

@Honghe I am not planning to use any sophisticated algorithm; the main thing is to first impose some word-length limits at the engineering level.
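
To illustrate what a word-length limit does to the output, here is a minimal sketch. It uses plain forward maximum matching as a stand-in, which is an assumption made for brevity and not the algorithm behind nodejieba's MP mode; the toy dictionary and the name `maxMatchCut` are hypothetical.

```javascript
// Hypothetical toy dictionary for illustration only.
const DICT = new Set(["南京市", "长江", "大桥", "长江大桥", "沉舟", "破釜沉舟"]);

// Forward maximum matching that never emits a word longer than maxWordLen.
function maxMatchCut(text, maxWordLen, dict = DICT) {
  const chars = Array.from(text);
  const out = [];
  let pos = 0;
  while (pos < chars.length) {
    let hit = chars[pos]; // fall back to a single character
    for (let len = Math.min(maxWordLen, chars.length - pos); len >= 2; len--) {
      const piece = chars.slice(pos, pos + len).join("");
      if (dict.has(piece)) {
        hit = piece;
        break;
      }
    }
    out.push(hit);
    pos += Array.from(hit).length;
  }
  return out;
}

console.log(maxMatchCut("南京市长江大桥", 3)); // [ '南京市', '长江', '大桥' ]
console.log(maxMatchCut("南京市长江大桥", 4)); // [ '南京市', '长江大桥' ]
console.log(maxMatchCut("破釜沉舟", 3));       // [ '破', '釜', '沉舟' ]
console.log(maxMatchCut("破釜沉舟", 4));       // [ '破釜沉舟' ]
```

A limit of 3 forces every four-character entry apart, including idioms, while a limit of 4 keeps them whole.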


mthli commented on August 15, 2024

A classic test sentence...

[image: default]


hotoo commented on August 15, 2024

@yanyiwu How is this feature coming along?


yanyiwu commented on August 15, 2024

@hotoo The feature is basically done. I will find some time to polish it and then document it in the README.


hotoo commented on August 15, 2024

GOOD, I hope pinyin 3.0 can use this feature :)


yanyiwu commented on August 15, 2024

Latest version: npm install [email protected]
Usage:

var nodejieba = require("nodejieba");
console.log(nodejieba.cut("南京市长江大桥", "MP", 3));

Output:

[ '南京市', '长江', '大桥' ]


hotoo commented on August 15, 2024

I gave it a try. I have to say it still does not quite match what I expect. Classical idioms get forcibly split as well:

[ '破', '釜', '沉舟' ]
[ '叶公', '好', '龙' ]


hotoo commented on August 15, 2024

After changing the granularity to 4, it basically meets my needs. Nice.

[ '南京市', '长江大桥' ]
[ '南京', '长江大桥' ]
[ '九江', '长江大桥' ]
[ '武汉', '长江大桥' ]
[ '破釜沉舟' ]
[ '叶公好龙' ]
[ '香港', '特别', '行政区' ]


Honghe commented on August 15, 2024

It still differs from NLPIR; I wonder how NLPIR implements it.


willin commented on August 15, 2024

mark


willin commented on August 15, 2024
console.log({
  cut: nodejieba.cut(str, 'MP', 3),
  tag: nodejieba.tag(str),
  extract: nodejieba.extract(str, 5)
});

I tried every value from 1 to 5 for the numeric parameter, but when str is 开灯 (turn on the light), the result is always the same:

{ cut: [ '开灯' ],
  tag: [ { word: '开灯', tag: 'v' } ],
  extract: [ { word: '开灯', weight: 10.0294766411 } ] }

For 开卧室灯 (turn on the bedroom light), the segmentation result is:

{ cut: [ '开', '卧室', '灯' ],
  tag:
   [ { word: '开', tag: 'v' },
     { word: '卧室', tag: 'n' },
     { word: '灯', tag: 'n' } ],
  extract: [ { word: '卧室', weight: 8.20023407859 } ] }

How can I get two results returned? Not just for cut; tag and extract need the same results too.


yanyiwu commented on August 15, 2024
 nodejieba.cut(str, 'MP', 3),

This usage has been deprecated since version 2.0.


hotoo commented on August 15, 2024

What do you recommend using instead?


yanyiwu commented on August 15, 2024

@hotoo I recommend writing it this way: hotoo/pinyin#101
