BUG: Ignored whitespace (node-segment, OPEN)

leizongmin commented on September 4, 2024
BUG: Ignored whitespace

from node-segment.

Comments (10)

leizongmin commented on September 4, 2024

Sorry, it took me this long to see this.

leizongmin commented on September 4, 2024

I suggest splitting the input with split(/\s+/) before segmenting.

hotoo commented on September 4, 2024

Is this not going to be fixed?

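leizongmin commented on September 4, 2024

1. The reason 一 一 gets merged automatically: whitespace is not recorded during segmentation, so the result for 一 一 is ['一', '一'], and an optimization that is enabled by default then merges the two adjacent numerals.
2. If this is to be fixed, the simplest approach for now is to split the input once with split(/\s+/) before segmenting.
3. If I fix it directly in the segment module, I need to think further about whether the change would affect existing programs.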
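
The workaround in point 2 can be sketched roughly as follows. This is only an illustration: `toySegment` is a hypothetical stand-in that mimics the behavior described in point 1 (whitespace dropped, adjacent numerals merged) and is not the real node-segment implementation; `segmentBySpaces` is likewise an assumed helper name.

```javascript
// Hypothetical stand-in for a segmenter: it drops whitespace and then merges
// adjacent numerals. This is NOT node-segment's actual code.
const NUMERAL = /^[〇一二三四五六七八九十百千万]+$/;

function toySegment(text) {
  // Whitespace is not recorded, as described in point 1.
  const chars = Array.from(text).filter((ch) => !/\s/.test(ch));
  const words = [];
  for (const ch of chars) {
    const last = words[words.length - 1];
    if (last !== undefined && NUMERAL.test(last) && NUMERAL.test(ch)) {
      words[words.length - 1] = last + ch; // merge adjacent numerals
    } else {
      words.push(ch);
    }
  }
  return words;
}

// Workaround: split on whitespace first, then segment each chunk separately,
// so that words on either side of a space can never be merged.
function segmentBySpaces(text, segment) {
  return text
    .split(/\s+/)
    .filter((chunk) => chunk.length > 0)
    .flatMap((chunk) => segment(chunk));
}

const direct = toySegment('一 一');
const workaround = segmentBySpaces('一 一', toySegment);
console.log(direct, workaround);
```

With the toy segmenter, the direct call merges the two numerals into one word, while the split-first variant keeps them apart. Note that the spaces themselves are still absent from both results.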

leizongmin commented on September 4, 2024

Fixed. Please use v0.0.5.

hotoo commented on September 4, 2024

👍 I suggest tagging the repository and managing versions with milestones and releases.

leizongmin commented on September 4, 2024

OK, I have added the tag "v0.0.5".
On Node.js, you can install a specific version directly with npm install [email protected]

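hotoo commented on September 4, 2024

There is still a problem; I suggest reopening.

Version 0.0.5 treats whitespace as a segmentation delimiter, but the final result still omits the whitespace characters themselves:

segment.doSegment("a a")
// Actual output:
[ { w: 'a', p: 16 }, { w: 'a', p: 16 } ]
// The correct result should be:
[ { w: 'a', p: 16 }, { w: ' ', p: 16 }, { w: 'a', p: 16 } ]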
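
One way a caller could get whitespace tokens without changing the segmenter is to split on a capturing group, which makes String.prototype.split keep the separators. The sketch below is only an assumption about how that might look: `segmentKeepingWhitespace` and `fakeSegment` are hypothetical names, the stand-in segmenter just returns each chunk as a single word, and the `p` field from the output above is omitted.

```javascript
// Sketch: preserve whitespace as tokens by splitting on a capturing group.
// `segmentChunk` is any function mapping a whitespace-free string to an
// array of { w } tokens; the real segmenter would be plugged in here.
function segmentKeepingWhitespace(text, segmentChunk) {
  const result = [];
  for (const chunk of text.split(/(\s+)/)) {
    if (chunk === '') continue; // split yields '' at leading/trailing whitespace
    if (/^\s+$/.test(chunk)) {
      result.push({ w: chunk }); // keep the whitespace run as its own token
    } else {
      result.push(...segmentChunk(chunk));
    }
  }
  return result;
}

// Hypothetical stand-in segmenter: treats each chunk as one word.
const fakeSegment = (chunk) => [{ w: chunk }];

const tokens = segmentKeepingWhitespace('a a', fakeSegment);
console.log(tokens);
```

A run of several whitespace characters becomes a single token here; whether it should instead be one token per character is a design choice the thread does not settle.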

leizongmin commented on September 4, 2024

This is not a bug. By design, the segmentation result automatically drops "useless" whitespace characters.

I'm not sure whether it is necessary to keep these spaces.

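hotoo commented on September 4, 2024

When a program processes text, whitespace is part of the content and should not be discarded.
In pinyin processing, a segmentation module that ignores whitespace leads to inconsistent output:

han = "a a";
py = pinyin(han);
// If the segmentation module drops whitespace:
py === "aa";
// The correct result should be:
py === "a a";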
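
The requirement argued for here can be stated as a round-trip property: concatenating the `w` fields of the tokens should reproduce the input exactly. A minimal check, assuming tokens shaped like the doSegment output quoted earlier in the thread (`isLossless` is a hypothetical helper name, not part of any library):

```javascript
// Returns true when the tokens, joined in order, reproduce the original text.
function isLossless(text, tokens) {
  return tokens.map((token) => token.w).join('') === text;
}

// Token lists taken from the earlier comment in this thread.
const withoutSpace = [{ w: 'a', p: 16 }, { w: 'a', p: 16 }];
const withSpace = [{ w: 'a', p: 16 }, { w: ' ', p: 16 }, { w: 'a', p: 16 }];

console.log(isLossless('a a', withoutSpace)); // false: the space was dropped
console.log(isLossless('a a', withSpace));    // true
```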
