
Comments (5)

deepcs233 commented on August 15, 2024

Sorry about that. I forgot to test for consistency after this fix. I will fix it as soon as possible.

from jieba_fast.

savourylie commented on August 15, 2024

I see! The difference is small, so I don't think it will cause much of a problem. Thank you so much!


deepcs233 commented on August 15, 2024

@savourylie The bug is fixed. It was caused by the difference between the string systems of py3 and py2.
The test results after the fix are as follows:
Test set: 西遊記
Number of unique jieba words: 31095
Number of unique jieba_fast words: 31136
Number of unique words in intersection: 31080
Number of unique words in union: 31151
IOU (intersection over union): 0.9977207794292318
jieba: 诗|曰|:| | |混沌|未|分|天地|乱|,|茫茫|渺渺|无人|见|。| |自从|盘古|破鸿|濛|,|开辟|从兹|清浊|辨|。| |覆载|群生|仰至仁|,|发明|万物|皆|成善|。| |欲知|造化|会元功|,|须|看|西游|释厄传|。| |盖闻|天地|之数|,|有|十二万|九千|六|百岁|为|一元|。|将|一元|分为|十二|会|,|乃子|、|丑|、|寅|、|卯|、|辰|、|巳|、|午|、|未|、|申|、|酉|、|戌|、|亥|之|十二支|也|。|每会|该|一万八|百岁|。|且|就
jieba_ft: 诗|曰|:| | |混沌|未|分|天地|乱|,|茫茫|渺渺|无人|见|。| |自从|盘古|破鸿|濛|,|开辟|从兹|清浊|辨|。| |覆载|群生|仰至仁|,|发明|万物|皆|成善|。| |欲知|造化|会元功|,|须|看|西游|释厄传|。| |盖闻|天地|之数|,|有|十二万|九千|六|百岁|为|一元|。|将|一元|分为|十二|会|,|乃子|、|丑|、|寅|、|卯|、|辰|、|巳|、|午|、|未|、|申|、|酉|、|戌|、|亥|之|十二支|也|。|每会|该|一万八|百岁|。|且|就
jieba_time: 3.228s
jieba_fast_time: 1.857s
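
The counts above can be reproduced along the following lines. This is a minimal sketch, not the script actually used in this thread: `vocab_overlap` is a hypothetical helper, and the token lists would normally come from something like `jieba.lcut(text)` and `jieba_fast.lcut(text)` (an assumption; here they are replaced by small hand-written lists so the snippet runs without either package).

```python
def vocab_overlap(tokens_a, tokens_b):
    """Compare the unique words produced by two segmenters."""
    words_a, words_b = set(tokens_a), set(tokens_b)
    intersection = words_a & words_b
    union = words_a | words_b
    return {
        "unique_a": len(words_a),
        "unique_b": len(words_b),
        "intersection": len(intersection),
        "union": len(union),
        # float() keeps the ratio fractional on Python 2 as well.
        "iou": float(len(intersection)) / len(union) if union else 1.0,
    }


# Hand-written stand-ins for the two segmenters' output (illustrative only).
tokens_jieba = ["混沌", "未", "分", "天地", "乱"]
tokens_fast = ["混沌", "未", "分", "天地乱"]

stats = vocab_overlap(tokens_jieba, tokens_fast)
print(stats)
```

An IOU close to 1 means the two segmenters produce nearly the same vocabulary on the test text.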


deepcs233 commented on August 15, 2024

['鲯把总', '狼虫諕倒的', '不諕倒那', '諕得', '諕得那多官尽皆', '是虎諕我也', '諕得老', '諕了人', '諕得他', '諕人', '諕倒了老', '穵蛤', '掬哷哷行', '諕得众猴', '已此諕', '諕怕了', '諕得那', '諕得我', '諕得打了', '諕得个庞', '草纥繨', '蒿苦藚', '紾掠了', '就諕得', '吃了諕', '諕得那洞里群魔都', '諕得众僧跑了', '諕得个伯钦', '諕得个老', '鲟鲯追白蟮', '諕得斤力', '諕得他手', '都諕得', '諕得那狼虫颠窜', '諕得各洞妖王都闭户', '尾子趬了一趬', '諕得多官', '又諕得', '諕了', '諕死', '諕了一跌', '被他諕怕了', '諕了魂', '諕得都', '諕杀了也', '諕得个', '諕走龟鼍', '諕倒了', '諕得这', '你若諕了我的', '弹打鋋罗双', '諕走了三魂', '諕杀我也', '就諕杀我也', '虎諕我', '諕得脚软身麻']
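These are the places where the segmentation differs. They all involve rare characters, so I suspect the cause lies in low-level details such as unicode handling, which makes it hard to debug.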
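
A list like the one above could be produced with a set difference. The sketch below assumes Python 3; `words_only_in` and `code_points` are hypothetical helpers, and the token lists are hand-written stand-ins rather than real segmenter output. Printing code points is one way to inspect which characters are involved when a unicode-related cause is suspected.

```python
def words_only_in(tokens_a, tokens_b):
    """Return the sorted unique words that appear in tokens_a but not in tokens_b."""
    return sorted(set(tokens_a) - set(tokens_b))


def code_points(word):
    """Render each character as U+XXXX so rare characters are easy to look up."""
    return ["U+%04X" % ord(ch) for ch in word]


# Hand-written stand-ins for the two segmenters' output (illustrative only).
tokens_jieba = ["諕", "得", "那"]
tokens_fast = ["諕得", "那"]

for word in words_only_in(tokens_fast, tokens_jieba):
    print(word, code_points(word))
```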


deepcs233 commented on August 15, 2024

Number of unique jieba words: 31115
Number of unique jieba_fast words: 31160
Number of unique words in intersection: 31103
Number of unique words in union: 31172
IOU (intersection over union): 0
These are the results I got when running under py2.
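
One plausible explanation for the IOU of 0, not confirmed in this thread: on Python 2, `/` between two integers performs floor division, so any intersection smaller than the union yields 0. The sketch below runs on Python 3 and uses `//` to mimic that behaviour with the counts reported above.

```python
intersection, union = 31103, 31172  # counts reported in the py2 run above

# Floor division: what `/` does for two ints on Python 2.
print(intersection // union)

# True division: what `/` does on Python 3, or on Python 2 after
# `from __future__ import division`.
iou = intersection / union
print(iou)
```

If this is the cause, adding `from __future__ import division` at the top of the test script, or converting one operand with `float()`, would give a fractional IOU under py2 as well.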

