zheng5yu9 / unsupervised_extract_detect_words

multiprocess unsupervised chinese_detect_words ngram_combination

Python 100.00%
pmi mutual-information entropy multiprocessing ngram detect unsupervised-learning segment recursive hotword-detection

unsupervised_extract_detect_words's Introduction

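1. Approach: this project borrows the idea from an earlier blog post that mined new words from Renren (人人网) data, and improves and optimizes it.

2. Original approach: segment the documents with jieba, take every 3 adjacent words as a group, compute the left entropy, right entropy, and internal cohesion of two words, derive a score from these values, and select new words by score.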
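
As a rough illustration of the original approach, the sketch below computes left entropy, right entropy, and pointwise mutual information (used here as the cohesion measure) for adjacent word pairs. It assumes the text is already tokenized (for example with `jieba.lcut`). The function names and the way the three values are combined into one score are hypothetical and are not taken from this repository.

```python
import math
from collections import Counter, defaultdict


def entropy(neighbor_counts):
    """Shannon entropy (natural log) of a neighbor-frequency Counter."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbor_counts.values())


def score_adjacent_pairs(tokens):
    """Return left/right entropy, PMI, and a combined score for each adjacent pair."""
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    left = defaultdict(Counter)   # pair -> counts of the word just before it
    right = defaultdict(Counter)  # pair -> counts of the word just after it

    # Each window of 3 adjacent words (a, b, c) yields a right neighbor c
    # for the pair (a, b) and a left neighbor a for the pair (b, c).
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        right[(a, b)][c] += 1
        left[(b, c)][a] += 1

    results = {}
    for pair, count in pairs.items():
        p_pair = count / (n - 1)
        p_first = unigrams[pair[0]] / n
        p_second = unigrams[pair[1]] / n
        pmi = math.log(p_pair / (p_first * p_second))
        left_h = entropy(left[pair])
        right_h = entropy(right[pair])
        results[pair] = {
            "left_entropy": left_h,
            "right_entropy": right_h,
            "pmi": pmi,
            # Hypothetical combination: cohesion weighted by the weaker boundary.
            "score": pmi * min(left_h, right_h),
        }
    return results


# Tokens would normally come from a segmenter; this list is made up.
tokens = ["重大", "疾病", "保险", "重大", "疾病", "范围", "本", "重大", "疾病"]
for pair, stats in score_adjacent_pairs(tokens).items():
    print(pair, stats)
```

A pair that always appears with the same neighbor gets zero entropy on that side, so under this combination it scores zero no matter how cohesive it is.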

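3. Optimizations:

      1. The original could only combine two words; this is generalized to combining N adjacent words.

      2. The internal mutual information (the cohesion measure) is normalized to its equivalent value for a length of one word, so that candidates of different lengths can be compared on the same scale.

      3. Multiprocessing is used to speed up execution.

      4. A filtering mechanism is added, which filters candidates by stop words, high-frequency common words, and similar criteria.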
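
The first, second, and fourth optimizations can be sketched as follows. This is a minimal illustration, not the repository's code: the function names are hypothetical, dividing the log ratio by the number of words is only one plausible way to normalize cohesion to a single-word length, and dropping every n-gram that contains a stop word is an assumed filtering rule.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of adjacent tokens for n = 1 .. max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def normalized_cohesion(ngram, counts, total_tokens):
    """Log ratio of joint to independent probability, divided by n-gram length.

    Dividing by len(ngram) is an assumption made for this sketch so that
    2-grams, 3-grams, etc. land on a comparable per-word scale.
    """
    n = len(ngram)
    p_joint = counts[ngram] / (total_tokens - n + 1)
    p_independent = math.prod(counts[(w,)] / total_tokens for w in ngram)
    return math.log(p_joint / p_independent) / n


def candidate_ngrams(tokens, max_n, stopwords):
    """Score all n-grams (n >= 2) that contain no stop word."""
    counts = ngram_counts(tokens, max_n)
    total = len(tokens)
    results = {}
    for ngram in counts:
        if len(ngram) < 2:
            continue
        # Assumed filter: skip any candidate containing a stop word.
        if any(w in stopwords for w in ngram):
            continue
        results[ngram] = normalized_cohesion(ngram, counts, total)
    return results


# Made-up tokens and stop-word list for illustration only.
tokens = ["基本", "日常生活", "活动", "的", "基本", "日常生活", "活动"]
ranked = sorted(candidate_ngrams(tokens, 3, {"的"}).items(),
                key=lambda kv: kv[1], reverse=True)
for ngram, value in ranked:
    print("_" + "_".join(ngram), value)
```

Because every candidate is scored per word, a 3-word candidate and a 2-word candidate can be sorted in the same list, which matches the mixed-length ranking shown in the results.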

4. Entry file: segment_multi.py

Run with: python segment_multi.py

Parameters are configured in: configs.py

5. Sample results

('_重大_疾病', 0.017789747314352424)

('_保障_范围', 0.015639743403053734)

('_本_公司', 0.014212133249451173)

('_完全_丧失', 0.013672071599779227)

('_意外_伤害', 0.010722245979224557)

('_明确_诊断', 0.009062853195861094)

('_日常生活_活动', 0.008990786509666062)

('_六项_基本_日常生活', 0.008813957372202039)

('_基本_日常生活', 0.008694797110512052)

('_基本_日常生活_活动', 0.008671016020472998)

('_保险_事故', 0.008504469334120192)

('_六项_基本_日常生活_活动', 0.008471400808888209)

('_能力_完全_丧失', 0.008404916576493579)

('_全部_条件', 0.008136980840438046)

('_无法_独立', 0.008091270307811042)

('_满足_下列_全部_条件', 0.008055553080109046)    
    
('_现金_价值', 0.007895715475057304)
