open_source_tools_for_entity_recognization's Introduction

Source data processing

All of the tools here take their input from the same text preprocessing script: problemtext_processor.py.

ltp

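GitHub repository. LTP is an open-source tool provided by Harbin Institute of Technology (HIT).

Setup

See the reference. Installing with pip fails, so install from source instead. The error: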

patch/libs/python/src/converter/builtin_converters.cpp:51:35: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
       return PyUnicode_Check(obj) ? _PyUnicode_AsString(obj) : 0;

error: command 'gcc' failed with exit status 1

For a fix, see the reference.

Structure

  • test_ltp.py is the test script
  • test_ltp/result.txt holds part of the test results

Technical principles

Paper: Wanxiang Che, Zhenghua Li, Ting Liu. LTP: A Chinese Language Technology Platform. In Proceedings of the Coling 2010: Demonstrations. 2010.08, pp. 13-16, Beijing, China.

  • Model: maximum entropy model, proposed by Berger et al. in 1996 (A Maximum Entropy Approach to Natural Language Processing).
  • Corpus: mainly a corpus of everyday language
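
For orientation, the conditional maximum entropy model from Berger et al. is usually written in the form below. Here $x$ is the context (for example, a word and its neighbours), $y$ is the tag, $f_i$ are feature functions, and $\lambda_i$ are their learned weights. The concrete features LTP uses are not described in this README, so this is only the general form.

```latex
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big)
```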

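Test results

ltp supports recognizing three entity types: person names, place names, and organization names (see the API).
Judging from the results, the entities it actually recognizes are person names, place names, and country names; it cannot correctly recognize the entities we need.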
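
To read results like these, the per-word tags have to be merged into entity strings. The sketch below is not taken from test_ltp.py; it assumes the tag scheme described in the LTP documentation (O for no entity, a position prefix S/B/I/E, and the types Nh, Ns, Ni for person, place, and organization). The function name and the sample sentence are made up for illustration.

```python
# Assumed LTP type codes; anything else is passed through unchanged.
LTP_TYPES = {"Nh": "person", "Ns": "place", "Ni": "organization"}


def decode_ltp_tags(words, tags):
    """Merge per-word O/S/B/I/E tags into (entity_text, entity_type) pairs."""
    entities = []
    buffer = []  # words of the multi-word entity currently being built
    for word, tag in zip(words, tags):
        if tag == "O":
            buffer = []
            continue
        position, _, label = tag.partition("-")
        entity_type = LTP_TYPES.get(label, label)
        if position == "S":            # single-word entity
            entities.append((word, entity_type))
            buffer = []
        elif position == "B":          # first word of an entity
            buffer = [word]
        elif position == "I":          # middle word; ignored if no B came first
            if buffer:
                buffer.append(word)
        elif position == "E":          # last word closes the entity
            if buffer:
                buffer.append(word)
                entities.append(("".join(buffer), entity_type))
            buffer = []
    return entities


# Hypothetical input: words and tags as a segmenter and recognizer might return them.
words = ["国务院", "总理", "李克强", "调研", "上海", "外高桥"]
tags = ["S-Ni", "O", "S-Nh", "O", "B-Ns", "E-Ns"]
print(decode_ltp_tags(words, tags))
```

Because only Nh, Ns, and Ni can appear in the output, any domain-specific entity type is out of reach without retraining, which matches the result reported above.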

Jiagu

GitHub repository. Jiagu is an open-source project from 思知 (ownthink).

Setup

pip install jiagu

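Structure

  • test_jiagu.py is the test script
  • test_jiagu/result.txt holds part of the results from processing the OJ (online judge) text
  • test_jiagu/result_sample.txt holds the results from processing plain-text political news data

Technical principles

No paper was found. Judging from the source code, Jiagu uses a BiLSTM+CRF approach.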
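
In a BiLSTM+CRF tagger, the BiLSTM produces a score for every tag at every token, and the CRF layer adds a score for every tag-to-tag transition; the final tag sequence is the highest-scoring path, found with Viterbi decoding. The sketch below illustrates that decoding step in plain Python. It is not Jiagu's code, and the scores are invented for the example.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][k]   : score of tag k at token t (the BiLSTM output in a real model)
    transitions[i][j] : score of moving from tag i to tag j (the CRF parameters)
    """
    n_tags = len(transitions)
    scores = list(emissions[0])   # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this token.
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
        scores = new_scores
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path.
    path = [max(range(n_tags), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B-PER", "I-PER"]
# Invented scores: the second token slightly prefers O on emissions alone.
emissions = [[0.1, 2.0, 0.0], [1.0, 0.0, 0.9]]
# Invented transitions: B-PER -> I-PER is rewarded, O -> I-PER is heavily penalized.
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
print([TAGS[k] for k in viterbi(emissions, transitions)])
```

The transition scores are what let the model prefer a well-formed B-PER, I-PER sequence over tagging each token independently.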

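Test results

Like ltp, Jiagu recognizes only person names, place names, and organization names. It also handles mixed Chinese-English data poorly (compare result_sample.txt with result.txt). The tagging scheme Jiagu uses: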

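B-PER, I-PER: person name
B-LOC, I-LOC: place name
B-ORG, I-ORG: organization name
B-XX marks the beginning of an entity name
I-XX marks a word in the middle or at the end of an entity name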
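
The tagging scheme above can be turned back into entity strings with a small decoder. This is a minimal sketch rather than the code in test_jiagu.py; the function name and the sample tokens and tags are hypothetical.

```python
def decode_bio(tokens, tags):
    """Merge B-XX / I-XX tags into (entity_text, entity_type) pairs."""
    entities = []
    current, current_type = [], None
    # The trailing ("", "O") sentinel flushes an entity that ends the sentence.
    for token, tag in list(zip(tokens, tags)) + [("", "O")]:
        if tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)      # continue the open entity
            continue
        if current:                    # any other tag closes the open entity
            entities.append(("".join(current), current_type))
        if tag.startswith("B-"):
            current, current_type = [token], tag[2:]
        else:                          # "O", or an I-XX with no matching B-XX
            current, current_type = [], None
    return entities


# Hypothetical segmented sentence with its tags.
tokens = ["姚明", "出生", "于", "上海", "市"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(decode_bio(tokens, tags))
```

Tokens are joined without spaces, which suits Chinese text; mixed Chinese-English input would need a different join rule.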

coreNLP

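stanfordcorenlp wraps the NLP toolkit developed at Stanford University. Only its named entity recognition is used here. Compared with the previous tools, it can also recognize dates, times, and similar entities.

Setup

  • Download CoreNLP (link)
  • Download the Chinese model package (link)
  • Unzip CoreNLP to path/to/corenlp and place the Chinese package in that folder.
  • pip install stanfordcorenlp
  • Call it with nlp = StanfordCoreNLP("path/to/corenlp", lang="zh")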
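
Once the steps above are done, a call might look like the sketch below. It assumes the stanfordcorenlp wrapper's ner method returns a list of (token, tag) pairs with O for non-entities, and that path/to/corenlp is the directory from the setup steps. The helper names are made up, and this is not the code in test_stanfordcorenlp.py.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens that share the same non-O tag.

    Note: two distinct adjacent entities of the same type are merged into one,
    because the tags carry no B-/I- boundary markers.
    """
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append(("".join(token for token, _ in run), tag))
    return entities


def recognize(corenlp_dir, sentence):
    """Run Chinese NER on one sentence and return merged entities."""
    # Imported lazily so group_entities can be used without the package installed.
    from stanfordcorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(corenlp_dir, lang="zh")
    try:
        return group_entities(nlp.ner(sentence))
    finally:
        nlp.close()  # shuts down the background Java server


if __name__ == "__main__":
    # Requires Java plus the CoreNLP directory and Chinese models from the steps above.
    print(recognize("path/to/corenlp", "清华大学位于北京。"))
```

Closing the client in a finally block matters because the wrapper starts a Java process that otherwise keeps running.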

Structure

  • test_stanfordcorenlp.py is the test script
  • test_stanford_corenlp/result.txt holds part of the recognition results

Technical principles

Paper: Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

  • Model: mainly CRF.
  • Corpus: CoNLL, ACE, MUC, ERE

Test results

Of the three tools, corenlp performs best and recognizes a wider range of entity types, but it still cannot recognize the target entities.

open_source_tools_for_entity_recognization's People

Contributors

lif323
