nacyzhaomin / mitie_chinese_wikipedia_corpus

This project forked from howl-anderson/mitie_chinese_wikipedia_corpus

Pre-trained Wikipedia corpus by MITIE

License: MIT License

mitie_chinese_wikipedia_corpus's Introduction

Chinese Wikipedia MITIE Corpus

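This project provides tools and guidance for training a MITIE model on a Chinese corpus. Training the model normally takes about three days on a high-spec server with a fast network connection, so to save time this project also provides pre-trained models.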

Training from scratch

Building the Wikipedia corpus

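See the chinese-wikipedia-corpus-creator project. The final data directory for the Wikipedia corpus is third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files. There are two ways to obtain the data: download the preprocessed corpus directly, or process the corpus from scratch.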

Downloading the preprocessed corpus

Download the files that chinese-wikipedia-corpus-creator has already processed. The download link is in the Release of chinese-wikipedia-corpus-creator. After downloading, place the files in third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files
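The placement step can be sketched as a small shell helper. This is only a sketch: it assumes the release has already been downloaded and unpacked into a local directory, and the function name `install_corpus` is hypothetical, not part of either project.

```shell
#!/bin/sh
# Hypothetical helper: copy an already-unpacked corpus into the directory
# that the training step reads from.
# Usage: install_corpus SRC_DIR [DEST_DIR]
install_corpus() {
    src_dir="$1"
    dest_dir="${2:-third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files}"

    if [ ! -d "$src_dir" ]; then
        echo "no such directory: $src_dir" >&2
        return 1
    fi

    # Create the target directory (and its parents) if it does not exist yet.
    mkdir -p "$dest_dir"

    # Copy the contents of SRC_DIR, not the directory itself.
    cp -R "$src_dir"/. "$dest_dir"/
}
```

Run from the project root, `install_corpus /path/to/unpacked/release` would populate the default target directory; the source path here is a placeholder.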

Processing the corpus from scratch

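Download or clone the chinese-wikipedia-corpus-creator source code into third-party/chinese-wikipedia-corpus-creator, then follow that project's documentation to run the relevant code and generate the Chinese Wikipedia corpus. Make sure the final output files are located in third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files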
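Whichever way the corpus was obtained, it may be worth confirming that the output directory is in place before starting a multi-day training run. A minimal sketch follows; the function name `check_corpus_dir` and the `CORPUS_DIR` variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Default location of the corpus, relative to the project root.
CORPUS_DIR="${CORPUS_DIR:-third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files}"

# Hypothetical helper: succeed only if the directory exists and holds
# at least one regular file.
check_corpus_dir() {
    dir="$1"

    if [ ! -d "$dir" ]; then
        echo "missing corpus directory: $dir" >&2
        return 1
    fi

    # Count regular files anywhere below the directory.
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    if [ "$count" -eq 0 ]; then
        echo "corpus directory is empty: $dir" >&2
        return 1
    fi

    echo "found $count corpus file(s) in $dir"
}
```

Calling `check_corpus_dir "$CORPUS_DIR"` from the project root returns a non-zero status when the directory is missing or empty, so it can guard the training command.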

Building the MITIE tools

Getting the MITIE source code

Here MITIE is cloned into this project's third-party directory (run from the project root):

$ git clone https://github.com/mit-nlp/MITIE.git third-party/MITIE

Compiling MITIE

MITIE is a collection of tools; this project only needs the wordrep tool.

$ cd third-party/MITIE/tools/wordrep
$ mkdir build
$ cd build
$ cmake ..
$ cmake --build . --config Release

Training the model

$ ./third-party/MITIE/tools/wordrep/build/wordrep --count-words 800000 --word-vects --basic-morph --cca-morph ./third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files
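Because the run is long, it can help to fail fast when the binary has not been built or the corpus directory is missing. The wrapper below is a sketch under that assumption: `train_wordrep` is a hypothetical name, and the wordrep flags are copied unchanged from the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the training command above.
# Usage: train_wordrep WORDREP_BINARY CORPUS_DIR
train_wordrep() {
    wordrep_bin="$1"
    corpus_dir="$2"

    # The build step must have produced an executable.
    if [ ! -x "$wordrep_bin" ]; then
        echo "wordrep not built or not executable: $wordrep_bin" >&2
        return 1
    fi

    # The corpus must be in place before training starts.
    if [ ! -d "$corpus_dir" ]; then
        echo "missing corpus directory: $corpus_dir" >&2
        return 1
    fi

    # Same flags as the documented command.
    "$wordrep_bin" --count-words 800000 --word-vects --basic-morph --cca-morph "$corpus_dir"
}
```

From the project root this would be invoked as `train_wordrep ./third-party/MITIE/tools/wordrep/build/wordrep ./third-party/chinese-wikipedia-corpus-creator/token_cleaned_plain_files`.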

Downloading the pre-trained model

See releases for the list of downloadable models (a fast download link for ** users is also provided).

How to contribute

Please read CONTRIBUTING.md and send us pull requests.

Versioning

This project uses SemVer for versioning. For all available versions, see the tags on this repository.

Authors

See contributors for the full list of people who have contributed.

License

This project is licensed under the MIT License; see LICENSE.md for details.

Acknowledgements
