youmu257 / nlp-python

Commonly used Python tools for text preprocessing and NLP, covering both Chinese and English.

License: MIT License

Python 100.00%

nlp-python's Introduction

Common NLP tools

Requirement

  1. jieba : pip install jieba
  2. OpenCC for Python : download from link
  3. NLTK : pip install nltk
  4. gensim : pip install gensim

Usage

Preprocessing

  1. Chinese segmentation
    We use jieba to segment Chinese corpora.
    Run python ChineseSegment.py input.txt output.txt user_dict (user_dict is optional and defaults to None)
    to segment a corpus file, or adapt the sample code below in your own code.

     # sample code
     # assumes the ChineseSegment class is defined in ChineseSegment.py
     from ChineseSegment import ChineseSegment

     # initialize jieba (optionally with a user dictionary)
     segment = ChineseSegment()
     # segment = ChineseSegment('data//dictionary//user_dict.dict')
     arr = ['測試','正在測試','今天晚餐吃啥','小明碩士畢業於中國科學院計算所']
     # segment a single string or a list of strings
     print(segment.cut(arr[3]))
     print(segment.cut(arr))
     # segment a string or a list of strings in search mode
     print(segment.cutForSearch(arr[3]))
     # segment the input sentence and tag each token with its part of speech
     print(segment.tokenizer(arr[3]))
    
  2. Conversion between Traditional and Simplified Chinese
    We use OpenCC for Python to convert.
    Run python opencc.py input.txt output.txt conversion_model (default: s2t, Simplified to Traditional) to convert a corpus file.

  3. Common corpus preprocessing
    This tool provides stemming, stopword filtering, punctuation removal, URL removal, and lowercasing.
    Options are parsed with argparse. Use -i and -o to set the input and output file paths. Then add -s, -p, -u, and -sw to enable stemming, punctuation filtering, URL filtering, and stopword filtering respectively, or -a to turn on all of them.
    For example: python Preprocessing.py -i input.txt -o output.txt -a
    Sample code (a demo function) is also provided in Preprocessing.py.
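
The command-line options described in item 3 can be sketched roughly as follows. This is not the code in Preprocessing.py, only a minimal illustration under stated assumptions: the long option names, the function names build_parser, preprocess_line and main, the tiny STOPWORDS set, and the URL regex are all hypothetical, and lowercasing is assumed to be always applied. Stemming is delegated to a callable such as NLTK's PorterStemmer().stem.

```python
import argparse
import re
import string

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "at"}
# Assumed URL pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def build_parser():
    """Build a parser with the short flags from the README (long names are assumptions)."""
    parser = argparse.ArgumentParser(description="Corpus preprocessing sketch")
    parser.add_argument("-i", "--input", required=True, help="input file path")
    parser.add_argument("-o", "--output", required=True, help="output file path")
    parser.add_argument("-s", "--stem", action="store_true", help="apply stemming")
    parser.add_argument("-p", "--punctuation", action="store_true", help="remove punctuation")
    parser.add_argument("-u", "--url", action="store_true", help="remove URLs")
    parser.add_argument("-sw", "--stopwords", action="store_true", help="remove stopwords")
    parser.add_argument("-a", "--all", action="store_true", help="enable every step")
    return parser


def preprocess_line(line, stem=False, punctuation=False, url=False,
                    stopwords=False, stemmer=None):
    """Lowercase one line, then apply the enabled steps in a fixed order."""
    text = line.lower()
    if url:
        # URLs go first; stripping punctuation first would mangle them.
        text = URL_PATTERN.sub(" ", text)
    if punctuation:
        text = text.translate(PUNCT_TABLE)
    tokens = text.split()
    if stopwords:
        tokens = [token for token in tokens if token not in STOPWORDS]
    if stem and stemmer is not None:
        tokens = [stemmer(token) for token in tokens]
    return " ".join(tokens)


def main(argv=None):
    args = build_parser().parse_args(argv)
    stem = args.stem or args.all
    stemmer = None
    if stem:
        # Imported lazily so the other steps work without NLTK installed.
        from nltk.stem import PorterStemmer
        stemmer = PorterStemmer().stem
    with open(args.input, encoding="utf-8") as src, \
            open(args.output, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = preprocess_line(
                line,
                stem=stem,
                punctuation=args.punctuation or args.all,
                url=args.url or args.all,
                stopwords=args.stopwords or args.all,
                stemmer=stemmer,
            )
            dst.write(cleaned + "\n")
```

Calling main(["-i", "input.txt", "-o", "output.txt", "-a"]) would mirror the example command above. Keeping each step behind its own flag makes it easy to compare the effect of, say, stopword filtering alone on a corpus.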

NLP tool

  1. Word2Vec
    An example of how to use Word2Vec in gensim. Word2Vec can be used to find similar opinion words and to cluster words. See word2vec.py for how to train a new model and test it. The code is based on this website. If you need a pre-trained English model, you can use the model trained by Google.
  2. Doc2Vec
    An example of how to use Doc2Vec in gensim.
    Doc2Vec can be used for document clustering, to support a QA system, or to find similar opinion sentences. See doc2vec.py for how to train a new model and test it. The code is based on this website.
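
For orientation, training and querying both models with gensim might look roughly like the sketch below. It is not the code in word2vec.py or doc2vec.py. It assumes gensim 4.x (where the dimension parameter is vector_size and document vectors live under model.dv) and a corpus that is already segmented, one document per line with tokens separated by spaces. The function names, hyperparameter values, and the corpus.txt path are hypothetical.

```python
def read_tokenized_corpus(path):
    """Read a pre-segmented corpus: one document per line, tokens split on whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.split() for line in handle if line.strip()]


def train_word2vec(sentences, vector_size=100):
    """Train a Word2Vec model on a list of token lists (illustrative settings)."""
    # Imported lazily so the corpus helper above works without gensim installed.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, vector_size=vector_size,
                    window=5, min_count=1, workers=1)


def train_doc2vec(documents, vector_size=100, epochs=20):
    """Train a Doc2Vec model, tagging each document with its index in the list."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    tagged = [TaggedDocument(words=tokens, tags=[index])
              for index, tokens in enumerate(documents)]
    return Doc2Vec(documents=tagged, vector_size=vector_size,
                   min_count=1, epochs=epochs, workers=1)


# Example usage (needs gensim; "corpus.txt" is a placeholder path):
#   corpus = read_tokenized_corpus("corpus.txt")
#   w2v = train_word2vec(corpus)
#   print(w2v.wv.most_similar(corpus[0][0], topn=5))   # nearest words by cosine similarity
#   d2v = train_doc2vec(corpus)
#   vector = d2v.infer_vector(corpus[0])
#   print(d2v.dv.most_similar([vector], topn=3))       # nearest training documents
```

For Chinese text, the corpus would first be segmented (for example with the jieba-based tool above) so that each line is already a sequence of space-separated tokens.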

To Do

  1. More preprocessing tools, such as parse trees and basic statistics.
  2. More examples of NLP tools, such as topic models and named entity recognition.

