GithubHelp home page GithubHelp logo

xiaoqingwang / zh-ner-tf Goto Github PK

View Code? Open in Web Editor NEW

This project forked from determined22/zh-ner-tf

0.0 0.0 0.0 109.75 MB

A very simple BiLSTM-CRF model for Chinese Named Entity Recognition 中文命名实体识别 (TensorFlow)

Perl 33.22% Python 66.78%

zh-ner-tf's Introduction

A simple BiLSTM-CRF model for Chinese Named Entity Recognition

This repository includes the code for buliding a very simple character-based BiLSTM-CRF sequence labeling model for Chinese Named Entity Recognition task. Its goal is to recognize three types of Named Entity: PERSON, LOCATION and ORGANIZATION.

This code works on Python 3 & TensorFlow 1.2 and the following repository https://github.com/guillaumegenthial/sequence_tagging gives me much help.

Model

This model is similar to the models provided by paper [1] and [2]. Its structure looks just like the following illustration:

Network

For one Chinese sentence, each character in this sentence has / will have a tag which belongs to the set {O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG}.

The first layer, look-up layer, aims at transforming each character representation from one-hot vector into character embedding. In this code I initialize the embedding matrix randomly and I know it looks too simple. We could add some language knowledge later. For example, do tokenization and use pre-trained word-level embedding, then every character in one token could be initialized with this token's word embedding. In addition, we can get the character embedding by combining low-level features (please see paper[2]'s section 4.1 and paper[3]'s section 3.3 for more details).

The second layer, BiLSTM layer, can efficiently use both past and future input information and extract features automatically.

The third layer, CRF layer, labels the tag for each character in one sentence. If we use a Softmax layer for labeling, we might get ungrammatic tag sequences beacuse the Softmax layer labels each position independently. We know that 'I-LOC' cannot follow 'B-PER' but Softmax doesn't know. Compared to Softmax, a CRF layer can use sentence-level tag information and model the transition behavior of each two different tags.

Dataset

#sentence #PER #LOC #ORG
train 46364 17615 36517 20571
test 4365 1973 2877 1331

It looks like a portion of MSRA corpus. I downloaded the dataset from the link in ./data_path/original/link.txt

data files

The directory ./data_path contains:

  • the preprocessed data files, train_data and test_data
  • a vocabulary file word2id.pkl that maps each character to a unique id

For generating vocabulary file, please refer to the code in data.py.

data format

Each data file should be in the following format:

中	B-LOC
国	I-LOC
很	O
大	O

句	O
子	O
结	O
束	O
是	O
空	O
行	O

If you want to use your own dataset, please:

  • transform your corpus to the above format
  • generate a new vocabulary file

How to Run

train

python main.py --mode=train

test

python main.py --mode=test --demo_model=1521112368

Please set the parameter --demo_model to the model that you want to test. 1521112368 is the model trained by me.

An official evaluation tool for computing metrics: here (click 'Instructions')

My test performance:

P R F F (PER) F (LOC) F (ORG)
0.8945 0.8752 0.8847 0.8688 0.9118 0.8515

demo

python main.py --mode=demo --demo_model=1521112368

You can input one Chinese sentence and the model will return the recognition result:

demo_pic

References

[1] Bidirectional LSTM-CRF Models for Sequence Tagging

[2] Neural Architectures for Named Entity Recognition

[3] Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition

[4] https://github.com/guillaumegenthial/sequence_tagging

zh-ner-tf's People

Contributors

determined22 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.