
This project forked from yanqiangmiffy/ner-english


License: MIT License


ner-english

🐆 A study of English named entity recognition (NER)

Preparation

geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon

Models

  • 01_basline

    Simple tag-count features

                 precision    recall  f1-score   support
    
        B-art       0.20      0.05      0.09       402
        B-eve       0.54      0.25      0.34       308
        B-geo       0.78      0.85      0.81     37644
        B-gpe       0.94      0.93      0.94     15870
        B-nat       0.42      0.28      0.33       201
        B-org       0.67      0.49      0.56     20143
        B-per       0.78      0.65      0.71     16990
        B-tim       0.87      0.77      0.82     20333
        I-art       0.04      0.01      0.01       297
        I-eve       0.39      0.12      0.18       253
        I-geo       0.73      0.58      0.65      7414
        I-gpe       0.62      0.45      0.52       198
        I-nat       0.00      0.00      0.00        51
        I-org       0.69      0.53      0.60     16784
        I-per       0.73      0.65      0.69     17251
        I-tim       0.58      0.13      0.21      6528
            O       0.97      0.99      0.98    887908
    
      avg / total       0.94      0.95      0.94   1048575
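
One common reading of "tag-count features" is a memorizing baseline that predicts, for each word, the tag it carried most often in training and falls back to O for unseen words. The sketch below assumes that reading; the class name `MajorityTagger` and the toy data are hypothetical and not taken from the repository.

```python
from collections import Counter, defaultdict


class MajorityTagger:
    """Hypothetical baseline: predict each word's most frequent training tag."""

    def __init__(self, default_tag="O"):
        self.default_tag = default_tag
        self.best_tag = {}

    def fit(self, words, tags):
        counts = defaultdict(Counter)
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
        # Keep only the most frequent tag for every word seen in training.
        self.best_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        return self

    def predict(self, words):
        # Unseen words fall back to the default tag.
        return [self.best_tag.get(w, self.default_tag) for w in words]


# Toy data, made up for illustration.
train_words = ["South", "Waziristan", "in", "South", "Taliban", "South"]
train_tags = ["B-geo", "I-geo", "O", "B-geo", "B-org", "O"]

tagger = MajorityTagger().fit(train_words, train_tags)
prediction = tagger.predict(["Taliban", "in", "South", "Mosul"])
print(prediction)
```

A model like this has no notion of context, which may explain why rare multi-token classes such as I-art and I-nat score so poorly in the report above.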
    
  • 02_random_forest_classifier:

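    Basic features: whether the first letter is capitalized, whether the word is all lowercase, whether it is all uppercase, word length, whether it is a number, and whether it consists only of letters

    Context features: the tags and part-of-speech features of the surrounding words

    Method: RandomForestClassifier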

                 precision    recall  f1-score   support
    
        B-art       0.19      0.08      0.11       402
        B-eve       0.39      0.25      0.30       308
        B-geo       0.81      0.85      0.83     37644
        B-gpe       0.98      0.93      0.95     15870
        B-nat       0.28      0.28      0.28       201
        B-org       0.71      0.60      0.65     20143
        B-per       0.84      0.73      0.78     16990
        B-tim       0.90      0.79      0.84     20333
        I-art       0.05      0.02      0.02       297
        I-eve       0.21      0.10      0.13       253
        I-geo       0.74      0.64      0.69      7414
        I-gpe       0.80      0.45      0.58       198
        I-nat       0.40      0.20      0.26        51
        I-org       0.69      0.65      0.67     16784
        I-per       0.81      0.74      0.78     17251
        I-tim       0.76      0.47      0.58      6528
            O       0.98      0.99      0.99    887908
    
      avg / total       0.95      0.96      0.95   1048575
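
The basic and context features described for this model can be sketched with plain string methods. This is an illustration only: the function names, the `__PAD__` placeholder, and the choice to use one neighbor on each side are assumptions, and the repository's own feature code may differ. The neighbor-tag features mentioned in the text are left out for brevity.

```python
def word_features(word):
    """Map one token to the basic surface features listed in the text."""
    return [
        int(word.istitle()),   # title-cased (leading capital)
        int(word.islower()),   # all lowercase
        int(word.isupper()),   # all uppercase
        len(word),             # word length
        int(word.isdigit()),   # digits only
        int(word.isalpha()),   # letters only
    ]


def sentence_features(words, pos_tags, pad="__PAD__"):
    """Attach the POS tags of the previous and next word as context."""
    rows = []
    for i, word in enumerate(words):
        prev_pos = pos_tags[i - 1] if i > 0 else pad
        next_pos = pos_tags[i + 1] if i < len(words) - 1 else pad
        rows.append((word_features(word), prev_pos, pos_tags[i], next_pos))
    return rows


rows = sentence_features(["South", "Waziristan", "."], ["NNP", "NNP", "."])
print(rows[0])
```

The POS strings would still need a numeric encoding (for example integer or one-hot codes) before being passed to a random forest.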
    
  • 03_CRF (conditional random field)

    Features are essentially the same as above

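      # Assumed import: these parameters match the CRF estimator in sklearn-crfsuite
      from sklearn_crfsuite import CRF

      crf = CRF(algorithm='lbfgs',
                c1=0.1,  # L1 regularization coefficient
                c2=0.1,  # L2 regularization coefficient
                max_iterations=100,
                all_possible_transitions=False)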

    Training results: python 03_conditional_random_fields.py --action train

               precision    recall  f1-score   support
    
        B-art       0.37      0.11      0.17       402
        B-eve       0.52      0.35      0.42       308
        B-geo       0.85      0.90      0.88     37644
        B-gpe       0.97      0.94      0.95     15870
        B-nat       0.66      0.37      0.47       201
        B-org       0.78      0.72      0.75     20143
        B-per       0.84      0.81      0.82     16990
        B-tim       0.93      0.88      0.90     20333
        I-art       0.11      0.03      0.04       297
        I-eve       0.34      0.21      0.26       253
        I-geo       0.82      0.79      0.80      7414
        I-gpe       0.92      0.55      0.69       198
        I-nat       0.61      0.27      0.38        51
        I-org       0.81      0.79      0.80     16784
        I-per       0.84      0.89      0.87     17251
        I-tim       0.83      0.76      0.80      6528
            O       0.99      0.99      0.99    887908
    
      avg / total       0.97      0.97      0.97   1048575
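
The `avg / total` row appears to be a support-weighted average, so the O label (887908 of 1048575 tokens) dominates it. As a rough check, the sketch below recomputes the weighted F1 from the rounded per-label values in the CRF report, once over all labels and once over entity labels only. The function name `weighted_f1` is introduced here for illustration.

```python
# (label, f1, support) copied from the CRF report above.
crf_report = [
    ("B-art", 0.17, 402), ("B-eve", 0.42, 308), ("B-geo", 0.88, 37644),
    ("B-gpe", 0.95, 15870), ("B-nat", 0.47, 201), ("B-org", 0.75, 20143),
    ("B-per", 0.82, 16990), ("B-tim", 0.90, 20333), ("I-art", 0.04, 297),
    ("I-eve", 0.26, 253), ("I-geo", 0.80, 7414), ("I-gpe", 0.69, 198),
    ("I-nat", 0.38, 51), ("I-org", 0.80, 16784), ("I-per", 0.87, 17251),
    ("I-tim", 0.80, 6528), ("O", 0.99, 887908),
]


def weighted_f1(rows):
    """Average F1 weighted by each label's support."""
    total = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total


all_labels = weighted_f1(crf_report)
entities_only = weighted_f1([row for row in crf_report if row[0] != "O"])
print(f"all labels: {all_labels:.2f}, entity labels only: {entities_only:.2f}")
```

With these rounded inputs the result is roughly 0.97 over all labels and roughly 0.84 over entity labels only, so the headline number overstates how well the entities themselves are recognized.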
    

    Test results: python 03_conditional_random_fields.py --action test

      Word           ||True ||Pred
      ==============================
      Helicopter     : O     O
      gunships       : O     O
      Saturday       : B-tim B-tim
      pounded        : O     O
      militant       : O     O
      hideouts       : O     O
      in             : O     O
      the            : O     O
      Orakzai        : B-geo B-geo
      tribal         : O     O
      region         : O     O
      ,              : O     O
      where          : O     O
      many           : O     O
      Taliban        : B-org B-org
      militants      : O     O
      are            : O     O
      believed       : O     O
      to             : O     O
      have           : O     O
      fled           : O     O
      to             : O     O
      avoid          : O     O
      an             : O     O
      earlier        : O     O
      military       : O     O
      offensive      : O     O
      in             : O     O
      nearby         : O     O
      South          : B-geo B-geo
      Waziristan     : I-geo I-geo
      .              : O     O
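
The tags follow the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. A minimal sketch of turning such a tag sequence back into entity spans, using the end of the sentence above as input (the function name `extract_entities` is hypothetical):

```python
def extract_entities(words, tags):
    """Group BIO tags into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_type:
            current_words.append(word)  # continue the open span
            continue
        if current_type is not None:
            # Any other tag closes the open span.
            entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if prefix in ("B", "I"):
            # A B- tag (or a stray I- tag) opens a new span.
            current_type, current_words = label, [word]
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


words = ["in", "nearby", "South", "Waziristan", "."]
tags = ["O", "O", "B-geo", "I-geo", "O"]
print(extract_entities(words, tags))  # [('geo', 'South Waziristan')]
```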
    
  • 04_Bi-LSTM

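    Sentence length statistics: based on the sentence length figure, the maximum sentence length max_len is set to 50

    Training and test sets:

      X_train: (43163, 50)
      X_test:  (4796, 50)
      y_train: (43163, 50, 17)
      y_test:  (4796, 50, 17)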
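
These shapes imply that every sentence is padded or truncated to 50 word indices and every tag is one-hot encoded over 17 classes (16 B-/I- tags plus O). A pure-Python sketch of that encoding follows; the helper names and the padding index 0 are assumptions. In Keras the same step is typically done with `pad_sequences` and `to_categorical`.

```python
MAX_LEN = 50   # max_len from the text
N_TAGS = 17    # 16 B-/I- tags plus O


def pad(sequence, max_len=MAX_LEN, value=0):
    """Truncate or right-pad a list of indices to exactly max_len items."""
    return (sequence + [value] * max_len)[:max_len]


def one_hot(index, size=N_TAGS):
    """Return a list of zeros with a single 1 at position index."""
    row = [0] * size
    row[index] = 1
    return row


def encode(word_ids, tag_ids, pad_word=0, pad_tag=0):
    """Encode one sentence as (50,) word indices and (50, 17) tag vectors."""
    x = pad(word_ids, value=pad_word)
    y = [one_hot(t) for t in pad(tag_ids, value=pad_tag)]
    return x, y


# Made-up indices for a three-word sentence.
x, y = encode([12, 7, 431], [3, 0, 0])
print(len(x), len(y), len(y[0]))
```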
    

    model:

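      # Assumed imports (standalone Keras 2.x); n_words, n_tags and max_len are defined elsewhere in the script
      from keras.models import Model
      from keras.layers import Input, Embedding, Dropout, Bidirectional, LSTM, TimeDistributed, Dense

      input = Input(shape=(max_len,))
      model = Embedding(input_dim=n_words, output_dim=50, input_length=max_len)(input)
      model = Dropout(0.1)(model)
      model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
      out = TimeDistributed(Dense(n_tags, activation='softmax'))(model)  # softmax output layer
      model = Model(input, out)
      model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])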

    Training results: python 04_bilstm.py --action train

      Epoch 1/5
      38846/38846 [==============================] - 90s 2ms/step - loss: 0.1410 - acc: 0.9643 - val_loss: 0.0622 - val_acc: 0.9818
      Epoch 2/5
      38846/38846 [==============================] - 88s 2ms/step - loss: 0.0550 - acc: 0.9838 - val_loss: 0.0517 - val_acc: 0.9849
      Epoch 3/5
      38846/38846 [==============================] - 88s 2ms/step - loss: 0.0459 - acc: 0.9865 - val_loss: 0.0477 - val_acc: 0.9860
      Epoch 4/5
      38846/38846 [==============================] - 89s 2ms/step - loss: 0.0413 - acc: 0.9878 - val_loss: 0.0459 - val_acc: 0.9865
      Epoch 5/5
      38846/38846 [==============================] - 89s 2ms/step - loss: 0.0385 - acc: 0.9885 - val_loss: 0.0444 - val_acc: 0.9868
    

    Test results: python 04_bilstm.py --action test

      Word           ||True ||Pred
      ==============================
      The            : O     O
      French         : B-gpe B-gpe
      news           : O     O
      agency         : O     O
      ,              : O     O
      Agence         : B-org O
      France         : I-org B-geo
      Presse         : I-org I-geo
      ,              : O     O
      says           : O     O
      one            : O     O
      of             : O     O
      its            : O     O
      photographers  : O     O
      has            : O     O
      been           : O     O
      kidnapped      : O     O
      in             : O     O
      the            : O     O
      Gaza           : B-geo B-geo
      Strip          : I-geo I-geo
      .              : O     O
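
In this sentence the model misses the organization "Agence France Presse" and predicts a location "France Presse" instead. Token-level accuracy hides most of that error, while entity-level scoring does not. The sketch below compares the two on the tags shown above; the helper `spans` is hypothetical and counts an entity as correct only if its boundaries and type both match.

```python
def spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    found, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = start is not None and tag == "I-" + tags[start][2:]
        if start is not None and not continues:
            found.add((start, i, tags[start][2:]))
            start = None
        if tag != "O" and not continues:
            start = i
    return found


# True and predicted tags for the 22 tokens of the sentence above.
true = (["O", "B-gpe"] + ["O"] * 3 + ["B-org", "I-org", "I-org"]
        + ["O"] * 11 + ["B-geo", "I-geo", "O"])
pred = (["O", "B-gpe"] + ["O"] * 3 + ["O", "B-geo", "I-geo"]
        + ["O"] * 11 + ["B-geo", "I-geo", "O"])

token_accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
correct = spans(true) & spans(pred)
precision = len(correct) / len(spans(pred))
recall = len(correct) / len(spans(true))
print(f"token accuracy {token_accuracy:.2f}, "
      f"entity precision {precision:.2f}, entity recall {recall:.2f}")
```

Here 19 of 22 tokens are tagged correctly, but only 2 of the 3 true entities are recovered, which is worth keeping in mind when reading the accuracy values in the training log.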
    
  • 05_Bi-LSTM+CRF

    model:

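      # Assumed imports: standalone Keras plus the CRF layer from keras-contrib
      from keras.models import Model
      from keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense
      from keras_contrib.layers import CRF

      input = Input(shape=(max_len,))
      model = Embedding(input_dim=n_words + 1, output_dim=20,
                        input_length=max_len, mask_zero=True)(input)  # 20-dim embedding
      model = Bidirectional(LSTM(units=50, return_sequences=True,
                                 recurrent_dropout=0.1))(model)  # variational biLSTM
      model = TimeDistributed(Dense(50, activation="relu"))(model)  # a dense layer as suggested by neuralNer
      crf = CRF(n_tags)  # CRF layer
      out = crf(model)  # output
      model = Model(input, out)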
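
What a CRF output layer adds over a per-token softmax is that tags are chosen jointly: transition scores between adjacent tags can rule out sequences such as O followed by I-geo. The sketch below shows the standard Viterbi decoding step for a linear-chain CRF on made-up scores. It is not the CRF layer's own code, and all numbers are invented for illustration.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t (e.g. from the BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    n = len(tags)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best previous tag for ending in tag j at this position.
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Walk back from the best final tag.
    best = max(range(n), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return [tags[i] for i in reversed(path)]


tags = ["O", "B-geo", "I-geo"]
# Toy scores for "nearby South Waziristan".
emissions = [
    [2.0, 0.0, 0.0],
    [1.0, 0.9, 1.2],   # per-token argmax would pick I-geo here
    [0.0, 0.5, 2.0],
]
transitions = [
    [0.0, 0.0, -10.0],  # O -> I-geo is heavily penalized
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

greedy = [tags[max(range(len(tags)), key=row.__getitem__)] for row in emissions]
decoded = viterbi(emissions, transitions, tags)
print(greedy)   # ['O', 'I-geo', 'I-geo']
print(decoded)  # ['O', 'B-geo', 'I-geo']
```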

    Training results: python 05_bilstm_crf.py --action train

     Train on 38846 samples, validate on 4317 samples
     Epoch 1/5
     38846/38846 [==============================] - 137s 4ms/step - loss: 0.1651 - acc: 0.9546 - val_loss: 0.0691 - val_acc: 0.9766
     Epoch 2/5
     38846/38846 [==============================] - 136s 4ms/step - loss: 0.0513 - acc: 0.9815 - val_loss: 0.0429 - val_acc: 0.9834
     Epoch 3/5
     38846/38846 [==============================] - 131s 3ms/step - loss: 0.0365 - acc: 0.9855 - val_loss: 0.0376 - val_acc: 0.9849
     Epoch 4/5
     38846/38846 [==============================] - 132s 3ms/step - loss: 0.0315 - acc: 0.9871 - val_loss: 0.0344 - val_acc: 0.9859
     Epoch 5/5
     38846/38846 [==============================] - 131s 3ms/step - loss: 0.0287 - acc: 0.9879 - val_loss: 0.0339 - val_acc: 0.9857
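
The sample counts in this log are consistent with holding out 10% of the 43163 training sentences for validation. The `fit()` call is not shown in the source, so `validation_split = 0.1` below is an assumption; the arithmetic only checks that it would reproduce the logged numbers.

```python
n_train = 43163          # rows in X_train, from the shapes listed earlier
validation_split = 0.1   # assumed; the fit() call is not shown in the source

n_fit = int(n_train * (1 - validation_split))  # samples used for training
n_val = n_train - n_fit                        # samples held out for val_loss / val_acc
print(n_fit, n_val)  # 38846 4317
```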
    

    Test results: python 05_bilstm_crf.py --action test

      Word           ||True ||Pred
      ==============================
      His            : O     O
      schedule       : O     O
      includes       : O     O
      talks          : O     O
      with           : O     O
      King           : B-per B-per
      Juan           : I-per I-per
      Carlos         : I-per I-per
      and            : O     O
      Spanish        : B-gpe B-gpe
      Prime          : B-per B-per
      Minister       : I-per I-per
      Jose           : I-per I-per
      Luis           : I-per I-per
      Rodriguez      : I-per I-per
      Zapatero       : I-per I-per
      .              : O     O
    

Demo

The U.S. military in Iraq has sent a team of forensic experts to the northern city of Mosul to investigate the cause of Tuesday 's massive explosion at an American military base that killed 22 people and wounded 72 others .  

Resources

https://www.one-tab.com/page/9-sFlWS0TTO_Kbcrnv4bqA

Contributors

yanqiangmiffy
