This project forked from shangjingbo1226/autoner

Learning Named Entity Tagger from Domain-Specific Dictionary

Home Page: https://shangjingbo1226.github.io/AutoNER/

License: Apache License 2.0

AutoNER

Check Our New NER Toolkit 🚀🚀🚀

  • Inference:
    • LightNER: efficient inference with pre-trained models or models trained with any of the following tools.
  • Training:
    • LD-Net: train NER models with efficient contextualized representations.
    • VanillaNER: train vanilla NER models with pre-trained embeddings.
  • Distant Training:
    • AutoNER: train NER models without line-by-line annotations and get competitive performance.

Without line-by-line annotations, AutoNER trains named entity taggers with distant supervision.

Details about AutoNER can be accessed at: https://arxiv.org/abs/1809.03599

Model Notes

[Figure: AutoNER framework]

Benchmarks

Method                 Precision   Recall   F1
Supervised Benchmark   88.84       85.16    86.96
Dictionary Match       93.93       58.35    71.98
Fuzzy-LSTM-CRF         88.27       76.75    82.11
AutoNER                88.96       81.00    84.80
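
As a quick sanity check on the table above, the F1 column can be recomputed from the precision and recall columns. This is a minimal sketch that assumes F1 here is the usual harmonic mean of precision and recall; the function name f1_score is ours, not part of AutoNER. Because the table reports rounded values, the recomputed F1 can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the benchmark table.
BENCHMARKS = {
    "Supervised Benchmark": (88.84, 85.16, 86.96),
    "Dictionary Match": (93.93, 58.35, 71.98),
    "Fuzzy-LSTM-CRF": (88.27, 76.75, 82.11),
    "AutoNER": (88.96, 81.00, 84.80),
}

for method, (p, r, reported) in BENCHMARKS.items():
    computed = f1_score(p, r)
    print(f"{method:22s} computed={computed:.2f} reported={reported:.2f}")
```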

Training

Required Inputs

  • Tokenized Raw Texts
    • Example: data/BC5CDR/raw_text.txt
      • One token per line.
      • An empty line means the end of a sentence.
  • Two Dictionaries
    • Core Dictionary w/ Type Info
      • Example: data/BC5CDR/dict_core.txt
        • Two columns (i.e., Type, Tokenized Surface) per line.
        • Tab separated.
      • How to obtain?
        • From domain-specific dictionaries.
    • Full Dictionary w/o Type Info
      • Example: data/BC5CDR/dict_full.txt
        • One tokenized high-quality phrase per line.
      • How to obtain?
        • From domain-specific dictionaries.
        • By applying a high-quality phrase mining tool to a domain-specific corpus.
  • Pre-trained word embeddings
    • Train your own or download from the web.
    • The example run uses embedding/bio_embedding.txt, which can be downloaded from our group's server. For example, curl http://dmserv4.cs.illinois.edu/bio_embedding.txt -o embedding/bio_embedding.txt. Because the embedding encoding step consumes a lot of memory, we also provide the encoded file via autoner_train.sh.
  • [Optional] Development & Test Sets.
    • Example: data/BC5CDR/truth_dev.ck and data/BC5CDR/truth_test.ck
      • Three columns (i.e., token, Tie or Break label, entity type).
      • I is Break.
      • O is Tie.
      • Two special tokens <s> and <eof> mean the start and end of the sentence.
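
The file formats described above can be read with a few lines of Python. The sketch below is illustrative only and is not taken from the AutoNER code base: the function names are hypothetical, the .ck columns are assumed to be whitespace-separated, and the sample rows (including the None type) are made up to show the shape of the data.

```python
from typing import Iterable, List, Tuple


def read_raw_text(lines: Iterable[str]) -> List[List[str]]:
    """Raw text: one token per line, an empty line ends the sentence."""
    sentences, current = [], []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(current)
            current = []
    if current:  # the input may not end with an empty line
        sentences.append(current)
    return sentences


def read_core_dict(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Core dictionary: tab-separated (type, tokenized surface) per line."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        entity_type, surface = line.split("\t", 1)
        entries.append((entity_type.strip(), surface.strip()))
    return entries


# Label codes as stated in the README: I is Break, O is Tie.
LABEL_NAMES = {"I": "Break", "O": "Tie"}


def read_ck(lines: Iterable[str]) -> List[List[Tuple[str, str, str]]]:
    """Dev/test sets: (token, Tie or Break label, entity type) per line.

    Assumption: columns are whitespace-separated and <eof> closes a sentence.
    The <s> and <eof> rows are kept so that their labels are not lost.
    """
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip empty or malformed lines
        token, label, entity_type = parts
        current.append((token, LABEL_NAMES[label], entity_type))
        if token == "<eof>":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # Made-up sample rows, only to demonstrate the parsers.
    print(read_raw_text(["aspirin\n", "helps\n", "\n", "ok\n"]))
    print(read_core_dict(["Chemical\taspirin\n"]))
    print(read_ck(["<s> O None\n", "aspirin I Chemical\n", "<eof> I None\n"]))
```

In practice the same functions can be pointed at open file handles, for example open("data/BC5CDR/raw_text.txt", encoding="utf-8").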

Dependencies

This project requires python>=3.6. The dependent packages are listed below:

numpy==1.13.1
tqdm
torch-scope>=0.5.0
pytorch==0.4.1
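
Before training, it can help to confirm that the interpreter and the packages above are available. The following is a minimal sketch, not part of AutoNER: it assumes the import names numpy, tqdm, torch_scope, and torch (PyTorch is imported as torch), and it only checks that the modules can be found, not that the pinned versions match.

```python
import importlib.util
import sys
from typing import Iterable, List, Tuple

# Assumed import names for the packages listed above.
REQUIRED_MODULES = ["numpy", "tqdm", "torch_scope", "torch"]


def python_ok(minimum: Tuple[int, int] = (3, 6)) -> bool:
    """True if the running interpreter is at least the given version."""
    return sys.version_info >= minimum


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_ok():
        print("Python 3.6 or newer is required.")
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```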

Command

To train an AutoNER model, please run

./autoner_train.sh

To apply the trained AutoNER model, please run

./autoner_test.sh

You can specify the parameters in the bash files. The variable names are self-explanatory.

Citation

Please cite the following two papers if you are using our tool. Thanks!

@inproceedings{shang2018learning,
  title = {Learning Named Entity Tagger using Domain-Specific Dictionary}, 
  author = {Shang, Jingbo and Liu, Liyuan and Ren, Xiang and Gu, Xiaotao and Ren, Teng and Han, Jiawei}, 
  booktitle = {EMNLP}, 
  year = 2018, 
}

@article{shang2018automated,
  title = {Automated phrase mining from massive text corpora},
  author = {Shang, Jingbo and Liu, Jialu and Jiang, Meng and Ren, Xiang and Voss, Clare R and Han, Jiawei},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year = {2018},
  publisher = {IEEE}
}
