
This project forked from ajitrajasekharan/unsupervised_ner


Self-supervised NER prototype - updated version (69 entity types in 17 broad entity groups). Uses pretrained BERT models with no fine-tuning. State-of-the-art performance on 3 biomedical datasets.

License: MIT License

Shell 2.82% Python 97.18%

unsupervised_ner's Introduction

Self-supervised NER (prototype)

This repository contains code for performing NER with self-supervised learning (SSL) alone, without any supervised learning.


Post describing the second iteration of this method

Model performance on 11 datasets

Additional links

Installation

If the use case is to automatically detect all noun phrase spans in a sentence, then the POS tagger needs to be installed. If we only require specific phrases of interest in a sentence to be tagged (e.g. colorectal cancer), then the POS tagger install is not required. In the first use case, 7 microservices are started (the POS tagger is made up of two microservices). In the second use case, 5 microservices are started.

Step 1. Installing and starting microservices common to both use cases

Run ./setup.sh

This installs and loads all 5 microservices. When it finishes (assuming all goes well), it should display the output of a test query.
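If the test query output does not appear, one quick way to narrow things down is to check which services are actually accepting connections. The following is a minimal sketch, not part of this repository: the service names and port numbers are hypothetical placeholders, and the real values should be taken from the repository's scripts.

```python
import socket
from typing import Dict


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def report(services: Dict[str, int], host: str = "127.0.0.1") -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in services.items()}


if __name__ == "__main__":
    # Hypothetical names and ports, for illustration only; substitute the
    # values used by the scripts in this repository.
    example_services = {"service_a": 8000, "service_b": 8001}
    for name, up in report(example_services).items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```

A service reported as not reachable is a candidate for being started on its own to see its error output.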

Step 2. Install POS service

(This step can be skipped if we only require specific phrases to be tagged.)

Install the POS service using this link

Make sure to run both services described in the install instructions.

Note: the POS service requires a Python 2.7 environment.

Revision notes for major updates

July 2022

  • Added generation of the bootstrap file. These component files can be edited to improve the bootstrap list. Every time the bootstrap list is updated, we need to run the clustering script run.sh (choosing option 6) in bert_vector_clustering, both to magnify the list and to generate entity signatures for each vocabulary term for use in NER. A labeled set of entity files with instructions is present here

17 Jan 2022

  • Ensemble service of NER with two models tested on 11 NER benchmarks as described in this post.

17 Sept 2021

  • This can now be run as a service (see run_servers.sh).
  • Added a simple ensembling service for combining the results of multiple NER servers.

Second version usage notes

  • If the install runs into issues, we can start the services independently to isolate the problem.
  • First install the descriptors service and confirm it works. Then install the NER service. Do this for both models (bio and phi). Then test the ensemble service, which is in the ensemble subdirectory of the NER service.
  • Test sets to test the output of NER against 11 benchmarks are in this repository.
  • This repository can be used as a metric to test a pretrained model trained from scratch: we can give the model an F1-score just as we do for a fine-tuned model. To do this, we need to convert the human labels file (e.g. bootstrap_entities.txt) into magnified entity vectors using this repository: invoke run.sh and use the subword neighbor clustering option. To pick the initial terms to label (that is, to create bootstrap_entities.txt itself), run the same tool but choose the generate cluster option with adaptive clustering. This yields about 4k cluster pivots, which we can label and then turn into entity vectors. The entity vectors (e.g. labels.txt) can then be used with the descriptor service to test the model. If we are creating new entity types, the entity map file needs to be updated accordingly, either to map subtypes to types or to add new types.
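For readers unfamiliar with how an F1-score is computed for NER output, here is a minimal sketch of exact-match, span-level scoring. It is an illustration only: the benchmark scripts referenced above may score differently (for example, with partial matches or per-type averaging), and the entity type names and spans below are made up.

```python
from typing import Iterable, Set, Tuple

# (start token index, end token index, entity type)
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Iterable[Span], predicted: Iterable[Span]):
    """Exact-match scoring: a prediction counts only if span and type both match."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data only: one correct prediction, one miss, one false alarm.
gold = [(0, 1, "DISEASE"), (4, 4, "DRUG")]
pred = [(0, 1, "DISEASE"), (6, 6, "DRUG")]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.50 f1=0.50
```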

First version usage notes

The unsupervised NER tool can be used in three ways.

  1. To tag canned sentences (option 1)
    • $ python3 main_ner.py 1
  2. To tag custom sentences present in a file (option 2)
    • $ python3 main_ner.py 2 sample_test.txt
  3. To tag single entities in custom sentences present in a file (option 3), where each entity is marked within the sentence in the format name:__ entity __ . Concrete example: Cats and Dogs:__ entity __ are pets, where Dogs is the term to be tagged. Single or multiple words/phrases within a sentence can be tagged. Example: Her hypophysitis:__ entity __ secondary to ipilimumab:__ entity __ was well managed with supplemental:__ entity __ hormones:__ entity __
    • $ python main_NER.py 3 single_entity_test.txt
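Input lines for option 3 can be written by hand, but for larger files it may be easier to generate them. The sketch below appends a marker to selected words of a sentence. It is a hypothetical helper, not part of this repository, and the exact marker string (including any spaces inside it) is an assumption that should be checked against single_entity_test.txt before use.

```python
from typing import Iterable


def mark_terms(sentence: str, terms: Iterable[str], marker: str) -> str:
    """Append marker to every whitespace-separated word of sentence found in terms."""
    wanted = set(terms)
    out = []
    for word in sentence.split():
        out.append(word + marker if word in wanted else word)
    return " ".join(out)


# Assumed marker spelling; verify against the sample file in the repository.
MARKER = ":__entity__"

print(mark_terms("Cats and Dogs are pets", ["Dogs"], MARKER))
# prints: Cats and Dogs:__entity__ are pets
```

Matching here is by exact word, so a word followed directly by punctuation (for example "hormones.") is not marked unless it is listed in that form. Multi-word phrases are handled by listing each word of the phrase, as in the hypophysitis example above.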

License

This repository is covered by the MIT license.

The POS tagger/Dep parser that this service depends on is covered by a GPL license.

