This project forked from andriymulyar/bert_document_classification

Architectures and pre-trained models for long document classification.

📖 BERT Long Document Classification 📖

An easy-to-use interface to fully trained BERT-based models for multi-class and multi-label long document classification.

Pre-trained models are currently available for two clinical note (EHR) phenotyping tasks: smoker identification and obesity detection.

To sustain future development and improvements, we interface pytorch-transformers for all language model components of our architectures. Additionally, there is a blog post describing the architecture.

| Model | Dataset | # Labels | Evaluation F1 |
|---|---|---|---|
| n2c2_2006_smoker_lstm | I2B2 2006: Smoker Identification | 4 | 0.981 |
| n2c2_2008_obesity_lstm | I2B2 2008: Obesity and Co-morbidities Identification | 15 | 0.997 |

Installation

Install with pip:

pip install bert_document_classification

or directly:

pip install git+https://github.com/AndriyMulyar/bert_document_classification

Use

Maps text documents of arbitrary length to binary vectors indicating labels.

from bert_document_classification.models import SmokerPhenotypingBert
from bert_document_classification.models import ObesityPhenotypingBert

smoking_classifier = SmokerPhenotypingBert(device='cuda', batch_size=10)  # defaults to GPU prediction

obesity_classifier = ObesityPhenotypingBert(device='cpu', batch_size=10)  # or CPU if you would like

smoking_classifier.predict(["I'm a document! Make me long and the model can still perform well!"])

More examples.
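Since the classifiers map each document to a binary vector of labels, a small helper can turn such a vector back into readable label names. The sketch below is illustrative only: `decode_labels` and `EXAMPLE_LABELS` are hypothetical names that are not part of this package, and the exact return type of `predict` and the real label order should be checked against the package itself.

```python
from typing import List, Sequence

# Hypothetical label names for illustration only; the real label order
# is defined by the trained model, not by this sketch.
EXAMPLE_LABELS = ["label_a", "label_b", "label_c", "label_d"]


def decode_labels(binary_vector: Sequence[int], label_names: Sequence[str]) -> List[str]:
    """Return the names whose position in the binary vector is set to 1."""
    if len(binary_vector) != len(label_names):
        raise ValueError("vector length must match the number of label names")
    return [name for flag, name in zip(binary_vector, label_names) if int(flag) == 1]


if __name__ == "__main__":
    # A multi-label vector with the first and third labels active.
    print(decode_labels([1, 0, 1, 0], EXAMPLE_LABELS))
```

For a multi-class task exactly one position would be set; for a multi-label task any number of positions may be set, and the helper handles both cases the same way.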

Replication

Go to the directory /examples/ml4health_2019_replication. The README in that directory gives instructions on how to insert data from DBMI to replicate the results in the paper.

Notes

  • For training you will need a GPU.
  • For bulk inference where speed is not a concern, a machine with plenty of available memory and CPU cores will likely work.
  • Model downloads are cached in ~/.cache/torch/bert_document_classification/. Try clearing this folder if you have issues.
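The notes above can be sketched as two small helpers: one that picks a device string to pass as the `device` argument, and one that clears the model cache folder. Both `pick_device` and `clear_model_cache` are hypothetical helper names that are not part of this package; the cache path is the one quoted in the notes, and the GPU check assumes PyTorch is installed (falling back to CPU otherwise).

```python
import shutil
from pathlib import Path

# Cache location quoted in the notes above.
DEFAULT_CACHE = Path.home() / ".cache" / "torch" / "bert_document_classification"


def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch we can only report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def clear_model_cache(cache_dir: Path = DEFAULT_CACHE) -> bool:
    """Delete the cache directory; return True if something was removed."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_model_cache()` with no arguments removes the default folder, so cached models would presumably have to be downloaded again on the next use.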

Acknowledgement

If you found this project useful, consider citing our extended abstract.

@misc{mulyar2019phenotyping,
    title={Phenotyping of Clinical Notes with Improved Document Classification Models Using Contextualized Neural Language Models},
    author={Andriy Mulyar and Elliot Schumacher and Masoud Rouhizadeh and Mark Dredze},
    year={2019},
    eprint={1910.13664},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Implementation, development and training in this project were supported by funding from the Mark Dredze Lab at Johns Hopkins University.
