An NLP model that classifies each sentence of an abstract by the role it plays (e.g. objective, methods, results) so that researchers can skim the literature and dive deeper when necessary.
- More specifically, this project replicates the deep learning model behind the 2017 paper PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts.
- The PubMed 200k RCT dataset is described in Franck Dernoncourt, Ji Young Lee. PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. International Joint Conference on Natural Language Processing (IJCNLP). 2017.
Some miscellaneous information:
- PubMed 20k is a subset of PubMed 200k. I.e., any abstract present in PubMed 20k is also present in PubMed 200k.
- `PubMed_200k_RCT` is the same as `PubMed_200k_RCT_numbers_replaced_with_at_sign`, except that in the latter all numbers have been replaced by `@` (same for `PubMed_20k_RCT` vs. `PubMed_20k_RCT_numbers_replaced_with_at_sign`).
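To make the difference between the two dataset variants concrete, here is a minimal sketch of the kind of substitution involved. The regular expression is an assumption for illustration (a run of digits with an optional decimal part becomes one `@`); the exact rule the dataset authors applied may differ, and `replace_numbers_with_at_sign` is a hypothetical helper name.

```python
import re

# Assumption: a run of digits, optionally followed by a decimal part,
# is collapsed into a single "@". The dataset's real rule may differ.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def replace_numbers_with_at_sign(text: str) -> str:
    """Return `text` with every number replaced by '@' (hypothetical helper)."""
    return _NUMBER.sub("@", text)


if __name__ == "__main__":
    sentence = "A total of 125 patients received 2.5 mg daily for 12 weeks."
    print(replace_numbers_with_at_sign(sentence))
```

Masking numbers this way shrinks the vocabulary, since every distinct numeric value maps to the same token.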
- Naive Bayes Model -> 72% Accuracy
- Conv1D Model -> 78% Accuracy
- Model using pretrained token embedding (Universal Sentence Encoder) -> 75% Accuracy
- Conv1D Model using character level embedding -> 73% Accuracy
- Model with both token and character level embedding -> 76% Accuracy
- Model with token, character and position level embedding ( https://arxiv.org/pdf/1612.05251.pdf ) -> 81% Accuracy
- Model described in this paper with BERT embedding -> 88% Accuracy
- Model trained on 200k dataset reached 87.86% Accuracy
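The position-level input mentioned in the list above can be illustrated with a small sketch. This is one common way to encode where a sentence sits in its abstract, not necessarily the encoding used in this project: the field names `line_number` and `total_lines`, the helper names, and the one-hot depth are all assumptions made for illustration.

```python
from typing import Dict, List, Union

Feature = Dict[str, Union[str, int]]


def add_position_features(sentences: List[str]) -> List[Feature]:
    """Attach the zero-based line index and the abstract length to each sentence."""
    total = len(sentences)
    return [
        {"text": sentence, "line_number": index, "total_lines": total}
        for index, sentence in enumerate(sentences)
    ]


def one_hot(index: int, depth: int) -> List[int]:
    """One-hot encode `index`; values beyond `depth - 1` share the last slot."""
    clipped = min(index, depth - 1)
    return [1 if position == clipped else 0 for position in range(depth)]


if __name__ == "__main__":
    abstract = [
        "To assess the effect of the treatment.",
        "Patients were randomized into two groups.",
        "Pain scores improved in the treatment group.",
    ]
    for feature in add_position_features(abstract):
        # Depth 15 is an arbitrary illustrative choice.
        print(feature["line_number"], one_hot(feature["line_number"], depth=15))
```

Clipping long abstracts into the last slot keeps the vector size fixed. A model can then concatenate these vectors with the token and character embeddings.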
- TensorFlow
- tensorflow_text
- tensorflow_hub
- scikit-learn (sklearn)
- Matplotlib
- numpy
- pandas
- spaCy
Install the dependencies with `pip install -r requirements.txt`.
Download a model, unzip it, and move it to `Skimit\data\model`:
- Model trained on 20k (F1 = 0.846): https://drive.google.com/drive/folders/17DLDS6eqA_6oYpQJ1DGB171AtKFuVrjG?usp=sharing
- Model trained on 200k (F1 = 0.880): https://drive.google.com/drive/folders/17DLDS6eqA_6oYpQJ1DGB171AtKFuVrjG?usp=sharing
Run the app with `streamlit run app.py`.