
JaMIE: a Japanese Medical Information Extraction toolkit

Joint Japanese Medical Problem, Modality and Relation Recognition

In the field of Japanese medical information extraction, few analysis tools are available, and relation extraction remains an under-explored topic. In our paper, we first propose a novel relation annotation schema for investigating medical and temporal relations between medical entities in Japanese medical reports. We design a system with three components that jointly recognizes medical entities, classifies entity modalities, and extracts relations.

JaMIE system

Installation (python3.8)

git clone https://github.com/racerandom/JaMIE.git
cd JaMIE

Required python package

pip install -r requirements.txt

Morphological analyzer required:

mecab (juman-dict) by default
jumanpp
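
A quick way to confirm that the analyzer is installed and reachable from the command line (a minimal sanity check; the dictionary actually used depends on your MeCab configuration):

mecab -D
echo "肺癌を認める" | mecab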

Pretrained BERT required for training:

NICT-BERT (NICT_BERT-base_JapaneseWikipedia_32K_BPE)
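After downloading and unpacking the NICT BERT archive, point the $PRETRAINED_BERT variable used in the commands below at the unpacked model directory. The archive and directory names here are only illustrative:

unzip NICT_BERT-base_JapaneseWikipedia_32K_BPE.zip -d ~/models
export PRETRAINED_BERT=~/models/NICT_BERT-base_JapaneseWikipedia_32K_BPE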

Pre-processing: Batch Converter from XML (or raw text) to CONLL for Train/Test

The Train/Test phases require all train, dev, and test files to be converted to CONLL-style beforehand. For prediction, you also need to convert raw text to CONLL-style; please make sure the raw-text files use the .xml extension.

# --cv_num:    0 generates a single CONLL file; 5 generates 5-fold cross-validation splits
# --doc_level: generate document-level CONLL files ([SEP] denotes sentence boundaries); omit for sentence-level files
# --segmenter: please use mecab (with NICT BERT) currently
# --bert_dir:  pre-trained BERT or trained model directory
python data_converter.py \
--mode xml2conll \
--xml $XML_FILES_DIR \
--conll $OUTPUT_CONLL_DIR \
--cv_num 0 \
--doc_level \
--segmenter mecab \
--bert_dir $PRETRAINED_BERT
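
For prediction on raw, unannotated reports, the converter expects the same .xml file extension, so one simple approach is to copy the plain-text files under that extension before running the converter (the directory names below are only illustrative):

mkdir -p ./pred_xml ./pred_conll
for f in ./raw_reports/*.txt; do cp "$f" "./pred_xml/$(basename "${f%.txt}").xml"; done
python data_converter.py \
--mode xml2conll \
--xml ./pred_xml \
--conll ./pred_conll \
--cv_num 0 \
--segmenter mecab \
--bert_dir $PRETRAINED_BERT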

Train:

# --pretrained_model: the downloaded pre-trained NICT BERT
# --saved_model:      the directory where the trained model will be saved
# --batch_size:       depends on your GPU memory
# --fp16:             optional mixed-precision training (apex required)
CUDA_VISIBLE_DEVICES=$GPU_ID python clinical_joint.py \
--pretrained_model $PRETRAINED_BERT \
--train_file $TRAIN_FILE \
--dev_file $DEV_FILE \
--dev_output $DEV_OUT \
--saved_model $MODEL_DIR_TO_SAVE \
--enc_lr 2e-5 \
--batch_size 4 \
--warmup_epoch 2 \
--num_epoch 20 \
--do_train \
--fp16
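
The --fp16 option relies on NVIDIA apex, which is not installable from PyPI. A typical source install looks roughly like the following (exact build flags vary across apex, PyTorch, and CUDA versions, so treat this as a sketch):

git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --no-cache-dir ./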

Test:

We share models trained on radiography interpretation reports of Lung Cancer (LC) and general medical reports of Idiopathic Pulmonary Fibrosis (IPF).

You can either train a new model on your own training data or use our shared models for testing.

# --saved_model: where the trained (or downloaded shared) model is placed
CUDA_VISIBLE_DEVICES=$GPU_ID python clinical_joint.py \
--saved_model $SAVED_MODEL \
--test_file $TEST_FILE \
--test_output $TEST_OUT \
--batch_size 4

Batch Converter from predicted CONLL to XML

python data_converter.py \
--mode conll2xml \
--xml $XML_OUT_DIR \
--conll $TEST_OUT
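
Putting the three steps together, a complete prediction run over raw reports looks roughly like this. All directory and file names below are illustrative; the actual CONLL file names depend on how the converter names its output:

# 1. raw reports (with .xml extension) -> CONLL
python data_converter.py --mode xml2conll --xml ./pred_xml --conll ./pred_conll \
--cv_num 0 --segmenter mecab --bert_dir $PRETRAINED_BERT
# 2. run the trained joint model over the generated CONLL file
CUDA_VISIBLE_DEVICES=0 python clinical_joint.py --saved_model $SAVED_MODEL \
--test_file ./pred_conll/test.conll --test_output ./pred_conll/test.out --batch_size 4
# 3. predicted CONLL -> annotated XML
python data_converter.py --mode conll2xml --xml ./pred_xml_out --conll ./pred_conll/test.out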

Annotation Guideline of the training data (XML format)

We provide links to both the English and Japanese annotation guidelines.

TO-DO

Recognition accuracy can be improved by leveraging more training data or more robust pre-trained models. We are working on making the code compatible with Japanese DeBERTa.

Questions

If you have any questions related to the code or the papers, please feel free to send an email to Fei Cheng: [email protected] or [email protected]

Citation

If you use our code in your research, please cite the following papers:

@inproceedings{cheng-etal-2022-jamie,
  title={JaMIE: A Pipeline Japanese Medical Information Extraction System with Novel Relation Annotation},
  author={Fei Cheng and Shuntaro Yada and Ribeka Tanaka and Eiji Aramaki and Sadao Kurohashi},
  booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022)},
  year={2022}
}
@inproceedings{cheng2021jamie,
  title={JaMIE: A Pipeline Japanese Medical Information Extraction System},
  author={Fei Cheng and Shuntaro Yada and Ribeka Tanaka and Eiji Aramaki and Sadao Kurohashi},
  booktitle={arXiv},
  year={2021}
}
@inproceedings{yada-etal-2020-towards,
  title={Towards a Versatile Medical-Annotation Guideline Feasible Without Heavy Medical Knowledge: Starting From Critical Lung Diseases},
  author={Shuntaro Yada and Ayami Joh and Ribeka Tanaka and Fei Cheng and Eiji Aramaki and Sadao Kurohashi},
  booktitle={Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC 2020)},
  year={2020}
}
@inproceedings{cheng-etal-2020-dynamically,
  title={Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning},
  author={Fei Cheng and Masayuki Asahara and Ichiro Kobayashi and Sadao Kurohashi},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Findings Volume},
  year={2020}
}


jamie's Issues

Bug in Batch Converter from predicted CONLL to XML

Description
Thanks for your great work!
After trying the toolkit, I've encountered a problem when converting a predicted CONLL file to XML format.
It is assumed that two dictionaries in data_objects.py, span2tid and span2rel, should have the same keys; however, the real output from the model does not always follow this rule.
The detailed input data and error message are listed below.
I'd appreciate your guidance on finding a solution.

CONLL file example

#doc ./data/test_1.xml
0	1	B-t-key	_	['N']	[0]
1	秒	I-t-key	_	['N']	[1]
2	量	I-t-key	_	['on']	[2]
3	低下	I-t-key	_	['N']	[3]
4	率	I-t-key	_	['on']	[4]  

Error message

Traceback (most recent call last):
  File "data_converter.py", line 157, in <module>
    conll_to_xml(args.conll_dir, args.xml_dir)
  File "data_converter.py", line 108, in conll_to_xml
    doc_conll.doc_to_xml(xml_out)
  File "/home/aisd/heart/jamie/data_objects.py", line 305, in doc_to_xml
    tail_tid, tail_tag = span2tid[tail_span]
KeyError: (2, 3)

XML data format

Thanks for your great work.
The first step is to convert XML files to CONLL files for Train/Test.
However, I don't know what the XML data format for this step is.
Could you give a simple annotated example?
Thank you very much.
