
dmis-lab / biobert


Bioinformatics 2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Home Page: http://doi.org/10.1093/bioinformatics/btz682

License: Other

Languages: Python 95.59%, Perl 4.21%, Shell 0.20%

biobert's People

Contributors

dependabot[bot], jhyuklee, mohan-zhang-u, rajeshkp, winkelman, wonjininfo


biobert's Issues

ner_detokenize

Hi,

Thank you for your files and code; they are very helpful as I am new to NLP.

I have processed my NER dataset with BioBERT and tried to get entity-level metrics using the provided ner_detokenize.py script. However, I ran into an error:
Error! : len(ans['labels']) != len(bert_pred['labels']) : Please report us

The error seems to occur for two reasons:

  1. Sometimes BERT's WordPiece tokenization does not add a "##" prefix to the sub-word pieces. E.g. I have the word token "2013-10-23" in test.tsv, and BERT tokenizes it to "2013" "-" "10" "-" "23". Therefore combining pieces based on the "##" prefix is not fully accurate for me (see the sketch after this list).
  2. I tried to fix this by making the script also combine pieces when the label is 'X'. But this doesn't always work, since the label is a prediction and not the true label.
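
To make point 1 concrete, here is a minimal sketch (assuming the tokenization.py from this repo and the vocab.txt shipped with a BioBERT release):

import tokenization  # tokenization.py from this repo (originally from google-research/bert)

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=False)
print(tokenizer.tokenize("2013-10-23"))
# e.g. ['2013', '-', '10', '-', '23'] -- no '##' prefixes, because the basic
# tokenizer splits punctuation off before WordPiece runs, so re-joining pieces
# by the '##' prefix cannot recover the original token.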

I was wondering whether it would be possible to output the true labels from BioBERT as well, or do you have any other suggestions/fixes?

Thank you!

pre-training from scratch

Hi,

I am trying to pre-train a BERT-style model on my own dataset from scratch (not fine-tune), similar to your work on BioBERT. I understand that I need scripts like create_pretraining_data.py and run_pretraining.py to create the pre-training data and then pre-train on my text corpus. Moreover, the text documents should be split into one sentence per line, with documents separated by blank lines [similar to this].

I couldn't find a detailed write-up of how exactly I should run the scripts and in what order. This article tries to sum up the steps required to create a pre-trained model from scratch, but I am still not sure. The documentation for BioBERT is great, but I was wondering if I could get a more elaborate description of the steps you followed to create models like BioBERT; it would be extremely helpful for my work.
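
For reference, this is the order I am currently assuming, based on the flags of the upstream BERT scripts (all paths and hyperparameters below are placeholders, not your actual settings; --init_checkpoint is omitted since this is from scratch):

python create_pretraining_data.py \
     --input_file=./corpus/corpus.txt \
     --output_file=/tmp/tf_examples.tfrecord \
     --vocab_file=$VOCAB_DIR/vocab.txt \
     --do_lower_case=False \
     --max_seq_length=128 \
     --max_predictions_per_seq=20 \
     --masked_lm_prob=0.15 \
     --random_seed=12345 \
     --dupe_factor=5

python run_pretraining.py \
     --input_file=/tmp/tf_examples.tfrecord \
     --output_dir=/tmp/pretraining_output \
     --do_train=True \
     --do_eval=True \
     --bert_config_file=$VOCAB_DIR/bert_config.json \
     --train_batch_size=32 \
     --max_seq_length=128 \
     --max_predictions_per_seq=20 \
     --num_train_steps=1000000 \
     --num_warmup_steps=10000 \
     --learning_rate=1e-4

Is this roughly the order and set of flags you used?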

Thanks.

can't find `BioASQ-train-4b.json` for run_qa.py example

While running

python run_qa.py \
     --do_train=True \
     --do_predict=True \
     --vocab_file=$BIOBERT_DIR/vocab.txt \
     --bert_config_file=$BIOBERT_DIR/bert_config.json \
     --init_checkpoint=$BIOBERT_DIR/biobert_model.ckpt \
     --max_seq_length=384 \
     --train_batch_size=12 \
     --learning_rate=5e-6 \
     --doc_stride=128 \
     --num_train_epochs=5.0 \
     --do_lower_case=False \
     --train_file=$BIOASQ_DIR/BioASQ-train-4b.json \
     --predict_file=$BIOASQ_DIR/BioASQ-test-4b-1.json \
     --output_dir=/tmp/QA_output/

I can't find BioASQ-train-4b.json among the linked files.
I can only find BioASQ-train-factoid-4b.json,

but when I use it with batch size 12, I get the following error:

ValueError: train batch size 12 must be divisible by number of replicas 8
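
If I read the error right (my assumption: the TPUEstimator shards the global batch across 8 replicas), any train batch size that is a multiple of 8 should pass the check, e.g.:

     --train_batch_size=16

Is that the intended way to run it, or should the replica count be changed instead?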

Reproducing BioASQ Results

Hi,
First of all, thanks for the BioBERT models.
I was trying to reproduce the BioASQ 6b results of BioBERT (+PubMed +PMC) reported in Table 6, which states S=35.58 / L=51.39 / M=42.51. I'm using Hugging Face's run_squad.py script. I tried two settings: one without fine-tuning the pre-trained model on SQuAD, and one with fine-tuning.

For SQuAD training I used this command:

export BERT_MODEL=/path/to/pubmed_pmc_470k
export SQUAD_DIR=/path/to/squad

python run_squad.py \
  --bert_model $BERT_MODEL \
  --do_train \
  --do_predict \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --train_batch_size 6 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /path/to/squad_tuned_output

Then I used the fine-tuned model to train on the 6b training set.

export BERT_MODEL=/path/to/squad_tuned_output
export BIOBERT_BIOASQ_DIR=/path/to/biobert-bioasq-datafiles

python run_squad.py \
  --bert_model $BERT_MODEL \
  --do_train \
  --train_file $BIOBERT_BIOASQ_DIR/BioASQ-train-factoid-6b.json \
  --train_batch_size 6 \
  --learning_rate 5e-6 \
  --num_train_epochs 5.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /path/to/output

Below I summarize the results for each test batch of 6b (1-5) in both cases:

Without SQuAD fine-tuning

  S     L     M 
29.03 54.83 39.08 | 6b1
28.57 66.66 45.63 | 6b2
37.50 56.25 43.54 | 6b3
27.27 48.48 33.58 | 6b4
36.36 50.00 40.90 | 6b5
------------------------
31.74 55.24 40.54 | Avg

With SQuAD fine-tuning

  S     L     M 
29.03 64.51 43.38 | 6b1
38.09 76.19 52.77 | 6b2
46.87 62.50 53.12 | 6b3
39.39 48.48 42.17 | 6b4
38.63 52.27 43.71 | 6b5
-----------------------
38.40 60.79 47.03 | Avg

Assuming the paper also reports results averaged over all 5 test batches, my reproduction gives a boost of +2.82 / +9.40 / +4.52 for S, L and M respectively.

Would you comment on this? Do you think it is due to implementation differences between TensorFlow and PyTorch, or do you see an issue with the way I reproduced the results? IMO the difference is very large for L and M, and if you are able to reproduce the numbers I reported, it may be worth updating the arXiv version of the paper so that other researchers working on the same problems have the right SOTA.

Thanks!

Task Fine Tuning Parameters

Hi

Would you be able to provide your exact parameter setup for the individual task fine-tuning? It is unclear from the paper what was used for the reported numbers. Also, between v2 and v3 of the paper you mention trying different ranges (v2 mentions epochs, but in v3 you switched to mentioning learning rates).

Thanks

Tony

BioASQ datasets issues

I find that the test data are in SQuAD format, and the contexts used are not the same as the snippets in the Task B Phase B test files. I also read your paper and am confused about the data preprocessing, so I wonder where and how you obtained the source of the contexts?

Label leakage in construction of lines for run_ner.py?

Hi,

I think there is a potential label leak in run_ner.py, in that it leaks label information into the training data.

https://github.com/dmis-lab/biobert/blob/master/run_ner.py#L158 in particular causes information about the label to leak into the structure of the examples. For example, consider a dataset in which each sentence has only a single O token: the line above would make it obvious where the O is, because it would deterministically end up in the same position.

I ran some ablations, and in practice it doesn't seem to affect results too badly, but it might be a good thing to fix.

Can't get word embedding

Hi, I am trying to get word embedding vectors from BioBERT and compare them with the word embedding vectors I get from BERT.

However, I haven't been successful in running BioBERT.

I downloaded the weights from release v1.1-pubmed, and after unzipping them into a folder I run the following code:

out = open('prepoutput.json', 'w')

import os

os.system('python3 /content/biobert/extract_features.py '
          '--input_file=/content/biobert/sample_text.txt '
          '--vocab_file=/content/biobert_v1.1_pubmed/vocab.txt '
          '--bert_config_file=/content/biobert_v1.1_pubmed/bert_config.json '
          '--init_checkpoint=/content/biobert_v1.1_pubmed/model.ckpt.index '
          '--output_file=/content/prepoutput.json')

The printed output is "256" (a non-zero exit status from os.system, so the command failed) and the file "prepoutput.json" is empty.

Please guide me.

Unfortunately, my attempts at converting the weights to PyTorch weren't successful either.

Question regarding BioASQ training dataset

Dear Biobert authors,

We are trying to reproduce some of the results from the BioASQ QA task. In Table 6 of the BioBERT paper, the number of training data points in the 4b, 5b, and 6b datasets is 327, 486, and 618 respectively. However, in the JSON files BioASQ-train-factoid-4b.json, BioASQ-train-factoid-5b.json, and BioASQ-train-factoid-6b.json I find 209, 325, and 439 unique questions respectively. Why is there a difference?

Are the numbers quoted in Table 6 from before the unanswerable questions were removed? I am referencing this passage: "However, we observed that about 30% of each BioASQ factoid dataset was unanswerable in the extractive setting as the exact answer phrases did not appear in the given passages. Like Wiese et al. (2017), we excluded the unanswerable questions from the training sets, but regarded them as wrong predictions in test sets."

Thank You,

Kuhan

run_ner lines 154-164. Why '30'?

Hi, congratulations for this amazing work!
I am playing with the NER code you provided and trying to understand lines 154-164. If I remove this block, the code fails (in the detokenizer); we basically lose some words. So why do we need to split a sentence into 30-word sub-sentences? Is it a requirement of BERT? I also use max_seq_length=128, so the 30 does not make sense to me. Thanks!

Dimension size must be evenly divisible by N but is M for 'loss/Reshape_1'

Hello, I am trying to use BioBERT to fine-tune on the i2b2-2010 dataset, which has three entity types to extract: treatment, test, and problem. The only code I changed in run_ner.py is the get_labels function in the NerProcessor class. I changed it to:

def get_labels(self):
    return ["O", "X", "[CLS]", "[SEP]", "B-test", "I-test", "B-problem", "I-problem", "B-treatment", "I-treatment"]

I also changed the training batch size to 16 because of my GPU memory.
However, when I run run_ner.py, I get the following error:

Traceback (most recent call last):
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1628, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension size must be evenly divisible by 896 but is 22528 for 'loss/Reshape_1' (op: 'Reshape') with input shapes: [2048,11], [3] and with input tensors computed as partial shapes: input[1] = [?,128,7].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_ner.py", line 630, in <module>
    tf.app.run()
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_ner.py", line 555, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2409, in train
    rendezvous.raise_errors()
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
    saving_listeners=saving_listeners
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2195, in _call_model_fn
    features, labels, mode, config)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2479, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1259, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1533, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "run_ner.py", line 412, in model_fn
    num_labels, use_one_hot_embeddings)
  File "run_ner.py", line 382, in create_model
    logits = tf.reshape(logits, [-1, FLAGS.max_seq_length, 7])
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6482, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1792, in __init__
    control_input_ops)
  File "/home/daniel/PycharmProjects/biobert/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1631, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimension size must be evenly divisible by 896 but is 22528 for 'loss/Reshape_1' (op: 'Reshape') with input shapes: [2048,11], [3] and with input tensors computed as partial shapes: input[1] = [?,128,7].

I can successfully run the NCBI-disease dataset, so is it because of the different label set?
Any idea why this is happening and how I can fix it? Any help is highly appreciated!
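
From the traceback, my guess is that the label count is hard-coded in create_model: run_ner.py line 382 reshapes to depth 7 (and 128 × 7 = 896), while my label list has 11 entries (2048 × 11 = 22528). A sketch of the fix I am assuming would work:

# run_ner.py, create_model(): replace the hard-coded label count (7) with the
# size of the label list actually returned by get_labels()
logits = tf.reshape(logits, [-1, FLAGS.max_seq_length, num_labels])

Does that look right?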

BERT Large or BERT Base?

Hello,
May I know whether BioBERT is based on BERT-Base or BERT-Large? I am sorry if you have already mentioned this information.

What Metric exactly is reported/computed for NER?

Hi,

I am reading through the code and implementation and was wondering what exactly R/P/F1 refer to. In particular: are the metrics at the token level or the entity level? Do you count only exact matches, or give some credit for partial matches as well? The paper is not very clear about what exactly these metrics mean. I'm not very familiar with TensorFlow, but it seems to me that the code computes token-level metrics? Or is the text "detokenized" and grouped by entity before evaluation runs? Are predictions only made for the first token of each word, as in BERT's CoNLL-2003 setup?

In particular, I noticed that your implementation is based on https://github.com/kyzhouhzau/BERT-NER, which mentions that its scores differ from standard evaluation (e.g. using conlleval.pl). Is this also the case for the numbers reported in the paper?

Thanks in advance for your answer!

Biobert text classification

Hello,

Thanks for providing these useful resources.

I saw that the code in run_classifier.py is the same as in the original BERT repository, so I guess running text classification with BioBERT works the same way as with the original BERT. Have you tried any biomedical text classification tasks? I'm just curious how much gain we can get from BioBERT.

BioBert for Pytorch

Hello Everyone

I tried to convert the TensorFlow checkpoint to PyTorch, as explained in this link:
https://github.com/huggingface/pytorch-pretrained-BERT#command-line-interface

I am using the following code:

import os
from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import convert_tf_checkpoint_to_pytorch

if not os.path.exists('/my_path/pytorch_model.bin'):
    convert_tf_checkpoint_to_pytorch(
        '/my_path/biobert_model.ckpt',
        '/my_path/bert_config.json',
        '/my_path/pytorch_model.bin',
    )

I got the following error:

Traceback (most recent call last):
  File "/.../pytorch_pacrr_and_posit_drmm/convert_bert_model_to_pytorch.py", line 9, in <module>
    '/home/dpappas/Downloads/F_BERT/Biobert/pubmed_pmc_470k/pytorch_model.bin'
  File "/usr/local/lib/python3.6/site-packages/pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py", line 69, in convert_tf_checkpoint_to_pytorch
    pointer = getattr(pointer, l[0])
AttributeError: 'Parameter' object has no attribute 'BERTAdam'

Could anyone help me?
Thank you in advance.
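
In case it helps: my current assumption is that the converter is walking optimizer bookkeeping variables (the BERTAdam/Adam slots stored in the checkpoint) as if they were model weights. Listing the checkpoint variables makes this visible:

import tensorflow as tf

ckpt = '/my_path/biobert_model.ckpt'
for name, shape in tf.train.list_variables(ckpt):
    # optimizer slots such as .../adam_m and .../adam_v are stored alongside
    # the actual weights; a converter needs to skip them
    print(name, shape)

I believe newer versions of the conversion script skip these variable names, which would avoid the AttributeError.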

Replication of Relation Extraction results - share finetuning parameters

Hi,
Congratulations for this really interesting work!
I am trying to replicate your results on relation extraction (so far I have tried GAD and EU-ADR) and (perhaps I am missing something) I cannot reproduce the numbers you report. I mostly vary the number of epochs, but what I see is:
a) largely varying metric values across runs;
b) even when I get a decent precision/recall/F1, the predictions look quite random when I inspect them (all probabilities are around 0.5).
Furthermore, I tried running this on another popular benchmark (CDR) and again I get the same strange behaviour.
Would it be possible to share the exact parameters you used to fine-tune your relation extraction models?
PS: I tried 1-2 months ago to modify BERT directly to do relation extraction, but was not very happy with the results.

BioBert NER only for Prediction

Hi,

I would like to use BioBERT to build an NER model that extracts biomedical concepts. I have successfully trained the model. Since I will need to process millions of PMC articles, can I turn off training and evaluation and run prediction only, to speed things up? Currently I hit:

"ValueError("At least one of do_train or do_eval must be True.")"

Thanks!

NER datasets

Hi,
Thank you for providing the preprocessed datasets for the NER task, but I have a doubt: the BIO tags in the datasets do not contain class names (e.g. B-disease); they are plain B/I/O tags without further class specification.
Is this how BioBERT was trained, or am I missing something?

biobert predicts only one class on test set

Hi.
I am performing a binary text classification task using run_classifier.py. When I look at the predicted probabilities for classes 0 and 1 (in test_results.tsv), I see that they are nearly identical across all data points and biased towards class 0. The output looks like this:

0.9994677 0.00053235615
0.9994677 0.00053235615
0.9994677 0.00053235615
0.9994677 0.00053235615
0.9994677 0.00053235615
0.9994677 0.00053235615
0.9994677 0.00053235615

This is my command:

python run_classifier.py \
     --task_name=cola \
     --do_train=true \
     --do_eval=true \
     --do_predict=true \
     --data_dir=./data_cla_abs/ \
     --vocab_file=./pretrained_weights_BIOBERT_DIR/biobert_v1.0_pubmed_pmc/biobert_v1.0_pubmed_pmc/vocab.txt \
     --bert_config_file=./pretrained_weights_BIOBERT_DIR/biobert_v1.0_pubmed_pmc/biobert_v1.0_pubmed_pmc/bert_config.json \
     --init_checkpoint=./pretrained_weights_BIOBERT_DIR/biobert_v1.0_pubmed_pmc/biobert_v1.0_pubmed_pmc/biobert_model.ckpt \
     --max_seq_length=512 \
     --train_batch_size=2 \
     --learning_rate=2e-5 \
     --num_train_epochs=5.0 \
     --output_dir=./bert_output_cla_abs/ \
     --do_lower_case=False

The data is PubMed abstracts; I am trying to classify them into 2 classes.

Happy to share more info.

Kindly help me understand the cause of this problem.

Thanks.

Retrieving embedding vector

Can anyone suggest how to get an embedding vector using BioBERT? If I give it text input, I want back an embedding vector for the sentence, or embedding vectors for the words. Either would work for me.
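
For context, here is a sketch of what I have tried so far with extract_features.py from this repo (paths are placeholders):

python extract_features.py \
     --input_file=sentences.txt \
     --output_file=embeddings.json \
     --vocab_file=$BIOBERT_DIR/vocab.txt \
     --bert_config_file=$BIOBERT_DIR/bert_config.json \
     --init_checkpoint=$BIOBERT_DIR/biobert_model.ckpt \
     --layers=-1 \
     --max_seq_length=128 \
     --batch_size=8

This writes one JSON line per input sentence with per-token activations of the chosen layer, and I assume averaging the token vectors would give a crude sentence vector. Is that the recommended approach?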

Reproducing RE results using CHEMPROT datasets

Dear authors,
Thank you very much for making your contribution open to the public. I am trying to reproduce your relation extraction results. Looking at run_re.py, inside the BioBERTChemprotProcessor class I noticed you use six class labels: the list returned by the get_labels method at line 438 is ["cpr:3", "cpr:4", "cpr:5", "cpr:6", "cpr:9", "false"]. My question is: does the "false" label stand for the no-relation case, or does it refer to another relation type such as "cpr:10"?

I obtained the CHEMPROT dataset from SciBERT [https://github.com/allenai/scibert].

What version of tensorflow/CUDA was used?

I am on TF 1.12 (Python 3.6) with the following CUDA version and driver:

biobert$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
biobert$ NVIDIA-SMI 384.130

When I run extract_features.py I get:

...tensorflow/python/client/session.py"
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

I couldn't find these details in the paper, sorry.

Pre-training for SQUAD and fine-tuning for BIOASQ

Hello,

Can you please explain how exactly one can replicate the state-of-the-art results on the BioASQ dataset? In the paper you report pre-training on SQuAD before fine-tuning on BioASQ. I tried the steps described in this repo and found the results to be quite low when not pre-trained on SQuAD, which is expected.

Is there any doc that explains the commands for doing this? I'm sorry if I missed it.

Thanks

adding additional features for QA task

Hi,
My goal is to extract contextual word embeddings from the last layer of BioBERT and add additional features before performing QA fine-tuning; please suggest how to achieve this. Also, I found that when I print final_hidden = model.get_sequence_output() to the console, the tensor's shape is printed instead of the contextual embeddings. Your suggestions are appreciated.
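
On the second point, I think I see why the shape is printed (a minimal, self-contained TF 1.x illustration, since this repo uses graph mode): printing a tensor shows the symbolic graph node, not its values; values only exist once the tensor is evaluated in a session.

import tensorflow as tf  # TF 1.x, as used by this repo

x = tf.constant([[1.0, 2.0]])
print(x)  # Tensor("Const:0", shape=(1, 2), dtype=float32) -- just the graph node
with tf.Session() as sess:
    print(sess.run(x))  # [[1. 2.]] -- the actual values

So final_hidden = model.get_sequence_output() would need to be fetched via sess.run (or via the estimator's predictions) to see the embeddings.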

Fine tuning BioBERT on CPU

Hi Folks,

I know requirements.txt specifies the GPU build of TensorFlow; however, I currently do not have access to a GPU, so I am running fine-tuning on a CPU. I am currently training NER on the NCBI-disease dataset.

The output directory /tmp/bioner has checkpoints up to step 13k so far; I started the process approximately 24 hours ago. Can anyone say how long this takes on a GPU, and if possible on a CPU as well?

I changed the training batch size from 32 to 4.

Thanks

BioBERT base model files

Hi

I have a question regarding the base BioBERT model files you have released, and some of the file sizes. I downloaded the "pubmed_pmc_470k" set and noted that the checkpoint file (biobert_model.ckpt.data-00000-of-00001) is ~1.3 GB. This looks more like the sizes seen with the large BERT models than the ~440 MB BERT-Base models (the multilingual ones appear to be about ~700 MB), while the config and vocab files match those of the smaller BERT-Base model. I was curious whether the file includes fine-tuning on one of the tasks or is really the base BioBERT; I'm pretty sure it doesn't, and it looks like the checkpoint might simply contain the optimizer parameters as well.

Could you also confirm whether you pre-trained BioBERT from scratch on the various corpora, or whether you initialised it with one of the BERT-Base models and only continued pre-training on the biomedical corpora (if so, could you please indicate exactly which one; based on the vocab I don't believe you used the multilingual model). One other thing you may wish to consider adding to the paper is the difference between the cased and uncased variants, whether you tested this, and the fact that you appear to be using the cased variant at the moment.
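
For what it's worth, here is the sketch I used to check the optimizer-parameter hunch and to write a weights-only copy of the checkpoint (my own assumption, TF 1.x):

import tensorflow as tf

ckpt = 'pubmed_pmc_470k/biobert_model.ckpt'
tf.reset_default_graph()
kept = []
for name, _ in tf.train.list_variables(ckpt):
    if 'adam_m' in name or 'adam_v' in name or name == 'global_step':
        continue  # optimizer slots / step counter, not model weights
    value = tf.train.load_variable(ckpt, name)  # numpy array with the stored dtype
    kept.append(tf.Variable(value, name=name))
saver = tf.train.Saver(kept)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, 'biobert_model_slim.ckpt')

The slim copy should come out at roughly a third of the original size, which would be consistent with two Adam slots per weight.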

Thanks again

Tony

What are the PubMed & PMC text that you've used?

I want to access the same versions of the PubMed and PMC articles that you used in pre-processing.
Are they available to download somewhere?
I've checked the NLM site, but there are lots of versions and options.

Thanks

Why does vocab not appear to be medically oriented?

I was trying to diagnose why BioBERT underperforms a standard BERT model on a PubMed task. I took a look at vocab.txt and noticed that the words are not what one would expect from PubMed word frequencies:

grep flix vocab.txt returns "Netflix" (46 PubMed occurrences) but not "infliximab" (>13,000 PubMed occurrences).

Why is this?

i2b2 dataset results

Hi, thanks for making your code and pre-trained weights publicly available.

I have a question regarding the i2b2 dataset results in Table 4 of your paper.

In your paper, the work you cite as SOTA for the i2b2 dataset (Zhu 2018) seems to predict all entity types (problem, treatment and test). However, in your code:

out_.write("%s %s-MISC %s-MISC\n"%(bert_pred['toks'][idx], ans['labels'][idx], bert_pred['labels'][idx]))

you seem to discard entity types before evaluation (everything becomes MISC), so if I understand correctly the model isn't penalised if it gets the label type wrong? Also, are you using all 3 entity types for fine-tuning on i2b2?

NER and RE Prediction for raw sentence

Thank you for the useful tools.

I saw that the prediction mode (--do_predict=true) is used for evaluation, so how can I apply the model to my own unlabeled data for inference? Could you please give me some hints?

training on Squad

Hello @jhyuklee ,
May I know what parameters you used while training on SQuAD, such as the number of epochs and the learning rate?

NER dataset

Dear BioBERT crew,
Again, thank you so much for providing this great contribution and the recent BERN. My question is: what is the purpose of the 'train.tsv' file?
It does not appear to be used in run_ner.py. Please explain.

Thanks in advance,
Ibrahim.

NER predict only mode

I would like to use an NER model I fine-tuned with

python run_ner.py \
    --do_train=true \
    --do_eval=true \
...

to make predictions on sentences without labels (and without evaluation, of course).

But

python run_ner.py \
    --do_train=false \
    --do_eval=false \
    --do_predict=true \
...

gives me error:

raise ValueError("At least one of `do_train` or `do_eval` must be True.")
ValueError: At least one of `do_train` or `do_eval` must be True.

How can I make predictions without running evaluation?
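
A workaround I am considering (not confirmed by the authors): keep the label2id.pkl produced by the earlier fine-tuning run in the output directory, and relax the flag check in run_ner.py so that --do_predict alone is accepted:

# run_ner.py, main(): allow predict-only runs (a sketch; assumes label2id.pkl
# from the earlier training run is still present in the output directory)
if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
    raise ValueError(
        "At least one of `do_train`, `do_eval` or `do_predict` must be True.")

Would that be safe, or does prediction depend on something else that only do_train/do_eval set up?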

extract_features.py

Hello,
I am trying to extract contextual embeddings by feeding train.json to extract_features.py.

For example, my input file (train.json) is as follows:

{
  "version": "BioASQ6b",
  "data": [
    {
      "title": "BioASQ6b",
      "paragraphs": [
        {
          "context": "Spermidine protects against \u03b1-synuclein neurotoxicity. As our society ages, neurodegenerative disorders like Parkinson`s disease (PD) are increasing in pandemic proportions. While mechanistic understanding of PD is advancing, a treatment.",
          "qas": [
            {
              "question": "What is the association of spermidine with \u03b1-synuclein neurotoxicity?",
              "id": "56c073fcef6e394741000020",
              "answers": [
                {
                  "text": "Spermidine protects against \u03b1-synuclein neurotoxicity",
                  "answer_start": 0
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

However, running extract_features.py should give contextual embeddings only for the words that appear in the 'question' and 'context' fields, right? (Please correct me if I am wrong.) Instead, it returns an output file with embeddings for every token in train.json, including '{', '[', ':', "qas" and metadata like "BioASQ6b". Please let me know how I can get contextual word embeddings only for the words in 'question' and 'context', so that I can feed them to my model.
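
My current workaround attempt (a sketch): since extract_features.py treats --input_file as plain text, one sequence per line (with "text_a ||| text_b" producing a sentence pair), I flatten the JSON to lines first instead of feeding the raw file:

import json

with open('train.json') as f:
    data = json.load(f)['data']

with open('inputs.txt', 'w') as out:
    for article in data:
        for para in article['paragraphs']:
            for qa in para['qas']:
                # question as text_a, context as text_b
                out.write(qa['question'] + ' ||| ' + para['context'] + '\n')

Is this the intended way to use the script, or is there a supported JSON input mode I am missing?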

word embedding vector

Hello,
May I know which text resources (corpora) you used to build the word-embedding vectors?

Thanks,
Sai Krishna

An analysis paper of BioBERT and BioELMo

(Sorry to advertise our paper here)

Hi all, we have an analysis paper, Probing Biomedical Embeddings from Language Models, where we show how biomedical domain adaptation works for contextualized embeddings using probing tasks. As expected, fine-tuned BioBERT outperforms BioELMo on biomedical NER and NLI tasks. However, as a fixed feature extractor, BioELMo appears to be superior to BioBERT in our probing tasks.

In addition, we have BioELMo available at https://github.com/Andy-jqa/bioelmo.

NER predictions from pretrained models

I'm trying to run NER on a dataset using a pretrained model. It wouldn't let me run without either --do_eval=true or --do_train=true, so I used the advice in this comment to work around it. However, prediction appears to depend on a label2id.pkl that is only generated when one of those two flags is true. My next attempt was to copy a devel.tsv from one of the preprocessed datasets into my data directory and set --do_eval=true, but I wonder if there's a better way to do this.

Further, I get this output at the end:

INFO:tensorflow:***** Running prediction*****
I1023 03:59:30.193573 140324722628480 run_ner.py:595] ***** Running prediction*****
INFO:tensorflow:  Num examples = 0
I1023 03:59:30.193796 140324722628480 run_ner.py:596]   Num examples = 0
INFO:tensorflow:  Batch size = 8
I1023 03:59:30.193949 140324722628480 run_ner.py:597]   Batch size = 8
Traceback (most recent call last):
  File "run_ner.py", line 646, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_ner.py", line 610, in main
    with open(token_path, 'r') as reader:
FileNotFoundError: [Errno 2] No such file or directory: './tmp/token_test.txt'

So I have two questions:
(a) My test.tsv is just rows of sentences: is this correct?
(b) How do I get the token_test.txt file to exist?
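
(For what it's worth on (a): the preprocessed datasets look CoNLL-style, one token per line with a tab-separated label and a blank line between sentences, so I suspect rows of raw sentences are not the expected format. My assumption of what test.tsv should look like, with dummy labels:

Famotidine	O
-	O
associated	O
delirium	O
.	O

Please correct me if this is wrong.)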

DDI

Hi,
I have seen in the code that you also implemented BioBERT for the DDI-2013 relation extraction task. Did you actually test your model on this dataset (which I can't find in your repository)? What were the results?

Thanks,
Luca

downloading datasets

I'm unable to download the datasets linked in the repo. The link takes me to gofile.me, but then I get an error when I click download. Is anyone else having trouble with this?

Pre-training vs fine-tuning?

First, thanks for publicly releasing the BioBERT model. I have read your paper "Pre-trained Language Model for Biomedical Question Answering" (https://arxiv.org/abs/1909.08229) and it is nice.

In the paper, you write:

  1. We leverage BioBERT to address these issues. To mitigate the small size of datasets, we first fine-tune BioBERT on other large-scale extractive question answering datasets, and then fine-tune it on BioASQ datasets

  2. BioBERT first pre-trained on SQuAD and then fine-tuned on BioASQ 6b obtained the best performance over other two experiments, demonstrating the effectiveness of pretraining BioBERT on SQuAD, a comprehensive and large-scale question answering corpus.

In (1) you say "fine-tuning on SQuAD and then fine-tuning on BioASQ", while in (2) you say "first pre-trained on SQuAD and then fine-tuned on BioASQ".

In general, pre-training means the BERT model is trained on unlabeled text using the masked language model and next-sentence prediction tasks, while fine-tuning means a task-specific layer is added on top of BERT and all parameters are tuned on a labeled dataset.

I would like to know whether you first pre-trained or fine-tuned BioBERT on SQuAD.

Thanks in advance. Correct me if I'm wrong.

Load Biobert pre-trained weights into Bert model with Pytorch bert hugging face run_classifier.py code

These are the steps I followed to get BioBERT working with the existing Hugging Face PyTorch BERT code.

  1. I downloaded the pre-trained weights 'biobert_pubmed_pmc.tar.gz' from the Releases page.

  2. I ran this command to convert the TF checkpoint to a PyTorch model:

python pytorch-pretrained-BERT/pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path="biobert/pubmed_pmc_470k/biobert_model.ckpt.index" --bert_config_file="biobert/pubmed_pmc_470k/bert_config.json" --pytorch_dump_path="biobert/pubmed_pmc_470k/Pytorch/biobert.model"

This created a file 'biobert.model' in the specified path.

  3. As mentioned in this link, I compressed the 'biobert.model' created above together with 'biobert/pubmed_pmc_470k/bert_config.json' into biobert_model.tar.gz.

  4. I then ran Hugging Face's run_classifier.py with the following command, using the tar.gz created above:

python pytorch-pretrained-BERT/examples/run_classifier.py --data_dir="Data/" --bert_model="biobert_model.tar.gz" --task_name="qqp" --output_dir="OutputModels/Pretrained/" --do_train --do_eval --do_lower_case

I get the error

'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte' 

in the line

tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)

Am I doing something wrong?

I just wanted to run the run_classifier.py code provided by Hugging Face with the BioBERT pre-trained weights, the same way we run BERT with it. Is there a way to do this?
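
One thing I later suspected (a sketch of a workaround, not verified): BertTokenizer.from_pretrained is being pointed at the gzipped archive, but the tokenizer only needs vocab.txt; in pytorch-pretrained-BERT the vocab path and the model archive can be given separately:

from pytorch_pretrained_bert import BertModel, BertTokenizer

# tokenizer: point at the plain vocab file (BioBERT is cased, so no lower-casing)
tokenizer = BertTokenizer.from_pretrained(
    'biobert/pubmed_pmc_470k/vocab.txt', do_lower_case=False)
# model: the .tar.gz containing pytorch_model.bin + bert_config.json
model = BertModel.from_pretrained('biobert_model.tar.gz')

run_classifier.py would need a small change to load the tokenizer and the model from different paths. Has anyone done it this way?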

Fine tune BioBERT on multiple new categories for NER task

Thank you for this amazing implementation. I'm wondering how tricky it is to extend the IOB tags to multiple new categories by fine-tuning on a specific corpus.

For example, I would like to fine-tune BioBERT on my corpus with three entity categories (a, b, c). Normally I would label my data as [B-a, I-a, B-b, I-b, B-c, I-c, O, X, [CLS], [SEP]]; however, when I tried to modify your run_ner.py accordingly, it fails (possibly the same hard-coded label count mentioned in the i2b2 issue above?).

Any suggestions on how to implement multi-category NER with BioBERT?

Extra checkpoint initialisation in run_ner.py (Minor)

Hi

It looks like the run_ner.py script has an extra call to initialise from a checkpoint (which will cause initialisation to be attempted twice, hopefully without side effects). Line 416 looks to be unnecessary.

Thanks

Tony
