GithubHelp home page GithubHelp logo

vlsp2020-relex's Introduction

Relation Extraction at VLSP 2020

(C) Copyright by Pham Quang Nhat Minh

Official code for the paper An Empirical Study of Using Pre-trained BERT Models for Vietnamese Relation Extraction Task at VLSP 2020, VLSP 2020. In the paper, we used BERT-based models for Vietnamese Relation Extraction. Our system ranked second in the VLSP 2020 shared task.

Details of the models and experimental results can be found in the following paper.

@inproceedings{pham2020empirical,
title={An Empirical Study of Using Pre-trained BERT Models for Vietnamese Relation Extraction Task at VLSP 2020},
author={Pham, Minh Quang Nhat},
booktitle={Proceedings of the 7th International Workshop on Vietnamese Language and Speech Processing},
pages={13--18},
year={2020}
}

Requirements

We tested the code with:

  • Python 3.7.10
  • Pytorch 1.8.1
  • Hugging Face Transformers 4.6.0
  • googledrivedownloader (pip install googledrivedownloader).
  • Modified version of pyvi from the forked repo for word segmentation with syllablized texts.

Evaluation Results

NOTE: We added evaluation results with PhoBERT pre-trained model. These results were not reported in the paper. Experiments with PhoBERT is quite tricky because we need to fixed word segmentation errors in which an entity mention contains syllables of another words.

We did not submit results on test-set with PhoBERT for official evaluation, so we only have evaluation results on development set with PhoBERT.

Model Pre-trained Model MACRO F1 MICRO F1
R-BERT NlpHUST/vibert4news 0.6392 0.7092
R-BERT vinai/phobert-base 0.6635 0.7228
R-BERT FPTAI/vibert 0.596 0.6736
BERT-ES NlpHUST/vibert4news 0.6439 0.7101
BERT-ES vinai/phobert-base 0.6651 0.7262
BERT-ES FPTAI/vibert 0.596 0.6736
Ensemble NlpHUST/vibert4news 0.6412 0.7108
Ensemble vinai/phobert-base 0.6687 0.7299
Ensemble FPTAI/vibert 0.6029 0.6851

On the development set, ensemble model + PhoBERT obtained the best results among models.

Data Preparation

From the original Vietnamese relation extraction dataset provided by VLSP 2020 organizers, we converted it to SemEval format. In SemEval format, a file is a list of lines and each line is tab delimitted with following information.

<label> <entity1_start_index> <entity1_end_index> <entity2_start_index> <entity2_end_index> <entity1_type>  <entity2_type> <tokenized_sentence>

An example of a sample

3	13	13	17	18	LOCATION	LOCATION	Đây là sự kiện quan trọng đối với ngành chăn nuôi gà vùng ĐBSCL nói riêng , Việt Nam nói chung .

In order to do that, we need to use VnCoreNLP for sentence segmentation and tokenization.

python data_converter.py --overwrite_output_dir --input_dir ./data/VLSP2020_RE_training --output_dir preprocessed_data/VLSP2020_RE_training

python data_converter.py --overwrite_output_dir --input_dir ./data/VLSP2020_RE_dev --output_dir preprocessed_data/VLSP2020_RE_dev

The directory preprocessed_data/VLSP2020_RE_training is then used to generate data in SemEval format.

Use the script semeval_converter.py for converting data into SemEval format.

python semeval_converter.py  --input_dir ./preprocessed_data/VLSP2020_RE_training --output_dir ./data/VLSP2020_RE_SemEvalFormat

python semeval_converter.py  --input_dir ./preprocessed_data/VLSP2020_RE_dev --output_dir ./data/VLSP2020_RE_SemEvalFormat

Now, directory ./data/VLSP2020_RE_SemEvalFormat contains necessary files for training/evaluation.

  • train.txt
  • dev.txt
  • id2label.txt

id2label.txt contains the mapping from indexes to labels.

0	AFFILIATION
1	LOCATED
2	OTHER
3	PART – WHOLE
4	PERSONAL - SOCIAL

Transforming data into word-segmented data for PhoBERT experiments is a bit tricky because we need to fix entity boundary errors. You have to install pyvi package from forked repo here. Then run the script prepare_data4phobert.py as follows.

python prepare_data4phobert.py --input_file ./data/VLSP2020_RE_SemEvalFormat/dev.txt --id2label ./data/VLSP2020_RE_SemEvalFormat/id2label.txt

python prepare_data4phobert.py --input_file ./data/VLSP2020_RE_SemEvalFormat/train.txt --id2label ./data/VLSP2020_RE_SemEvalFormat/id2label.txt

NOTE: You need to apply for VLSP 2020 - Relex dataset. See VLSP 2020 homepage for instructions.

Downloading cached feature files

Due to the copyright of VLSP 2020 Relex data, we could not share data. However, to reproduce the results in the paper, we can use cached features extracted from the original data. Download cached features using the following command.

sh download_data.sh

Training and Evaluation

BERT-ES (BERT Entity Markers + Entity Starts)

With NlpHUST/vibert4news-base-cased pre-trained model.

python run_bert_em.py --model_dir ./models/original_train_dev/bert_em_es_bert4news_maxlen_384_epochs_10 \
                      --model_name_or_path NlpHUST/vibert4news-base-cased \
                      --train_data_file ./data/cached_features/train.txt  \
                      --eval_data_file ./data/cached_features/dev.txt \
                      --id2label ./data/cached_features/id2label.txt  \
                      --train_batch_size 8 --gradient_accumulation_steps 2 \
                      --save_steps 1000 --logging_steps 1000 \
                      --model_type es --do_train --do_eval

With FPTAI/vibert-base-cased pre-trained model.

python run_bert_em.py --model_dir ./models/original_train_dev/bert_em_es_fptbert_maxlen_384_epochs_10 \
                      --model_name_or_path FPTAI/vibert-base-cased \
                      --train_data_file ./data/cached_features/train.txt  \
                      --eval_data_file ./data/cached_features/dev.txt \
                      --id2label ./data/cached_features/id2label.txt  \
                      --train_batch_size 8 --gradient_accumulation_steps 2 \
                      --save_steps 1000 --logging_steps 1000 \
                      --model_type es --do_train --do_eval

With vinai/phobert-case pre-trained models.

python run_phobert_em.py --model_dir ./models/original_train_dev/phobert_em_es_base_maxlen_384_epochs_10 \
                         --train_data_file ./data/cached_features/train-phobert.txt \
                         --eval_data_file ./data/cached_features/dev-phobert.txt \
                         --id2label ./data/cached_features/id2label.txt \
                         --train_batch_size 8 --gradient_accumulation_steps 2 \
                         --save_steps 1000 --logging_steps 1000 \
                         --model_type es --do_train --do_eval

R-BERT

With NlpHUST/vibert4news-base-cased pre-trained model.

python run_rbert.py --model_dir ./models/original_train_dev/bert_em_es_bert4news_maxlen_384_epochs_10 \
                    --model_name_or_path NlpHUST/vibert4news-base-cased \
                    --train_data_file ./data/cached_features/train.txt \
                    --eval_data_file ./data/cached_features/dev.txt \
                    --id2label ./data/cached_features/id2label.txt \
                    --train_batch_size 8 --gradient_accumulation_steps 2 \
                    --save_steps 1000 --logging_steps 1000 \
                    --do_train --do_eval

With FPTAI/vibert-base-cased pre-trained model.

python run_rbert.py --model_dir ./models/original_train_dev/rbert_fptbert_maxlen_384_epochs_10 \
                    --model_name_or_path FPTAI/vibert-base-cased \
                    --train_data_file ./data/cached_features/train.txt \
                    --eval_data_file ./data/cached_features/dev.txt \
                    --id2label ./data/cached_features/id2label.txt \
                    --train_batch_size 8 --gradient_accumulation_steps 2 \
                    --save_steps 1000 --logging_steps 1000 \
                    --do_train --do_eval

With vinai/phobert-base

python run_phobert_rbert.py --model_dir ./models/original_train_dev/phobert_rbert_base_maxlen_384_epochs_10 \
                            --train_data_file ./data/cached_features/train-phobert.txt \
                            --eval_data_file ./data/cached_features/dev-phobert.txt \
                            --id2label ./data/cached_features/id2label.txt \
                            --train_batch_size 8 --gradient_accumulation_steps 2 \
                            --save_steps 1000 --logging_steps 1000 \
                            --do_train --do_eval

Ensemble models

With NlpHUST/vibert4news-base-cased pre-trained model.

python run_ensemble.py --input_file ./data/cached_features/dev.txt

With NlpHUST/vibert4news-base-cased pre-trained model.

python run_ensemble_fptbert.py --input_file ./data/cached_features/dev.txt

With vinai/phobert-base

python python run_ensemble_phobert.py --input_file ./data/cached_features/dev-phobert.txt

vlsp2020-relex's People

Contributors

minhpqn avatar

Stargazers

Le Minh Hoang  avatar Alex Nguyen avatar

Watchers

James Cloos avatar  avatar

Forkers

itachi176

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.