honghanhh / terminology-extraction

Terminology extraction on ACTER using Transformer-based language models

Jupyter Notebook 89.79% Python 9.30% Makefile 0.01% Shell 0.01% HTML 0.04% Ruby 0.01% SCSS 0.85%
natural-language-processing terminology-extraction bert roberta xlnet distilbert camembert iate-filtering class-weighting

Terminology Extraction

We experiment with different Transformer-based pre-trained language models for the terminology extraction task, namely XLNet, BERT, DistilBERT, and RoBERTa, with additional techniques, including class weighting in order to reduce the significant imbalance in the training data, as well as rule-based term expansion and filtering. Our experiments are conducted on the ACTER dataset covering 3 languages and 3 domains. The results prove to be competitive on English and French, and the proposed approach outperforms the state of the art (SOTA) on Dutch.
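
As a rough illustration of class weighting, the sketch below computes inverse-frequency weights for token labels. The label names (`B`, `I`, `O`), the function name `inverse_frequency_weights`, the toy data, and the weighting formula are all assumptions made for illustration; this README does not state how the weights were actually derived.

```python
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(label_sequences: List[List[str]]) -> Dict[str, float]:
    """Return one weight per label: total / (num_labels * count).

    Rare labels get weights above 1, frequent labels get weights below 1.
    """
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    num_labels = len(counts)
    return {label: total / (num_labels * count) for label, count in counts.items()}


# Toy data (hypothetical): most tokens lie outside any term.
train_labels = [
    ["O", "O", "B", "I", "O", "O"],
    ["O", "O", "O", "O", "O", "B"],
]
weights = inverse_frequency_weights(train_labels)
print(weights)
```

Weights of this kind can be fed to a weighted cross-entropy loss so that mistakes on the rare term labels cost more than mistakes on the dominant `O` label.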

1. ACTER

The dataset structure and the distribution of terms per domain and per language are shown in data.exploration.ipynb.


2. Models and Architecture

2.1 Models

For each language, we examine several pretrained language models using SimpleTransformers, as listed in the following table.

Model English dataset French dataset Dutch dataset
Multilingual BERT (uncased) x x x
Multilingual BERT (cased) x x x
Monolingual English BERT (uncased) x
Monolingual English BERT (cased) x
RoBERTa x
DistilBERT (uncased) x
DistilBERT (cased) x
Multilingual DistilBERT (cased) x
XLNet x
CamemBERT x
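
For orientation, the table rows could map onto SimpleTransformers `model_type` values and Hugging Face checkpoint names roughly as follows. The checkpoint identifiers are assumptions (the table names model families, not exact checkpoints), and `MODEL_CHOICES` and `build_ner_model` are hypothetical names introduced here, not part of the repository.

```python
from typing import Dict, List, Tuple

# Assumed (model_type, checkpoint) pairs; the exact checkpoints are not stated in the README.
MODEL_CHOICES: Dict[str, Tuple[str, str]] = {
    "Multilingual BERT (uncased)": ("bert", "bert-base-multilingual-uncased"),
    "Multilingual BERT (cased)": ("bert", "bert-base-multilingual-cased"),
    "Monolingual English BERT (uncased)": ("bert", "bert-base-uncased"),
    "Monolingual English BERT (cased)": ("bert", "bert-base-cased"),
    "RoBERTa": ("roberta", "roberta-base"),
    "DistilBERT (uncased)": ("distilbert", "distilbert-base-uncased"),
    "DistilBERT (cased)": ("distilbert", "distilbert-base-cased"),
    "Multilingual DistilBERT (cased)": ("distilbert", "distilbert-base-multilingual-cased"),
    "XLNet": ("xlnet", "xlnet-base-cased"),
    "CamemBERT": ("camembert", "camembert-base"),
}


def build_ner_model(name: str, labels: List[str], use_cuda: bool = False):
    """Create a SimpleTransformers NERModel for one table row.

    Requires the optional `simpletransformers` package; it is imported lazily
    so the mapping above can be inspected without installing it.
    """
    from simpletransformers.ner import NERModel

    model_type, checkpoint = MODEL_CHOICES[name]
    return NERModel(model_type, checkpoint, labels=labels, use_cuda=use_cuda)
```

A call such as `build_ner_model("CamemBERT", labels=["B", "I", "O"])` would download the checkpoint on first use; the label set shown is again only an assumption.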

2.2 Architecture

The workflow of our implementation:

Workflow

The code inside ./core_model/ is an example of our implementation for the French dataset with CamemBERT. We recommend running it on Google Colab to take advantage of a GPU if your local machine does not have one.


3. Results

Term Results

The prediction results of the pretrained models mentioned above, for all 3 languages, are saved in the folder ./results/weighted_results/.


terminology-extraction's Issues

Reproduce the scores

Hi, thank you for sharing the code and for formulating this task as sequence labeling.
As the results show, it outperforms all the participants mentioned in the original paper.

I have cloned your code and tried to reproduce the scores you provided in the table; however, I could not reproduce such good performance. I only changed the model type and model name to bert and bert-base-uncased, then added the model arguments as mentioned in the README. Should I adjust anything that is not mentioned in the description?

Best.

Preprocessing test set

Hey Hahn,

Went through the code and found one thing worth checking :) . In preprocessing.py (in line 89) you remove the sentences that do not have terms for the train set, right? Do you do the same for the test set on which you apply the trained model? If so, I imagine the classifier would have an easier job of only predicting terms for sentences containing terms? Predictions for sentences that have no terms inside would on the other hand probably produce some false positives that would be added to the final list and decrease overall precision?
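
To make the concern concrete, here is a toy sketch of how dropping term-free sentences from the test set could inflate precision. Everything in it is hypothetical: the `B`/`I`/`O` labels, the helper names `has_term` and `precision`, the two sentences, and the predictions. It does not reproduce what preprocessing.py does and reports no real scores.

```python
from typing import List, Set, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, labels); labels assumed to be "B"/"I"/"O"


def has_term(sentence: Sentence) -> bool:
    """True if at least one token in the sentence is labeled as part of a term."""
    return any(label != "O" for label in sentence[1])


def precision(predicted: Set[str], gold: Set[str]) -> float:
    """Fraction of predicted terms that are gold terms (0.0 if nothing was predicted)."""
    return len(predicted & gold) / len(predicted) if predicted else 0.0


# Hypothetical test set: one sentence with a term, one without.
test_set: List[Sentence] = [
    (["heart", "failure", "worsened"], ["B", "I", "O"]),
    (["the", "patient", "slept"], ["O", "O", "O"]),
]
# Hypothetical model output per sentence; "patient" is a false positive.
predicted_per_sentence: List[Set[str]] = [{"heart failure"}, {"patient"}]
gold_terms = {"heart failure"}

# Evaluate on every sentence vs. only on sentences that contain a gold term.
full_predictions = set().union(*predicted_per_sentence)
filtered_predictions = set().union(
    *(pred for sent, pred in zip(test_set, predicted_per_sentence) if has_term(sent))
)

print("precision on full test set:    ", precision(full_predictions, gold_terms))
print("precision on filtered test set:", precision(filtered_predictions, gold_terms))
```

In this toy case the false positive comes from the term-free sentence, so it disappears from the evaluation as soon as that sentence is filtered out.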
