honghanhh / terminology-extraction

Terminology extraction on ACTER using Transformer-based language models


Terminology Extraction

We experiment with several Transformer-based pre-trained language models for the terminology extraction task, namely XLNet, BERT, DistilBERT, and RoBERTa. We combine them with additional techniques: class weighting, to reduce the significant class imbalance in the training data, and rule-based term expansion and filtering. Our experiments are conducted on the ACTER dataset, which covers 3 languages and 3 domains. The results are competitive on English and French, and the proposed approach outperforms the state of the art (SOTA) on Dutch.
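To illustrate the class weighting idea, here is a minimal sketch that derives per-label weights from label frequencies. It assumes a B/I/O tag set and the common "balanced" heuristic (total tokens divided by the number of classes times the label count); both are assumptions made for illustration, and the repository may compute its weights differently.

```python
from collections import Counter


def balanced_class_weights(label_sequences):
    """Return {label: weight}, where weight = n_tokens / (n_classes * count).

    Rare labels get large weights, frequent labels get small ones.
    """
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Toy data (hypothetical): most tokens are outside any term ("O").
toy_labels = [
    ["O", "O", "B", "I", "O", "O"],
    ["O", "O", "O", "O", "B", "O"],
]
weights = balanced_class_weights(toy_labels)
for label in sorted(weights):
    print(label, round(weights[label], 3))
```

The resulting weights could then be supplied to the training loss so that errors on the rare term labels cost more than errors on the dominant non-term label.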

1. ACTER

The dataset structure and the distribution of terms per domain and per language are shown in data.exploration.ipynb.

2. Models and Architecture

2.1 Models

For each language, we examine several pretrained language models using SimpleTransformers, as listed in the following table.

| Model | English dataset | French dataset | Dutch dataset |
|---|---|---|---|
| Multilingual BERT (uncased) | x | x | x |
| Multilingual BERT (cased) | x | x | x |
| Monolingual English BERT (uncased) | x | | |
| Monolingual English BERT (cased) | x | | |
| RoBERTa | x | | |
| DistilBERT (uncased) | x | | |
| DistilBERT (cased) | x | | |
| Multilingual DistilBERT (cased) | | | x |
| XLNet | x | | |
| CamemBERT | | x | |
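The table can be written down as a small configuration, sketched below. The checkpoint names are assumed Hugging Face identifiers for the base-size models; the README does not state which checkpoints were used, so treat them as placeholders. `build_model` is a hypothetical helper that imports SimpleTransformers lazily, so the rest of the snippet runs without it installed.

```python
# (model_type, model_name) pairs per language, mirroring the table above.
# Checkpoint names are assumptions, not values confirmed by the repository.
MULTILINGUAL_BERT = [
    ("bert", "bert-base-multilingual-uncased"),
    ("bert", "bert-base-multilingual-cased"),
]

CANDIDATES = {
    "en": MULTILINGUAL_BERT + [
        ("bert", "bert-base-uncased"),
        ("bert", "bert-base-cased"),
        ("roberta", "roberta-base"),
        ("distilbert", "distilbert-base-uncased"),
        ("distilbert", "distilbert-base-cased"),
        ("xlnet", "xlnet-base-cased"),
    ],
    "fr": MULTILINGUAL_BERT + [("camembert", "camembert-base")],
    "nl": MULTILINGUAL_BERT + [("distilbert", "distilbert-base-multilingual-cased")],
}


def candidates_for(language):
    """Return the (model_type, model_name) pairs to try for a language code."""
    return list(CANDIDATES[language])


def build_model(model_type, model_name, labels, use_cuda=False):
    """Hypothetical helper: create a SimpleTransformers token classifier."""
    from simpletransformers.ner import NERModel  # imported only when called

    return NERModel(model_type, model_name, labels=labels, use_cuda=use_cuda)


for model_type, model_name in candidates_for("fr"):
    print(model_type, model_name)
```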

2.2 Architecture

The workflow of our implementation:

The code inside ./core_model/ is an example of our implementation for the French dataset with CamemBERT. We recommend running it on Google Colab to take advantage of a GPU if your local machine does not have one.

3. Results

The predictions of all the pretrained models mentioned above, for the 3 languages, are saved in the folder ./results/weighted_results/.



terminology-extraction's Issues

Apply BERT on ACTER dataset

Reimplement the baseline of TermEval 2020's ATE
Ref: TALN-LS2N System for Automatic Term Extraction
- BERT and its variants

Reproduce the scores

Hi, I want to thank you for sharing the code and for formulating this task as sequence labeling.
As the results show, it outperforms all the participants mentioned in the original paper.

I have cloned your code and tried to reproduce the scores you provided in the table; however, I could not reproduce such good performance. I only changed the model type and model to bert and bert-base-uncased, then added the model arguments as you mentioned in the README. Should I adjust anything that you didn't mention in the description?

Best.
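For readers unfamiliar with the sequence labeling formulation mentioned in this issue, here is a minimal sketch of turning a list of annotated terms into token-level labels. The B/I/O scheme, whitespace tokenization, case-insensitive matching, and the `bio_tag` helper are all assumptions for illustration; the repository's preprocessing may differ.

```python
def bio_tag(tokens, terms):
    """Label each token as B (term start), I (inside a term), or O (outside)."""
    lowered = [token.lower() for token in tokens]
    labels = ["O"] * len(tokens)
    # Match longer terms first so "heart failure" wins over "failure".
    term_words = sorted(
        (term.lower().split() for term in terms), key=len, reverse=True
    )
    for words in term_words:
        n = len(words)
        if n == 0:
            continue  # ignore empty term strings
        for i in range(len(tokens) - n + 1):
            span_is_free = all(label == "O" for label in labels[i:i + n])
            if lowered[i:i + n] == words and span_is_free:
                labels[i] = "B"
                labels[i + 1:i + n] = ["I"] * (n - 1)
    return labels


tokens = ["Patients", "with", "heart", "failure", "received", "treatment"]
terms = {"heart failure", "failure", "treatment"}
labels = bio_tag(tokens, terms)
print(list(zip(tokens, labels)))
```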

Preprocessing test set

Hey Hahn,

Went through the code and found one thing worth checking :) . In preprocessing.py (in line 89) you remove the sentences that do not have terms for the train set, right? Do you do the same for the test set on which you apply the trained model? If so, I imagine the classifier would have an easier job of only predicting terms for sentences containing terms? Predictions for sentences that have no terms inside would on the other hand probably produce some false positives that would be added to the final list and decrease overall precision?
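The concern raised here can be made concrete with a toy calculation. The sketch below uses invented per-sentence gold and predicted term sets (not ACTER data) and compares scores when sentences without gold terms are dropped from the test set versus kept. The helper names are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 over extracted terms."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented examples: (gold terms, predicted terms) for each test sentence.
sentences = [
    ({"heart failure"}, {"heart failure"}),
    ({"dressage"}, {"dressage", "horse"}),
    (set(), {"result"}),  # no gold terms: any prediction is a false positive
    (set(), set()),
]


def evaluate(sentences, drop_empty):
    """Pool terms over sentences, optionally skipping those with no gold terms."""
    gold, predicted = set(), set()
    for gold_terms, predicted_terms in sentences:
        if drop_empty and not gold_terms:
            continue
        gold |= gold_terms
        predicted |= predicted_terms
    return precision_recall_f1(predicted, gold)


filtered = evaluate(sentences, drop_empty=True)
full = evaluate(sentences, drop_empty=False)
print("filtered test set:", filtered)
print("full test set:    ", full)
```

In this toy setup recall is unchanged, while precision is lower on the full test set because the false positive from the term-free sentence is counted.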

Analyze ACTER dataset

Dataset: ACTER
- Languages
- Domains
- Term Distributions/Frequencies/Ratio
  ...
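A minimal sketch of the kind of analysis listed above, counting unique terms per language and domain and turning the counts into ratios. The records, the domain codes, and the helper names are invented for illustration; loading the actual ACTER annotation files is left out because their layout is not described here.

```python
from collections import Counter

# Hypothetical (language, domain, term) records; not real ACTER data.
records = [
    ("en", "corp", "bribery"),
    ("en", "corp", "fraud"),
    ("en", "corp", "bribery"),  # duplicate, counted once
    ("fr", "corp", "corruption"),
    ("nl", "wind", "windturbine"),
]


def term_counts(records):
    """Count unique terms for each (language, domain) pair."""
    unique_records = set(records)
    return Counter((language, domain) for language, domain, _ in unique_records)


def term_ratios(counts):
    """Share of all unique terms that falls in each (language, domain) pair."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


counts = term_counts(records)
ratios = term_ratios(counts)
for key in sorted(counts):
    print(key, counts[key], ratios[key])
```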

Paper summary

Task 1: Read papers

Evaluation metrics
