efemeryds / offensive-language-detection

A basic approach to offensive language detection, with checklist tests

Python 0.36% Jupyter Notebook 99.38% PowerShell 0.22% Batchfile 0.04%

offensive-language-detection's Issues

BONUS

Develop 2 new diagnostic tests (you can use checklist): describe what they test, explain why
they are relevant and implement them. Run the tests and describe your observations. Provide
examples of difficult cases, that is, when the model fails to assign the correct label. Discuss
potential sources of errors and propose improvements to the model.
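One candidate diagnostic is an invariance test: apply a label-preserving perturbation and check that the prediction does not change. The sketch below is written in plain Python rather than with the checklist library, and uses upper-casing as the perturbation (possibly relevant because `bert-base-cased` is case-sensitive). The names `invariance_test` and `toy_predict` are hypothetical; `toy_predict` is a keyword classifier standing in for the fine-tuned model.

```python
def invariance_test(predict, texts, perturb):
    """Return the cases whose predicted label changes under `perturb`.

    `predict` maps one text to a label; `perturb` maps a text to a
    variant that should keep the same gold label.
    """
    failures = []
    for text in texts:
        perturbed = perturb(text)
        before, after = predict(text), predict(perturbed)
        if before != after:
            failures.append((text, perturbed, before, after))
    return failures


def toy_predict(text):
    # Hypothetical stand-in for the real model: a case-sensitive keyword match.
    return 1 if "idiot" in text else 0


texts = ["you are an idiot", "have a nice day"]
failures = invariance_test(toy_predict, texts, str.upper)

for original, perturbed, before, after in failures:
    print(f"{original!r} -> {perturbed!r}: label {before} became {after}")
```

With a real model, `predict` would wrap the model's own prediction call, and the failing cases collected here are natural candidates for the "difficult cases" discussion.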

PART A - 1. Class distributions

Load the training set (olid-train.csv) and analyze the number of instances for each of the two classification labels.
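A minimal sketch of the label count, using only the standard library. It assumes the CSV has a header row and a label column named `labels`; that column name is an assumption, so adjust it to the actual file. A toy list stands in for the real data so the snippet runs without the file.

```python
import csv
from collections import Counter


def load_labels(path, label_column="labels"):
    # Assumption: the file has a header row and a column called `labels`.
    with open(path, newline="", encoding="utf-8") as f:
        return [row[label_column] for row in csv.DictReader(f)]


def class_distribution(labels):
    """Return {label: (count, share of all instances)}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Toy labels standing in for load_labels("olid-train.csv")
toy_labels = ["0", "0", "1", "0"]
distribution = class_distribution(toy_labels)

for label, (count, share) in sorted(distribution.items()):
    print(f"label {label}: {count} instances ({share:.0%})")
```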

PART A - 4. Inspect the tokenization of the OLIDv1 training set using BERT's tokenizer
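One way to approach the inspection is to collect simple statistics over the tokenized texts: sequence lengths and the most frequent word-internal pieces. The sketch below takes any `tokenize` callable; `tokenization_stats` and `toy_tokenize` are hypothetical names, and the toy tokenizer only imitates the `##` continuation marker so that the snippet runs offline.

```python
from collections import Counter


def tokenization_stats(texts, tokenize):
    """Summarise how `tokenize` splits each text into subword tokens."""
    lengths = []
    continuation_pieces = Counter()
    for text in texts:
        tokens = tokenize(text)
        lengths.append(len(tokens))
        # WordPiece marks word-internal pieces with a leading "##".
        continuation_pieces.update(t for t in tokens if t.startswith("##"))
    return {
        "num_texts": len(lengths),
        "max_tokens": max(lengths, default=0),
        "mean_tokens": sum(lengths) / len(lengths) if lengths else 0.0,
        "top_continuations": continuation_pieces.most_common(5),
    }


def toy_tokenize(text):
    # Hypothetical stand-in: split on whitespace, break long words in two.
    tokens = []
    for word in text.split():
        if len(word) > 4:
            tokens.extend([word[:4], "##" + word[4:]])
        else:
            tokens.append(word)
    return tokens


# With the real tokenizer (needs `transformers` and a model download):
#   from transformers import BertTokenizer
#   tokenize = BertTokenizer.from_pretrained("bert-base-cased").tokenize
stats = tokenization_stats(["hello world ok", "hi"], toy_tokenize)
print(stats)
```

Texts that split into many `##` pieces often contain rare words, misspellings or hashtags, and are worth reading by hand.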

PART A - 3. Classification by fine-tuning BERT

Run your notebook on Colab, which offers (limited) free access to GPUs.
You need to enable GPUs for the notebook:

● navigate to Edit → Notebook Settings
● select GPU from the Hardware Accelerator drop-down

➢ Install the simpletransformers library: !pip install simpletransformers
(you will have to restart your runtime after the installation)

➢ Follow the documentation to load a pre-trained BERT model: ClassificationModel('bert',
'bert-base-cased')

➢ Fine-tune the model on the OLIDv1 training set and make predictions on the OLIDv1 test
set (you can use the default hyperparameters). Do not forget to save your model, so that
you do not need to fine-tune the model each time you make predictions.
If you cannot fine-tune your own model, contact us to receive a checkpoint.
a. Report precision, recall and F1-score on the test set, and provide a confusion matrix (2 points).

Compare your results to the baselines and to the results described in the paper in 2–4 sentences.
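The steps above can be sketched as follows. `fine_tune_and_predict` assumes that `train_df` is a pandas DataFrame with `text` and integer `labels` columns, that a GPU is available, and that checkpoints go to an `outputs/` directory. Those are assumptions, not details given in the assignment. `binary_report` is a hypothetical helper that computes the requested metrics in plain Python, shown here on toy labels.

```python
def fine_tune_and_predict(train_df, test_texts, output_dir="outputs/"):
    """Fine-tune bert-base-cased with default hyperparameters and predict.

    Assumption: train_df has `text` and integer `labels` columns.
    """
    # Imported lazily so the rest of this sketch runs without the library.
    from simpletransformers.classification import ClassificationModel

    model = ClassificationModel(
        "bert",
        "bert-base-cased",
        args={"output_dir": output_dir, "overwrite_output_dir": True},
    )
    model.train_model(train_df)  # the fine-tuned model is saved to output_dir
    # To reuse it later without retraining: ClassificationModel("bert", output_dir)
    predictions, _raw_outputs = model.predict(list(test_texts))
    return list(predictions)


def binary_report(gold, predicted, positive=1):
    """Precision, recall and F1 for `positive`, plus a 2x2 confusion matrix."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are gold labels, columns are predictions: [[tn, fp], [fn, tp]]
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy gold labels and predictions standing in for the test set
gold = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
report = binary_report(gold, predicted)
print({k: round(v, 2) if isinstance(v, float) else v for k, v in report.items()})
```

Calling `binary_report` once per class (`positive=1` and `positive=0`) gives per-class figures that can be averaged for a macro score.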

PART A - 2. Baselines

Calculate two baselines and evaluate their performance on the test set (olid-test.csv):

● The first baseline is a random baseline that randomly assigns one of the 2 classification
labels.

● The second baseline is a majority baseline that always assigns the majority class.

Calculate the results on the test set and fill them into the two tables below. Round the results to
two decimal places.
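Both baselines can be sketched in a few lines of plain Python. The sketch assumes the majority class is taken from the training labels and fixes a seed for the random baseline so results are reproducible. The helper names and the toy labels are illustrative only.

```python
import random
from collections import Counter


def random_baseline(n, labels, seed=0):
    """Assign one of `labels` uniformly at random to each of n instances."""
    rng = random.Random(seed)  # fixed seed: an assumption, for reproducibility
    return [rng.choice(labels) for _ in range(n)]


def majority_baseline(train_labels, n):
    """Assign the most frequent training label to each of n instances."""
    majority_label, _count = Counter(train_labels).most_common(1)[0]
    return [majority_label] * n


def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy labels standing in for olid-train.csv and olid-test.csv
train_labels = [0, 0, 0, 1]
test_gold = [0, 1, 0, 1, 0]

random_predictions = random_baseline(len(test_gold), labels=[0, 1])
majority_predictions = majority_baseline(train_labels, len(test_gold))

print("random accuracy:", round(accuracy(test_gold, random_predictions), 2))
print("majority accuracy:", round(accuracy(test_gold, majority_predictions), 2))
```

Note that the majority baseline never predicts the minority class, so its precision, recall and F1 for that class are all zero; this is why accuracy alone can be misleading on imbalanced data.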

PART B - 6. Negation

PART 0 - Read slides and articles from week 3

PART B - 7. Creating negated examples

PART 0 - Read the article about dataset collection

PART B - 5. Typos
