Polish RoBERTa

This repository contains pre-trained RoBERTa models for Polish as well as evaluation code for several Polish linguistic tasks. The released models were trained with the Fairseq toolkit at the National Information Processing Institute in Warsaw, Poland. We provide two models, based on the BERT base and BERT large architectures. Each model is available in two versions: one for Fairseq and one for HuggingFace Transformers.

Model | L / H / A* | Batch size | Update steps | Corpus size | Final perplexity** | Fairseq | Transformers
RoBERTa (base) | 12 / 768 / 12 | 8k | 125k | ~20GB | 3.66 | v0.9.0 | v2.9
RoBERTa (large) | 24 / 1024 / 16 | 30k | 50k | ~135GB | 2.92 | v0.9.0 | v2.9

* L - the number of encoder blocks, H - hidden size, A - the number of attention heads
** Perplexity of the best checkpoint, computed on the validation split
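
The L / H / A values above can be checked directly against a downloaded checkpoint. A minimal sketch, assuming the Transformers version of the model has been unpacked into a local directory named as in the usage example below:

from transformers import AutoConfig

# Directory name is illustrative; it should contain the model's config.json
config = AutoConfig.from_pretrained("roberta_large_transformers")
print(config.num_hidden_layers)     # L - number of encoder blocks (24 for the large model)
print(config.hidden_size)           # H - hidden size (1024 for the large model)
print(config.num_attention_heads)   # A - number of attention heads (16 for the large model)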

Getting started

How to use with Fairseq

import os
from fairseq.models.roberta import RobertaModel, RobertaHubInterface
from fairseq import hub_utils

# Path to the extracted Fairseq model archive
model_path = "roberta_large_fairseq"
loaded = hub_utils.from_pretrained(
    model_name_or_path=model_path,
    data_name_or_path=model_path,
    bpe="sentencepiece",
    sentencepiece_vocab=os.path.join(model_path, "sentencepiece.bpe.model"),
    load_checkpoint_heads=True,
    archive_map=RobertaModel.hub_models(),
    cpu=True
)
roberta = RobertaHubInterface(loaded['args'], loaded['task'], loaded['models'][0])
roberta.eval()  # disable dropout for inference
input = roberta.encode("Zażółcić gęślą jaźń.")  # tokenize into a tensor of subword ids
output = roberta.extract_features(input)        # token-level features, shape (1, seq_len, hidden_size)
print(output[0][1])                             # representation of the second token
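
The hub interface also exposes a masked language modeling helper. A minimal sketch, continuing from the snippet above and assuming the fill_mask method of RobertaHubInterface as shipped with Fairseq v0.9.0; the example sentence is only illustrative:

# Predict the top 3 completions for the masked position
predictions = roberta.fill_mask("Warszawa to <mask> Polski.", topk=3)
for filled_sentence, score, token in predictions:
    print(f"{token}: {score:.3f} -> {filled_sentence}")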

How to use with HuggingFace Transformers

import torch
from tokenizers import SentencePieceBPETokenizer
from tokenizers.processors import RobertaProcessing
from transformers import RobertaModel, AutoModel

# Path to the extracted Transformers model directory
model_dir = "roberta_large_transformers"
# Build the SentencePiece BPE tokenizer and add RoBERTa's <s> ... </s> special tokens
tokenizer = SentencePieceBPETokenizer(f"{model_dir}/vocab.json", f"{model_dir}/merges.txt")
getattr(tokenizer, "_tokenizer").post_processor = RobertaProcessing(sep=("</s>", 2), cls=("<s>", 0))
model: RobertaModel = AutoModel.from_pretrained(model_dir)

input = tokenizer.encode("Zażółcić gęślą jaźń.")
output = model(torch.tensor([input.ids]))[0]  # last hidden states, shape (1, seq_len, hidden_size)
print(output[0][1])                           # representation of the second token
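
The output above is a token-level tensor; for a single sentence-level vector, a common approach is to mean-pool it. A minimal sketch, continuing from the snippet above (mean pooling is our illustration here, not part of the released code):

# Average the token representations into one sentence embedding
sentence_embedding = output.mean(dim=1)  # shape: (1, hidden_size)
print(sentence_embedding.shape)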

Evaluation

To replicate our experiments, first download the required datasets using the download_data.py script:

python download_data.py

Next, run the run_tasks.py script to prepare the data, fine-tune the model, and evaluate it. We used the following parameters for each task (a note on the --resample option follows the list of commands):

python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks KLEJ-NKJP --fp16 True --max-sentences 8 --update-freq 2
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks KLEJ-CDS-E --fp16 True --max-sentences 8 --update-freq 2
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks KLEJ-CDS-R --fp16 True --max-sentences 8 --update-freq 2
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 1 --tasks KLEJ-CBD --fp16 True --max-sentences 8 --update-freq 4 --resample 0:0.75,1:3
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks KLEJ-POLEMO-IN --fp16 True --max-sentences 8 --update-freq 2
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks KLEJ-POLEMO-OUT --fp16 True --max-sentences 8 --update-freq 2
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks KLEJ-DYK --fp16 True --max-sentences 8 --update-freq 4 --resample 0:1,1:3
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks KLEJ-PSC --fp16 True --max-sentences 8 --update-freq 4 --resample 0:1,1:3
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks KLEJ-ECR --fp16 True --max-sentences 8 --update-freq 2
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks 8TAGS --fp16 True --max-sentences 8 --update-freq 2
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks SICK-E --fp16 True --max-sentences 8 --update-freq 2
python run_tasks.py --arch roberta_base --model_dir roberta_base_fairseq --train-epochs 10 --tasks SICK-R --fp16 True --max-sentences 8 --update-freq 2
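
The --resample value above can be read as mapping a class label to a sampling multiplier, e.g. 0:0.75,1:3 keeps roughly 75% of class-0 examples and repeats class-1 examples three times, which counteracts label imbalance in tasks such as KLEJ-CBD, KLEJ-DYK and KLEJ-PSC. The sketch below illustrates this kind of per-class resampling under that assumption; the helper is hypothetical and not taken from run_tasks.py:

import random

def resample(examples, factors, seed=42):
    # examples: list of (text, label) pairs; factors: label -> sampling multiplier,
    # e.g. {0: 0.75, 1: 3.0} for --resample 0:0.75,1:3 (hypothetical helper)
    rng = random.Random(seed)
    resampled = []
    for example in examples:
        factor = factors.get(example[1], 1.0)
        copies = int(factor)                  # whole repetitions
        if rng.random() < factor - copies:    # fractional part kept probabilistically
            copies += 1
        resampled.extend([example] * copies)
    rng.shuffle(resampled)
    return resampled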

Evaluation results on KLEJ Benchmark

Below we show the evaluation results of our models on the tasks included in the KLEJ Benchmark. We fine-tuned each model five times per task. Scores for the individual runs, together with their averages, are presented in Table 1 and Table 2.

Run NKJP CDSC‑E CDSC‑R CBD PolEmo‑IN PolEmo‑OUT DYK PSC AR Avg
1 93.15 93.30 94.26 66.67 91.97 78.74 66.86 98.63 87.75 85.70
2 93.93 94.20 93.94 68.16 91.83 75.91 65.93 98.77 87.93 85.62
3 94.22 94.20 94.04 69.23 90.17 76.92 65.69 99.24 87.76 85.72
4 93.97 94.70 93.98 63.81 90.44 76.32 65.18 99.39 87.58 85.04
5 93.63 94.00 93.96 65.95 90.58 74.09 65.92 98.48 87.08 84.85
Avg 93.78 94.08 94.04 66.77 91.00 76.40 65.92 98.90 87.62 85.39

Table 1. KLEJ results for RoBERTa base model

Run NKJP CDSC‑E CDSC‑R CBD PolEmo‑IN PolEmo‑OUT DYK PSC AR Avg
1 94.31 93.50 94.63 72.39 92.80 80.54 71.87 98.63 88.82 87.50
2 95.14 93.90 94.93 69.82 92.80 82.59 73.39 98.94 88.96 87.83
3 95.24 93.30 94.61 71.59 91.41 82.19 75.35 98.64 89.31 87.96
4 94.46 93.20 94.96 71.08 92.80 82.39 70.59 99.09 88.60 87.46
5 94.46 93.00 94.82 69.83 92.11 83.00 74.85 98.79 88.65 87.72
Avg 94.72 93.38 94.79 70.94 92.38 82.14 73.21 98.82 88.87 87.69

Table 2. KLEJ results for RoBERTa large model

Evaluation results on other tasks

Task | Task type | Metric | Base model | Large model
SICK-E | Textual entailment | Accuracy | 86.13 | 87.67
SICK-R | Semantic relatedness | Spearman correlation | 82.26 | 85.63
PolEval 2018 - NER | Named entity recognition | F1 score (exact match) | 87.94 | 89.98
8TAGS | Multi-class classification | Accuracy | 77.22 | 80.84
