Analysis of genre prediction in CLASSLA-web and MaCoCu corpora

The final test corpus, which consists of all manually evaluated corpora, is manual-annotations/multilingual-genre-annotated-test-set.json.

It is a JSON file with the languages as keys. Each language item consists of the following values:

  • accuracy, micro_f1, macro_f1: the evaluation results as accuracy, micro F1 and macro F1
  • label_scores: per-label F1 scores
  • dataset: the test dataset. The dataset can be opened as a pandas dataframe: pd.DataFrame(json_dict["lang"]["dataset"]). It includes:
      ◦ the automatically predicted labels (y_pred)
      ◦ the manually evaluated labels (y_true)
      ◦ text
      ◦ translation (the English translation which was used for manual evaluation)
      ◦ text_id (the same as in the original MaCoCu or CLASSLA-web corpora)
      ◦ metadata
      ◦ the list of tokens (tokens) and their token ids (token_ids)
  • token_overlap: includes the following keys:
      ◦ overlap_percentage: percentage of overlap with the train dataset (X-GENRE-train)
      ◦ token_list: list of all tokens from the dataset
      ◦ overlap_token_list: list of all tokens that appear in both the train dataset and the test set
      ◦ label_level_token_lists: lists of all tokens from the dataset for each of the labels
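
For instance, the file can be loaded and one language's results inspected as follows (a minimal sketch; the key names follow the structure described above, and the first language key is picked just for illustration):

```python
import json

import pandas as pd

# Load the combined multilingual test set
with open("manual-annotations/multilingual-genre-annotated-test-set.json") as f:
    json_dict = json.load(f)

# Pick one of the language keys for illustration
lang = list(json_dict.keys())[0]

# Evaluation results for this language
print(json_dict[lang]["accuracy"], json_dict[lang]["micro_f1"], json_dict[lang]["macro_f1"])
print(json_dict[lang]["label_scores"])  # per-label F1 scores

# The test dataset itself, opened as a pandas dataframe
df = pd.DataFrame(json_dict[lang]["dataset"])
print(df[["text_id", "y_true", "y_pred"]].head())

# Token overlap with the X-GENRE train set
print(json_dict[lang]["token_overlap"]["overlap_percentage"])
```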

Automatically annotated MaCoCu corpora

For automatic annotation, we only annotate texts that are at least 75 words long. Furthermore, due to the model's limitations, we only take the first 512 words of each text (the model's max_sequence_length is 512 tokens).
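
As an illustration of these two conditions (not the actual code from 1-select_for_xgenre.py), the selection step could look roughly like this, assuming the corpus is loaded as a pandas dataframe with a text column:

```python
import pandas as pd

MIN_WORDS = 75   # texts shorter than this are not annotated
MAX_WORDS = 512  # the model's max_sequence_length is 512 tokens

def select_and_truncate(corpus: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    words = corpus[text_col].str.split()
    # Keep only texts that are at least 75 words long
    mask = words.str.len() >= MIN_WORDS
    selected = corpus[mask].copy()
    # Keep only the first 512 words of each remaining text
    selected[text_col] = words[mask].apply(lambda w: " ".join(w[:MAX_WORDS]))
    return selected
```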

See the spreadsheet CLASSLA genres for all sizes and genre distributions.

We use the scripts 1-select_for_xgenre.py and 2-predict_extended.py in /cache/tajak/macocu-mt/. The automatically annotated datasets are saved in /cache/tajak/macocu-mt/datasets/annotated.

See the code in copied-code-from-cache and the README.md file in the same folder for all documentation on how the corpora were annotated and the samples were prepared. After annotation, the following steps were done:

  • Evaluate the sample after annotation: /home/tajak/Crosslingual-Genre-Bias-Analysis/evaluation-of-annotation.ipynb
  • If needed, annotate additional instances: see the section Add additional instances to the sample to achieve 10 instances per label in /cache/tajak/macocu-mt/analyze-entire-file-prepare-sample.ipynb to prepare additional instances; and the section Add additionally annotated instances in /home/tajak/Crosslingual-Genre-Bias-Analysis/evaluation-of-annotation.ipynb to merge them with the initial sample and get the final evaluations.

Output:

  • manual-annotations/MaCoCu-{lang}-genre-sample-evaluated-complete-sample.jsonl (or MaCoCu-{lang}-genre-sample-evaluated-complete-sample-run2.jsonl in case there were two rounds of adding additional instances) - manually evaluated samples, i.e. the initial samples plus the additional instances (so that each label was evaluated on 10 instances). "Multiple texts" and "Other" texts are marked but included in the sample - use the code in /home/tajak/Crosslingual-Genre-Bias-Analysis/evaluation-of-annotation.ipynb to discard them and evaluate the rest.
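
A minimal sketch of how such a file can be loaded and the marked texts discarded (assuming the manually evaluated label is stored in a y_true column, as in the final test set; the language code in the file name is just an example):

```python
import pandas as pd

path = "manual-annotations/MaCoCu-sq-genre-sample-evaluated-complete-sample.jsonl"
sample = pd.read_json(path, lines=True)

# Discard texts marked as problematic before computing the evaluation metrics
discard = ["Multiple texts", "Other", "Incomprehensible"]
clean = sample[~sample["y_true"].isin(discard)]
print(clean["y_true"].value_counts())
```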

Genre sample preparation

The genre samples are prepared in such a way that we take 10 random instances of each genre from the entire corpus. We do not include the genres "Other" and "Mix", as they are not informative - the final sample size is 80 instances. If the manual evaluation showed that some instances had to be discarded, because they were "Multiple texts", "Incomprehensible" or would belong to the category "Other", we annotated additional instances so that each label was annotated on 10 instances.
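
A sketch of this sampling step, assuming the automatically annotated corpus is a dataframe whose predicted genre is stored in a y_pred column (the column name and the random seed are illustrative):

```python
import pandas as pd

def sample_per_genre(corpus: pd.DataFrame, n: int = 10, seed: int = 42) -> pd.DataFrame:
    """Take n random instances of each predicted genre, skipping "Other" and "Mix"."""
    kept = corpus[~corpus["y_pred"].isin(["Other", "Mix"])]
    return (
        kept.groupby("y_pred", group_keys=False)
            .apply(lambda g: g.sample(n=min(n, len(g)), random_state=seed))
    )
```

With the 8 remaining genres, this yields the 80-instance sample described above.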

Then we translate the source texts in the sample to English using Google Translate. Google Translate provides models for all of our languages except Montenegrin, for which we used the Serbian MT model.
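
For illustration, the translation step could look roughly like the sketch below; it assumes the deep-translator package as an interface to Google Translate, which is an assumption rather than the tool actually used for this project:

```python
# Assumption: deep-translator as a Google Translate interface; not necessarily what was used here
from deep_translator import GoogleTranslator

def translate_to_english(texts, source_lang="auto"):
    # For Montenegrin there is no dedicated model, so the Serbian code ("sr") can be passed
    translator = GoogleTranslator(source=source_lang, target="en")
    return [translator.translate(text) for text in texts]
```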

Analyze the results

Results

Dataset Macro F1 Micro F1
MaCoCu-uk 0.948 0.950
CLASSLA-web.sl 0.936 0.938
CLASSLA-web.mk 0.923 0.925
MaCoCu-tr 0.899 0.900
CLASSLA-web.hr 0.883 0.887
MaCoCu-sq 0.854 0.850
MaCoCu-el 0.844 0.850
MaCoCu-ca 0.827 0.825
MaCoCu-is 0.795 0.800
MaCoCu-mt 0.488 0.557

Per-label F1 scores per language (the last two columns are the average over all languages and the average without Maltese):

label mt el tr sq is uk ca mk hr sl average average-without-mt
News 0.69 0.9 0.95 0.89 0.73 1 0.82 0.91 0.95 0.9 0.87 0.89
Opinion/Argumentation 0.33 0.87 0.82 0.67 0.82 0.91 0.84 0.78 0.78 0.82 0.76 0.81
Instruction 0.69 0.71 0.9 0.95 0.78 0.95 0.75 1 0.75 1 0.85 0.86
Information/Explanation 0.52 0.7 0.82 0.8 0.5 1 0.72 0.84 0.95 0.9 0.78 0.8
Promotion 0.82 0.62 0.86 0.95 0.75 0.78 0.78 0.95 0.87 1 0.84 0.84
Forum 0.18 0.95 0.89 0.78 0.95 0.95 0.84 1 0.91 1 0.84 0.91
Prose/Lyrical 0.18 1 0.95 0.86 1 1 0.91 0.95 0.86 0.91 0.86 0.93
Legal nan 1 1 0.95 0.84 1 0.95 0.95 1 0.95 0.96 0.96
Macro F1 0.49 0.84 0.9 0.85 0.8 0.95 0.83 0.92 0.88 0.94 0.84 0.87
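
The per-corpus tables below resemble the output of scikit-learn's classification_report; a report of this kind can be reproduced from the test set roughly as follows (a sketch using the y_true/y_pred columns described above):

```python
import json

import pandas as pd
from sklearn.metrics import classification_report

with open("manual-annotations/multilingual-genre-annotated-test-set.json") as f:
    json_dict = json.load(f)

lang = list(json_dict.keys())[0]  # one of the language keys, for illustration
df = pd.DataFrame(json_dict[lang]["dataset"])

# Per-label precision, recall, F1 and support, as in the tables below
print(classification_report(df["y_true"], df["y_pred"], digits=3))
```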

CLASSLA-web.sl

Macro f1: 0.936, Micro f1: 0.938, Accuracy: 0.938

precision recall f1-score support
Forum 1 1 1 10
Information/Explanation 0.9 0.9 0.9 10
Instruction 1 1 1 10
Legal 1 0.909091 0.952381 11
News 0.9 0.9 0.9 10
Opinion/Argumentation 0.7 1 0.823529 7
Promotion 1 1 1 10
Prose/Lyrical 1 0.833333 0.909091 12

CLASSLA-web.hr

Macro f1: 0.883, Micro f1: 0.887, Accuracy: 0.887

precision recall f1-score support
Forum 1 0.833333 0.909091 12
Information/Explanation 1 0.909091 0.952381 11
Instruction 0.6 1 0.75 6
Legal 1 1 1 10
News 0.9 1 0.947368 9
Opinion/Argumentation 0.7 0.875 0.777778 8
Promotion 1 0.769231 0.869565 13
Prose/Lyrical 0.9 0.818182 0.857143 11

CLASSLA-web.mk

Macro f1: 0.923, Micro f1: 0.925, Accuracy: 0.925

precision recall f1-score support
Forum 1 1 1 10
Information/Explanation 0.8 0.888889 0.842105 9
Instruction 1 1 1 10
Legal 0.9 1 0.947368 9
News 1 0.833333 0.909091 12
Opinion/Argumentation 0.7 0.875 0.777778 8
Promotion 1 0.909091 0.952381 11
Prose/Lyrical 1 0.909091 0.952381 11

MaCoCu-mt

Macro f1: 0.488, Micro f1: 0.557, Accuracy: 0.557

precision recall f1-score support
Forum 0.1 1 0.181818 1
Information/Explanation 0.6 0.461538 0.521739 13
Instruction 1 0.526316 0.689655 19
News 0.9 0.5625 0.692308 16
Opinion/Argumentation 0.3 0.375 0.333333 8
Promotion 0.9 0.75 0.818182 12
Prose/Lyrical 0.1 1 0.181818 1

MaCoCu-sq

Macro f1: 0.854, Micro f1: 0.85, Accuracy: 0.85

precision recall f1-score support
Forum 0.7 0.875 0.777778 8
Information/Explanation 1 0.666667 0.8 15
Instruction 0.9 1 0.947368 9
Legal 0.9 1 0.947368 9
News 0.8 1 0.888889 8
Opinion/Argumentation 0.7 0.636364 0.666667 11
Promotion 0.9 1 0.947368 9
Prose/Lyrical 0.9 0.818182 0.857143 11

MaCoCu-is

Macro f1: 0.795, Micro f1: 0.8, Accuracy: 0.8

precision recall f1-score support
Forum 0.9 1 0.947368 9
Information/Explanation 0.4 0.666667 0.5 6
Instruction 0.7 0.875 0.777778 8
Legal 0.8 0.888889 0.842105 9
News 0.8 0.666667 0.727273 12
Opinion/Argumentation 0.9 0.75 0.818182 12
Promotion 0.9 0.642857 0.75 14
Prose/Lyrical 1 1 1 10

MaCoCu-el

Macro f1: 0.844, Micro f1: 0.85, Accuracy: 0.85

precision recall f1-score support
Forum 1 0.909091 0.952381 11
Information/Explanation 0.8 0.615385 0.695652 13
Instruction 0.6 0.857143 0.705882 7
Legal 1 1 1 10
News 0.9 0.9 0.9 10
Opinion/Argumentation 1 0.769231 0.869565 13
Promotion 0.5 0.833333 0.625 6
Prose/Lyrical 1 1 1 10

MaCoCu-ca

Macro f1: 0.827, Micro f1: 0.825, Accuracy: 0.825

precision recall f1-score support
Forum 0.8 0.888889 0.842105 9
Information/Explanation 0.9 0.6 0.72 15
Instruction 0.6 1 0.75 6
Legal 0.9 1 0.947368 9
News 0.7 1 0.823529 7
Opinion/Argumentation 0.8 0.888889 0.842105 9
Promotion 0.9 0.692308 0.782609 13
Prose/Lyrical 1 0.833333 0.909091 12

MaCoCu-tr

Macro f1: 0.899, Micro f1: 0.9, Accuracy: 0.9

precision recall f1-score support
Forum 0.8 1 0.888889 8
Information/Explanation 0.7 1 0.823529 7
Instruction 0.9 0.9 0.9 10
Legal 1 1 1 10
News 1 0.909091 0.952381 11
Opinion/Argumentation 0.9 0.75 0.818182 12
Promotion 0.9 0.818182 0.857143 11
Prose/Lyrical 1 0.909091 0.952381 11

MaCoCu-uk

Macro f1: 0.948, Micro f1: 0.95, Accuracy: 0.95

precision recall f1-score support
Forum 0.9 1 0.947368 9
Information/Explanation 1 1 1 10
Instruction 1 0.909091 0.952381 11
Legal 1 1 1 10
News 1 1 1 10
Opinion/Argumentation 1 0.833333 0.909091 12
Promotion 0.7 0.875 0.777778 8
Prose/Lyrical 1 1 1 10

More information on sample evaluation

First batch of languages: Slovenian, Croatian, Macedonian and Albanian

For the sample, I randomly sampled 10 instances of each of the genre classes from the first 100,000 texts in the corpora, i.e. 90 instances per corpus. I included "Other" as a label in the sample. However, as this label is mostly there so that the classifier can use it for harder examples, I tried to identify the actual label of these texts during manual annotation, so most of the texts labelled "Other" were manually annotated as something else.

I evaluated three corpora: CLASSLA-web.sl, CLASSLA-web.hr and CLASSLA-web.mk. After the two rounds of evaluation of these three corpora, I also evaluated the Albanian corpus: MaCoCu-sq.

Second batch: extended evaluation to all other MaCoCu corpora

The sample was prepared in the same way, except that we randomly sampled the texts from the entire corpus (not from the first 100,000 texts as in the first batch). The second difference is that we did not include the "Other" label in the sample, because it is used as a "throw-away" category for when the classifier does not recognize the text as belonging to any other, more concrete genre.

Label distribution (y_true)

This is the distribution after the additionally evaluated instances were added.

For calculating the metrics of the classifier's performance, I remove "Other" and "Multiple texts" (also "Incomprehensible" in the case of Albanian) texts from the sample. Thus, we compare only the predictions for 8 labels, each having 10 instances.
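
In code, this filtering and the headline metrics could be computed roughly as follows (a sketch; the problematic-text labels follow the description above):

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(sample):
    """Score y_pred against y_true after discarding problematic texts."""
    discard = ["Other", "Multiple texts", "Incomprehensible"]
    kept = sample[~sample["y_true"].isin(discard)]
    return {
        "accuracy": accuracy_score(kept["y_true"], kept["y_pred"]),
        "micro_f1": f1_score(kept["y_true"], kept["y_pred"], average="micro"),
        "macro_f1": f1_score(kept["y_true"], kept["y_pred"], average="macro"),
    }
```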

Initial distribution (before post-processing) - OLD RESULTS (before cleaning non-target languages):

MaCoCu-mt:

y_true count
Instruction 19
News 16
Information/Explanation 13
Promotion 12
Legal 10
Opinion/Argumentation 8
Multiple texts (3%) 2
Forum 1
Prose/Lyrical 1

MaCoCu-el:

y_true count
Opinion/Argumentation 13
Information/Explanation 13
Forum 11
Prose/Lyrical 10
Legal 10
News 10
Instruction 7
Promotion 6
Multiple texts (7%) 6

MaCoCu-tr:

y_true count
Opinion/Argumentation 12
News 11
Prose/Lyrical 11
Promotion 11
Instruction 10
Legal 10
Forum 8
Information/Explanation 7
Multiple texts (4.7%) 4
Other (1.17%) 1

MaCoCu-sq:

y_true count
Information/Explanation 17
Opinion/Argumentation 12
Forum 12
Prose/Lyrical 12
Legal 9
Promotion 9
Instruction 9
News 8
Other (4.12%) 4
Multiple texts (3.1%) 3
Incomprehensible (2.1%) 2

MaCoCu-is:

y_true count
Promotion 13
Opinion/Argumentation 12
News 12
Prose/Lyrical 10
Legal 9
Forum 9
Instruction 8
Information/Explanation 7
Multiple texts (7.95%) 7
Incomprehensible (1.14%) 1

MaCoCu-ca:

y_true count
Information/Explanation 15
Promotion 13
Prose/Lyrical 12
Forum 9
Legal 9
Opinion/Argumentation 9
News 7
Instruction 6
Multiple texts (3.5%) 3
Incomprehensible (2.35%) 2

MaCoCu-uk:

y_true count
Opinion/Argumentation 12
Instruction 11
Prose/Lyrical 10
Legal 10
News 10
Information/Explanation 10
Forum 9
Promotion 8
Multiple texts (9%) 8

In the CLASSLA-web corpora, we initially also annotated the "Other" label, which is why it is more present here than in the other corpora.

CLASSLA-mk:

y_true count
Promotion 13
News 12
Prose/Lyrical 11
Opinion/Argumentation 11
Information/Explanation 11
Forum 10
Instruction 10
Legal 9
Multiple texts (8.3%) 8
Other (1%) 1

CLASSLA-hr:

y_true count
Promotion 16
Prose/Lyrical 12
Forum 12
Information/Explanation 11
Legal 10
News 9
Opinion/Argumentation 8
Multiple texts (7.2%) 7
Other (6.19%) 6
Instruction 6

CLASSLA-sl:

y_true count
Promotion 13
Prose/Lyrical 13
Legal 11
Information/Explanation 11
News 10
Instruction 10
Forum 10
Multiple texts (8.25%) 8
Opinion/Argumentation 8
Other (3.1%) 3

Number of texts annotated as problematic ("Multiple texts") - mostly, these were not coherent texts (just a list of summaries or multiple texts concatenated):

  • Slovenian, Croatian, Macedonian, Icelandic, Greek, Ukrainian: 6-9%
  • Albanian, Maltese, Turkish, Catalan: 3-5% - there were fewer problematic texts. However, in the Albanian sample there were also some incomprehensible texts (2% of texts), probably due to bad machine translation.

Improved sample evaluation - comparison with the first run

I re-did the evaluation on the same dataset, but after some improvements: the texts were clearly separated into paragraphs, which made them much more comprehensible and easier to understand, and I used the doccano annotation platform, which made annotation easier and (slightly) faster.

When I compared my annotations from the two runs, there were disagreements between the labels in 40 instances - 15% of instances.
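
The disagreement rate can be obtained by aligning the two annotation rounds and comparing their labels (a sketch; the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical file and column names, for illustration only
run1 = pd.read_json("annotations-run1.jsonl", lines=True)
run2 = pd.read_json("annotations-run2.jsonl", lines=True)

merged = run1.merge(run2, on="text_id", suffixes=("_run1", "_run2"))
disagreements = merged[merged["label_run1"] != merged["label_run2"]]
print(len(disagreements), f"({len(disagreements) / len(merged):.0%} of instances)")
```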

The reasons for the differences:

reason frequency (# instances)
improved second run (text annotated with a label, instead of "problematic") 8
detected "multiple texts" that I didn't in the first run 9
inter-annotator disagreement 23

So we can see that the improved methodology (better text presentation, the use of doccano and the paragraph structure) improved the annotation of 17 cases - 6% of all texts (either we were able to annotate texts that were previously incomprehensible, or we detected problematic texts that we previously missed due to the lack of text structure).

In 23 instances (9%), there was inter-annotator disagreement, which shows the difficulty of this task. However, 8 of these cases appeared when annotating texts that were predicted as "Other", which we decided not to include in the annotation campaign anyway. If we disregard these cases, disagreement happened in 15 instances - 6% of texts.
