Derivative Corpus in German

Code and data for a corpus of German derivatives in context from Reddit.

The data can be downloaded from

The dataset was created to finetune a BERT model on a derivative prediction task. If you do not need the data for this purpose, read the section "Derivatives ONLY".

Setup requirements

Install Python 3.9, then install the dependencies from requirements.txt
Download the DeReKo-2014-II-MainArchive-STT.100000.freq file from here into the data directory

Create German derivative dataset

Download Reddit data

Download all the Reddit comments from here into the reddit folder. The files are named RC_year_month.zst.

You can use the scripts/download_ds.py script for this.

Filter Reddit data for German content

Once the Reddit data is downloaded, we have to filter it for German comments. This step produces one file of German comments for each month/year.

Download the fastText model lid.176.bin and, in scripts/get_german_comments.py, set the path to where your model is stored
Run scripts/get_german_comments.py
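
The filtering idea can be sketched as follows. This is a minimal illustration, not the code in scripts/get_german_comments.py: the function name filter_german and the min_confidence threshold are assumptions. The predict argument is expected to behave like fastText's model.predict, which returns a tuple of labels (such as "__label__de") and their probabilities for one line of text.

```python
def filter_german(comments, predict, min_confidence=0.5):
    """Yield the comments that `predict` labels as German.

    `predict` should behave like fastText's `model.predict`: given one line
    of text it returns (labels, probabilities), best label first.
    `min_confidence` is an assumed threshold, not a value from the repository.
    """
    for text in comments:
        # fastText classifies one line at a time, so flatten newlines first.
        line = " ".join(text.split())
        if not line:
            continue
        labels, probs = predict(line)
        if labels and labels[0] == "__label__de" and probs[0] >= min_confidence:
            yield text
```

With the real model, this would be called roughly as filter_german(comments, fasttext.load_model("lid.176.bin").predict), with the path adjusted to where your copy of the model is stored.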

Search for derivatives

Now we can search for derivatives! This search produces one Excel table per month/year, containing the derivatives found in that month.

Run the script scripts/get_derivatives.py

Each file is an Excel file that contains, for each derivative:

affix: affix the derivative matches with
base: base of the derivative
count: frequency of the derivative
in_lexica: True/False, whether the derivative is in the DeReKo lexica
stem: stem of the derivative
mode: prefixed, suffixed, or both prefixed and suffixed
context: list of contexts the derivative appears in
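
To make the row layout concrete, here is a small sketch of one table row as a Python dataclass. The class name DerivativeEntry and the add_occurrence helper are hypothetical, the example values are illustrative and not taken from the corpus, and it is an assumption that count grows by one with every stored context.

```python
from dataclasses import dataclass, field


@dataclass
class DerivativeEntry:
    """One row of the derivative table (field names follow the columns above)."""
    affix: str
    base: str
    stem: str
    mode: str
    in_lexica: bool
    count: int = 0
    context: list = field(default_factory=list)

    def add_occurrence(self, sentence):
        # Assumption: every occurrence adds one to the frequency
        # and stores the sentence it was found in.
        self.count += 1
        self.context.append(sentence)


# Illustrative values only; the actual mode labels used by the scripts may differ.
entry = DerivativeEntry(affix="bar", base="lesen", stem="les",
                        mode="suffix", in_lexica=True)
entry.add_occurrence("Der Text ist gut lesbar.")
```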

Join data

We now have one table of derivatives for each month. To join these tables, run the script scripts/join_data.py
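
Conceptually, joining means collapsing rows that describe the same derivative across months. The sketch below is one way to do that with plain dictionaries. It is not the logic of scripts/join_data.py: the function name join_monthly_tables is hypothetical, and it is an assumption that a derivative is identified by its affix, base, and mode, that counts are summed, and that contexts are concatenated.

```python
def join_monthly_tables(tables):
    """Merge monthly row lists into one list with one row per derivative.

    Each row is a dict with at least the keys
    "affix", "base", "mode", "count" and "context".
    """
    merged = {}
    for rows in tables:
        for row in rows:
            # Assumed identity of a derivative across months.
            key = (row["affix"], row["base"], row["mode"])
            if key not in merged:
                # Copy the row so the input tables are not modified.
                merged[key] = {**row, "context": list(row["context"])}
            else:
                merged[key]["count"] += row["count"]
                merged[key]["context"].extend(row["context"])
    return list(merged.values())
```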

Finetuning prep: Split data in training/test/dev set

This step splits the data into train/dev/test sets under two conditions, SHARED and SPLIT. In SHARED, the data is split by context; in SPLIT, it is split by derivative. This means that in the SPLIT condition, each derivative appears in exactly one of the train, dev, and test sets, whereas in SHARED the same derivative can appear in several sets with different contexts.

Run the script scripts/finetuning_prep.py
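
The difference between the two conditions can be sketched as below. This is an illustration under assumptions, not the code in scripts/finetuning_prep.py: the function names, the 80/10/10 ratios, and the fixed seed are all hypothetical. The input is a list of (derivative, context) pairs.

```python
import random


def _cut(items, ratios):
    """Cut a list into train/dev/test parts; the test part gets the remainder."""
    n_train = int(len(items) * ratios[0])
    n_dev = int(len(items) * ratios[1])
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]


def split_shared(pairs, ratios=(0.8, 0.1), seed=0):
    """SHARED: shuffle the individual pairs, so a derivative may occur in several sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    return _cut(pairs, ratios)


def split_by_derivative(pairs, ratios=(0.8, 0.1), seed=0):
    """SPLIT: shuffle the derivatives, so all contexts of one derivative stay together."""
    pairs = list(pairs)
    derivatives = sorted({derivative for derivative, _ in pairs})
    random.Random(seed).shuffle(derivatives)
    result = []
    for part in _cut(derivatives, ratios):
        members = set(part)
        result.append([pair for pair in pairs if pair[0] in members])
    return tuple(result)
```

Splitting by derivative makes the sizes of the three sets depend on how many contexts each derivative has, so they only approximate the requested ratios.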

Derivatives ONLY

This dataset was developed to finetune a BERT model for derivative prediction. If you care about derivatives rather than BERT, you have to disable the check that every derivative is included in BERT's vocabulary. You can do this, for example, by always returning True in the methods check_token_stem_for_bert and token_in_bert in scripts/utils_reddit.py.

You also do not need the "finetuning prep" step.
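
Disabling the vocabulary check could look like the sketch below. The real signatures of the two functions in scripts/utils_reddit.py are not shown in this README, so the catch-all *args, **kwargs parameters are an assumption; keep the original parameter lists when editing the actual file.

```python
def check_token_stem_for_bert(*args, **kwargs):
    # Accept every derivative, regardless of BERT's vocabulary.
    return True


def token_in_bert(*args, **kwargs):
    # Treat every token as if it were in BERT's vocabulary.
    return True
```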

gueneumann / derivativecorpusgerman
