DWLG

This repository contains scripts for scraping and converting data from German language courses on the Deutsche Welle – Learn German website, and for reproducing the DWLG dataset.

TL;DR

To reproduce the DWLG dataset, the easiest way is to use the Docker image on DockerHub. You only need to install Docker and run the following commands.

mkdir data
docker run --rm -v "$PWD/data:/app/data" saeub/dwlg

The raw data will first be downloaded into data/raw/ and then extracted, converted, and split into data/splits/{train,dev,test}.jsonl.
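Once the split files exist, they can be inspected with a few lines of Python. The following is a minimal sketch that assumes only that each line of a split file holds one JSON object. The helper name load_split is hypothetical and not part of this repository, and no particular field names are assumed.

```python
import json
from pathlib import Path


def load_split(path):
    """Read a JSONL file and return its records as a list."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Print the number of records in each split that is present on disk.
    for name in ("train", "dev", "test"):
        split_path = Path("data/splits") / f"{name}.jsonl"
        if split_path.exists():
            print(name, len(load_split(split_path)))
```

Run from the directory that contains data/, this prints one line per split with the number of records it holds.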

Requirements for running without Docker

If you don't want to use Docker, or if you want to modify the source code, the following steps are required:

  • Install Python >= 3.10 as well as Google Chrome and ChromeDriver.
  • Clone this repository: git clone https://github.com/saeub/dwlg
  • Install Python dependencies: pip install -r requirements.txt (virtual environment recommended)

Scripts

scrape_and_extract.sh

Usage with Docker: docker run --rm -v "$PWD/data:/app/data" saeub/dwlg (the folder data/ must exist in the working directory)

Usage without Docker: bash scrape_and_extract.sh

This script runs scrape.py and extract.py to reproduce the DWLG dataset. Data will be stored in data/.

scrape.py

Usage with Docker: docker run --rm -v "$PWD/data:/app/data" saeub/dwlg python scrape.py

Usage without Docker: python scrape.py

This script downloads all the "Top-Thema" lessons and saves the raw data under data/raw/.

NOTE: scrape.py will not re-download lessons that already exist under data/raw/. If you want to re-download everything, remove all the files in data/raw/.
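The skip behavior described in the note can be pictured as follows. This is an illustrative sketch only, not the code of scrape.py: the function name lessons_to_download, the one-file-per-lesson layout, and the .json file extension are all assumptions made for the example.

```python
from pathlib import Path


def lessons_to_download(lesson_ids, raw_dir="data/raw", suffix=".json"):
    """Return the lesson IDs that have no file in raw_dir yet.

    Assumption: each lesson is stored as <raw_dir>/<lesson_id><suffix>.
    The real layout used by scrape.py may differ.
    """
    raw_dir = Path(raw_dir)
    return [
        lesson_id
        for lesson_id in lesson_ids
        if not (raw_dir / f"{lesson_id}{suffix}").exists()
    ]
```

Under these assumptions, emptying data/raw/ makes every lesson ID eligible for download again, which matches the advice in the note.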

extract.py

Usage with Docker: docker run --rm -v "$PWD/data:/app/data" saeub/dwlg python extract.py <HASH-FILE.json> <SPLIT-NAME>

Usage without Docker: python extract.py <HASH-FILE.json> <SPLIT-NAME>

Run this script after scrape.py. It converts the raw data into the DWLG JSONL format.

  • <HASH-FILE.json> is the path to a JSON file containing the IDs and hashes of the lessons to be extracted. The hash files for DWLG are available in splits/. Use python extract.py splits/all.json all to extract all available data (including the most recent lessons that are not part of DWLG).
  • <SPLIT-NAME> is the name of the split. The resulting JSONL file will be stored as data/splits/<SPLIT-NAME>.jsonl.

NOTE: The script will warn you about hash mismatches when the data you scraped does not match the data in DWLG. The text or items in these lessons may have been changed since the latest version of DWLG. See the change log below for known changes.
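The idea behind such a hash check can be sketched as below. Everything specific here is an assumption for illustration: the hashing scheme (SHA-256 over JSON serialized with sorted keys), the shape of both dictionaries, and the function names content_hash and find_mismatches are hypothetical and do not describe how extract.py or the files in splits/ actually work.

```python
import hashlib
import json


def content_hash(lesson):
    """Hash lesson content deterministically.

    Assumed scheme: SHA-256 of the JSON serialization with sorted keys.
    """
    serialized = json.dumps(lesson, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def find_mismatches(lessons, expected_hashes):
    """Return the IDs whose current hash differs from the expected one.

    lessons: dict mapping lesson ID -> lesson content (hypothetical structure)
    expected_hashes: dict mapping lesson ID -> hex digest (hypothetical structure)
    IDs that are missing from lessons are ignored.
    """
    return [
        lesson_id
        for lesson_id, expected in expected_hashes.items()
        if lesson_id in lessons and content_hash(lessons[lesson_id]) != expected
    ]
```

With a scheme like this, even a single inserted whitespace character changes the digest, which is why small upstream edits show up as mismatches.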

Building the Docker image

Instead of pulling the image from DockerHub, you can build it yourself:

docker build -t saeub/dwlg .

DWLG change log

The following list contains known changes that Deutsche Welle made to courses that are part of the DWLG dataset (since May 2023). These changes are not reflected in the hash files in splits/ and will therefore cause a hash mismatch when running the extract.py script:

  • Lesson 64273452 (dev): A missing whitespace was inserted.
