sinaahmadi / scriptnormalization
Script Normalization for Unconventional Writing of Perso-Arabic scripts (ACL2023)

Home Page: https://huggingface.co/spaces/SinaAhmadi/ScriptNormalization


Script Normalization for Unconventional Writing

Perso-Arabic scripts that are targeted in this study.
[📑 ACL 2023 Paper] [📝 Slides] [📽️ Presentation] [📀 Datasets] [⚙️ Demo]

This repository contains the data and the models described in the ACL2023 paper "Script Normalization for Unconventional Perso-Arabic Writing". The models are deployed on HuggingFace: Demo 🔥


What is unconventional writing?

  • "mar7aba!"
  • "هاو ئار یوو؟"
  • "Μπιάνβενου α σε προζέ!"

What do all these sentences have in common? You were greeted in Arabic with "mar7aba" written in the Latin script, then asked how you are ("هاو ئار یوو؟") in English using the Perso-Arabic script of Kurdish, and then welcomed to this demo in French ("Μπιάνβενου α σε προζέ!") written in the Greek script. All these sentences are written in an unconventional script.

Although you may find these sentences risible, unconventional writing is a common practice among millions of speakers in bilingual communities. In our paper entitled "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities", we shed light on this problem and propose an approach to normalize noisy text written in unconventional writing.

This repository provides the code and datasets that can be used to reproduce our paper or extend it to other languages. The current project focuses on some of the main languages that use a Perso-Arabic script, namely Azeri Turkish, Gilaki, Gorani, Kashmiri, Kurdish, Mazanderani, and Sindhi.

Please note that this project does not aim for spell-checking and cannot correct errors beyond character normalization.

Corpora

The files in the corpus folder whose names include wiki (followed by the date of the dump) have been extracted from Wikipedia dumps and cleaned using wikiextractor. Here are the sources of the material for the other languages:

All the corpora are cleaned to a reasonable extent.

Wordlists

Wordlists in the wordlist folder contain words extracted from the corpora based on a frequency threshold. Depending on the size and quality of the data, the threshold is in the range of 3 to 10, i.e. words appearing with at least that frequency are extracted as the vocabulary of the language.

To extract words from the corpora, run the following:

```
cat <file> |  tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -n > <file_wordlist>
```
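The extracted counts can then be filtered by the frequency threshold. Here is a minimal Python sketch of that filtering step, assuming `uniq -c`-style input lines; the function name and threshold default are illustrative, not part of the repository's scripts.

```python
# Filter a `sort | uniq -c`-style wordlist by a minimum frequency.
# The cutoff (3 to 10 in this project, depending on corpus size and
# quality) and the function name are illustrative assumptions.

def filter_wordlist(lines, min_freq=3):
    """Keep words whose corpus frequency is at least min_freq."""
    vocab = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip malformed lines
            continue
        count, word = parts
        if int(count) >= min_freq:
            vocab.append(word)
    return vocab

sample = ["      1 rare", "      4 common", "     12 frequent"]
print(filter_wordlist(sample, min_freq=3))  # ['common', 'frequent']
```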

Following this, words common to the source and target languages are identified and stored in the common folder; for the target languages, dictionaries are used. Common words may be written with identical or slightly different spellings, and the folder is organized into two sub-folders:

  • corpus-based contains files of common words in two languages based on a corpus
  • dictionary-based contains files of common words in two languages extracted from dictionaries

If the source language has a dictionary, the common words are provided in the dictionary-based folder; otherwise, check the corpus-based folder. The merged vocabularies (based on both corpus and dictionary) are provided directly in the common folder. If no dictionary is available for the source language, the common files in common are identical to those in corpus-based.
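For words written with identical spelling, the corpus-based comparison boils down to a set intersection of the two wordlists. A minimal sketch follows; the function and variable names are illustrative, and detecting the slightly different spellings mentioned above would additionally require fuzzy matching (e.g. by edit distance), which is not shown here.

```python
# Words shared verbatim between a source and a target vocabulary,
# as in the corpus-based files of the common folder. Illustrative
# sketch only; it does not handle slightly different spellings.

def common_words(source_vocab, target_vocab):
    """Return the sorted list of words present in both vocabularies."""
    return sorted(set(source_vocab) & set(target_vocab))

gilaki = {"کتاب", "خانه", "آب", "سره"}     # toy source vocabulary
persian = {"کتاب", "خانه", "آب", "مدرسه"}  # toy target vocabulary
print(common_words(gilaki, persian))
```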

Scripts

Information about the target scripts can be found at data/scripts as follows:

| Language | Target Script | Mapping |
|----------|---------------|---------|
| Kashmiri | Urdu | data/scripts/Kashmiri-Urdu.tsv |
| Sindhi | Urdu | data/scripts/Sindhi-Urdu.tsv |
| Mazanderani | Persian | data/scripts/Mazanderani-Persian.tsv |
| Gilaki | Persian | data/scripts/Gilaki-Persian.tsv |
| Azeri Turkish | Persian | data/scripts/AzeriTurkish-Persian.tsv |
| Gorani | Kurdish | data/scripts/Gorani-Kurdish.tsv |
| Gorani | Arabic | data/scripts/Gorani-Arabic.tsv |
| Gorani | Persian | data/scripts/Gorani-Persian.tsv |
| Kurdish | Arabic | data/scripts/Kurdish-Arabic.tsv |
| Kurdish | Persian | data/scripts/Kurdish-Persian.tsv |

Also, find more metadata about the usage of diacritics and the zero-width non-joiner (ZWNJ) in each language at data/scripts/info.json. A mapping of all the scripts is also provided at data/scripts/scripts_all.tsv.
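A mapping file like those above can be read into a dictionary with a few lines of Python. This is a sketch under the assumption that each TSV row holds a source character and its target equivalent in the first two columns; check the actual files for their exact layout.

```python
import csv

# Load a character mapping such as data/scripts/Gilaki-Persian.tsv
# into a dict. The two-column (source, target) layout is an assumed
# format, not documented in this README.

def load_mapping(path):
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                mapping[row[0]] = row[1]
    return mapping
```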

Character-alignment matrix (CAT)

Calculating the edit distance based on the wordlists, a character-alignment matrix (CAT) is created for each source-target language pair. This matrix contains the normalized probability that a character in one language appears as the equivalent of a given character in the other language; compare, for example, the letter 'ج' in بۆرج in Azeri Turkish with برج in Persian.

In addition to the edit distance, if there are rule-based mappings in the data/scripts folder, the CAT is updated accordingly (by adding 1 for each mapping). Finally, any replacement with a score < 0.1 is removed from the matrix.
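As a rough illustration of the idea (not the project's create_CAT.py), the sketch below counts character correspondences over word pairs, adds 1 for each rule-based mapping, normalizes per source character, and prunes scores below 0.1. For simplicity it aligns characters positionally in equal-length pairs; the real pipeline derives alignments from the edit distance, which also handles pairs of unequal length such as بۆرج/برج.

```python
from collections import Counter, defaultdict

# Toy character-alignment matrix (CAT). Positional alignment of
# equal-length word pairs stands in for the project's edit-distance
# alignment; function names and defaults are illustrative.

def build_cat(word_pairs, rule_mappings=(), prune=0.1):
    counts = defaultdict(Counter)
    for src_word, tgt_word in word_pairs:
        for s, t in zip(src_word, tgt_word):
            counts[s][t] += 1
    for s, t in rule_mappings:  # each rule-based mapping adds 1
        counts[s][t] += 1
    cat = {}
    for s, row in counts.items():
        total = sum(row.values())
        # keep only replacements with a normalized score >= prune
        cat[s] = {t: c / total for t, c in row.items() if c / total >= prune}
    return cat

# Arabic kaf (ك) observed where Persian kaf (ک) is expected:
print(build_cat([("كتاب", "کتاب")])["ك"])  # {'ک': 1.0}
```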

Datasets

The synthetic datasets are available on Google Drive due to their large size (2.39 GB as .tar.gz): link. The real data in Central Kurdish, written unconventionally in the Arabic and Persian scripts, can be found at data/real.

Create your own pipeline!

If you are interested in this project and want to extend it, here are the steps to consider:

  1. Add your corpus to the data/corpus folder.
  2. Update the code/config.json file and specify the directories of your data and other required files.
  3. Run extract_loanwords.py to extract common words (this script should be optimized!).
  4. Add a script mapping in TSV format to the data/scripts folder.
  5. Run create_CAT.py to create the character-alignment matrix.
  6. Run synthesize.py to generate synthetic data.

You can use any NMT training platform of your choice to train your models. In the paper, we use joeynmt, for which the configuration files are provided in the training folder. If you use SLURM, you can also use the scripts in training/SLURMs.

Related Projects

Check out the following related projects too:

Cite this paper

If you use any part of the data, please consider citing this paper as follows:

@inproceedings{ahmadi2023acl,
    title = "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities",
    author = "Ahmadi, Sina and Anastasopoulos, Antonios",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics"
}

License

Apache License

