sinaahmadi / scriptnormalization
Script Normalization for Unconventional Writing of Perso-Arabic scripts (ACL2023)

Home Page: https://huggingface.co/spaces/SinaAhmadi/ScriptNormalization


Script Normalization for Unconventional Writing

Perso-Arabic scripts that are targeted in this study.
[📑 ACL 2023 Paper] [📝 Slides] [📽️ Presentation] [📀 Datasets] [⚙️ Demo]

This repository contains the data and the models described in the ACL2023 paper "Script Normalization for Unconventional Perso-Arabic Writing". The models are deployed on HuggingFace: Demo 🔥


What is unconventional writing?

  • "mar7aba!"
  • "هاو ئار یوو؟"
  • "Μπιάνβενου α σε προζέ!"

What do all these sentences have in common? You were greeted in Arabic with "mar7aba" written in the Latin script, then asked how you are ("هاو ئار یوو؟") in English using the Perso-Arabic script of Kurdish, and then welcomed to this demo in French ("Μπιάνβενου α σε προζέ!") written in the Greek script. All these sentences are written in an unconventional script.

Although you may find these sentences risible, unconventional writing is a common practice among millions of speakers in bilingual communities. In our paper entitled "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities", we shed light on this problem and propose an approach to normalize noisy text written in unconventional writing.

This repository provides the code and datasets that can be used to reproduce our paper or extend it to other languages. The current project focuses on some of the main languages that use a Perso-Arabic script, namely Azeri Turkish, Gilaki, Gorani, Kashmiri, Kurdish, Mazanderani, and Sindhi.

Please note that this project does not aim for spell-checking and cannot correct errors beyond character normalization.

Corpora

The files in the corpus folder whose names include wiki (followed by the date of the dump) have been extracted from Wikipedia dumps and cleaned using wikiextractor. Here are the sources of the material for the other languages:

All the corpora are cleaned to a reasonable extent.

Wordlists

Wordlists in the wordlist folder contain words extracted from the corpora based on a frequency threshold. Depending on the size and quality of the data, the threshold is in the range of 3 to 10, i.e. words appearing with at least that frequency are extracted as the vocabulary of the language.

To extract words from the corpora, run the following:

```
cat <file> |  tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -n > <file_wordlist>
```
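The extracted counts can then be filtered by the frequency threshold. Here is a minimal Python sketch of that filtering step, assuming `uniq -c`-style input lines; the function name and threshold default are illustrative, not part of the repository's scripts.

```python
# Filter a `sort | uniq -c`-style wordlist by a minimum frequency.
# The cutoff (3 to 10 in this project, depending on corpus size and
# quality) and the function name are illustrative assumptions.

def filter_wordlist(lines, min_freq=3):
    """Keep words whose corpus frequency is at least min_freq."""
    vocab = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip malformed lines
            continue
        count, word = parts
        if int(count) >= min_freq:
            vocab.append(word)
    return vocab

sample = ["      1 rare", "      4 common", "     12 frequent"]
print(filter_wordlist(sample, min_freq=3))  # ['common', 'frequent']
```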

Following this, words common to the source and target languages are identified and stored in the common folder; for the target languages, dictionaries are used. Common words may be written with identical or slightly different spellings, and the folder is organized into two sub-folders:

  • corpus-based contains files of common words in two languages based on a corpus
  • dictionary-based contains files of common words in two languages extracted from dictionaries

If the source language has a dictionary, the common words are provided in the dictionary-based folder; otherwise, check the corpus-based folder. The merged vocabularies (based on both corpus and dictionary) are provided directly in the common folder. If no dictionary is available for the source language, the common files in common are identical to those in corpus-based.
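For words written with identical spelling, the corpus-based comparison boils down to a set intersection of the two wordlists. A minimal sketch follows; the function and variable names are illustrative, and detecting the slightly different spellings mentioned above would additionally require fuzzy matching (e.g. by edit distance), which is not shown here.

```python
# Words shared verbatim between a source and a target vocabulary,
# as in the corpus-based files of the common folder. Illustrative
# sketch only; it does not handle slightly different spellings.

def common_words(source_vocab, target_vocab):
    """Return the sorted list of words present in both vocabularies."""
    return sorted(set(source_vocab) & set(target_vocab))

gilaki = {"کتاب", "خانه", "آب", "سره"}     # toy source vocabulary
persian = {"کتاب", "خانه", "آب", "مدرسه"}  # toy target vocabulary
print(common_words(gilaki, persian))
```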

Scripts

Information about the target scripts can be found at data/scripts as follows:

| Language | Target Script | Mapping |
|----------|---------------|---------|
| Kashmiri | Urdu | data/scripts/Kashmiri-Urdu.tsv |
| Sindhi | Urdu | data/scripts/Sindhi-Urdu.tsv |
| Mazanderani | Persian | data/scripts/Mazanderani-Persian.tsv |
| Gilaki | Persian | data/scripts/Gilaki-Persian.tsv |
| Azeri Turkish | Persian | data/scripts/AzeriTurkish-Persian.tsv |
| Gorani | Kurdish | data/scripts/Gorani-Kurdish.tsv |
| Gorani | Arabic | data/scripts/Gorani-Arabic.tsv |
| Gorani | Persian | data/scripts/Gorani-Persian.tsv |
| Kurdish | Arabic | data/scripts/Kurdish-Arabic.tsv |
| Kurdish | Persian | data/scripts/Kurdish-Persian.tsv |

Also, find more metadata about the usage of diacritics and the zero-width non-joiner (ZWNJ) in each language at data/scripts/info.json. A mapping of all the scripts is also provided at data/scripts/scripts_all.tsv.
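A mapping file like those above can be read into a dictionary with a few lines of Python. This is a sketch under the assumption that each TSV row holds a source character and its target equivalent in the first two columns; check the actual files for their exact layout.

```python
import csv

# Load a character mapping such as data/scripts/Gilaki-Persian.tsv
# into a dict. The two-column (source, target) layout is an assumed
# format, not documented in this README.

def load_mapping(path):
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                mapping[row[0]] = row[1]
    return mapping
```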

Character-alignment matrix (CAT)

Calculating the edit distance based on the wordlists, a character-alignment matrix (CAT) is created for each source-target language pair. This matrix contains the normalized probability that a character in one language appears as the equivalent of a given character in the other language; compare, for example, the letter 'ج' in بۆرج in Azeri Turkish with برج in Persian.

In addition to the edit distance, if there are rule-based mappings in the data/scripts folder, the CAT is updated accordingly (by adding 1 for each mapping). Finally, any replacement with a score < 0.1 is removed from the matrix.
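As a rough illustration of the idea (not the project's create_CAT.py), the sketch below counts character correspondences over word pairs, adds 1 for each rule-based mapping, normalizes per source character, and prunes scores below 0.1. For simplicity it aligns characters positionally in equal-length pairs; the real pipeline derives alignments from the edit distance, which also handles pairs of unequal length such as بۆرج/برج.

```python
from collections import Counter, defaultdict

# Toy character-alignment matrix (CAT). Positional alignment of
# equal-length word pairs stands in for the project's edit-distance
# alignment; function names and defaults are illustrative.

def build_cat(word_pairs, rule_mappings=(), prune=0.1):
    counts = defaultdict(Counter)
    for src_word, tgt_word in word_pairs:
        for s, t in zip(src_word, tgt_word):
            counts[s][t] += 1
    for s, t in rule_mappings:  # each rule-based mapping adds 1
        counts[s][t] += 1
    cat = {}
    for s, row in counts.items():
        total = sum(row.values())
        # keep only replacements with a normalized score >= prune
        cat[s] = {t: c / total for t, c in row.items() if c / total >= prune}
    return cat

# Arabic kaf (ك) observed where Persian kaf (ک) is expected:
print(build_cat([("كتاب", "کتاب")])["ك"])  # {'ک': 1.0}
```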

Datasets

The synthetic datasets are available on Google Drive due to their large size (2.39 GB as .tar.gz): link. The real data in Central Kurdish, written unconventionally in the Arabic and Persian scripts, can be found at data/real.

Create your own pipeline!

If you are interested in this project and want to extend it, here are the steps to consider:

  1. Add your corpus to the data/corpus folder.
  2. Update the code/config.json file and specify the directories of your data and other required files.
  3. Run extract_loanwords.py to extract common words (this script should be optimized!).
  4. Add a script mapping in TSV format to the data/scripts folder.
  5. Run create_CAT.py to create the character-alignment matrix.
  6. Run synthesize.py to generate synthetic data.

You can use any NMT training platform of your choice to train your models. In the paper, we use joeynmt, for which the configuration files are provided in the training folder. If you use SLURM, you can also use the scripts in training/SLURMs.

Related Projects

Check out the following related projects too:

Cite this paper

If you use any part of the data, please consider citing this paper as follows:

@inproceedings{ahmadi2023acl,
    title = "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities",
    author = "Ahmadi, Sina and Anastasopoulos, Antonios",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics"
}

License

Apache License

