DialQA

The Dialectal Extractive Question Answering Shared Task invites participants to build QA systems that are robust to dialectal variation. The task builds on two existing QA benchmarks, TyDi-QA and SD-QA: specifically, it uses portions of the SD-QA dataset, which recorded dialectal variations of TyDi-QA questions. Participants may (a) use the baseline automatic speech recognition (ASR) outputs for each dialect to build a robust text-based QA system, (b) use the provided audio recordings of the questions to build a dialect-robust ASR system, which is then evaluated with a baseline QA system, or (c) do both. The shared task provides development and test data for 5 varieties of English (Nigeria, USA, South India, Australia, Philippines), 4 varieties of Arabic (Algeria, Egypt, Jordan, Tunisia), and 2 varieties of Kiswahili (Kenya, Tanzania), as well as code for training baseline systems with modified TyDi-QA data. Any training data are allowed, except for the TyDi-QA data in the above 3 languages.

Requirements and installation

./install.sh

Data File Structure

data/
	dialqa-train.json
	dialqa-dev-og.json
	dialqa-dev-aug.json
	audio/
		dev/
			{lang}/
				{dialect-region}/
					{lang}-{id}-{dialect-region}.wav
  • dialqa-dev-og.json: Original Development dataset gold questions.
  • dialqa-dev-aug.json: Development dataset with dialectal questions (speech-to-text outputs produced by ASR). This is the task development dataset.
  • lang: English (eng), Arabic (ara), Kiswahili (swa)
  • audio: folder containing question audio files. The audio file names {lang}-{id}-{dialect-region} have one-to-one mappings with the example ids from the json files.
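
As a rough illustration of the layout above, the following sketch builds the expected path of a question recording from its language, id, and dialect region. The dialect codes are copied from the baseline results table and are assumed to match the folder names; the id "1234" and the helper name audio_path are hypothetical, and the exact form of the ids in the json files is not specified here.

```python
from pathlib import Path

# Language codes come from the list above. Dialect codes are taken from the
# baseline results table and are ASSUMED to match the folder names on disk.
DIALECTS = {
    "eng": ["nga", "usa", "ind_s", "aus", "phl"],
    "ara": ["dza", "egy", "jor", "tun"],
    "swa": ["ken", "tza"],
}


def audio_path(lang, example_id, dialect, root="data/audio/dev"):
    """Build the expected .wav path for one question recording.

    Follows the documented pattern:
    {root}/{lang}/{dialect-region}/{lang}-{id}-{dialect-region}.wav
    """
    if dialect not in DIALECTS.get(lang, []):
        raise ValueError(f"unknown language/dialect pair: {lang}/{dialect}")
    return Path(root) / lang / dialect / f"{lang}-{example_id}-{dialect}.wav"


# "1234" is a made-up id used only for illustration.
print(audio_path("swa", "1234", "ken").as_posix())
```

Checking that a path exists before loading it (for example with Path.exists()) is a cheap way to confirm whether these assumptions hold for a local copy of the data.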

Baseline (ASR QA)

The task is to perform extractive QA on dialectal questions (speech-to-text outputs). We use the Google Speech API with region-specific language codes (e.g., en-US, sw-TZ) to perform speech-to-text conversion. The training script is based on Hugging Face's run_squad.py.

Training baseline:


source vdial/bin/activate


python src/run_squad.py \
	--model_type bert \
	--model_name_or_path=bert-base-multilingual-uncased \
	--do_train \
	--do_lower_case \
	--train_file 'data/dialqa-train.json' \
	--per_gpu_train_batch_size 16 \
	--per_gpu_eval_batch_size 24 \
	--learning_rate 3e-5 \
	--num_train_epochs 3 \
	--max_seq_length 384 \
	--doc_stride 128 \
	--output_dir 'train_cache_output/' \
	--overwrite_cache \
	--overwrite_output_dir

Prediction on augmented dev data

python src/run_squad.py \
	--model_type bert \
	--model_name_or_path='train_cache_output' \
	--do_eval \
	--do_lower_case \
	--predict_file 'data/dialqa-dev-aug.json' \
	--per_gpu_train_batch_size 16 \
	--per_gpu_eval_batch_size 16 \
	--learning_rate 3e-5 \
	--num_train_epochs 3 \
	--max_seq_length 384 \
	--doc_stride 128 \
	--output_dir 'outputs/aug-mbert' \
	--overwrite_output_dir

Prediction on test data

python src/run_squad.py \
	--model_type bert \
	--model_name_or_path='train_cache_output' \
	--do_eval \
	--do_lower_case \
	--predict_file 'data/dialqa-test.json' \
	--per_gpu_train_batch_size 16 \
	--per_gpu_eval_batch_size 16 \
	--learning_rate 3e-5 \
	--num_train_epochs 3 \
	--max_seq_length 384 \
	--doc_stride 128 \
	--output_dir 'outputs/test-mbert' \
	--overwrite_output_dir
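
The prediction commands above differ only in the prediction file and the output directory, so they can be generated from one template. The sketch below mirrors the flags shown above exactly; the helper name build_eval_command and the output directory outputs/og-mbert (for the original gold questions) are assumptions added for illustration, not part of the repository.

```python
import shlex


def build_eval_command(predict_file, output_dir, model_dir="train_cache_output"):
    """Return the argv list for one prediction run, mirroring the flags above."""
    return [
        "python", "src/run_squad.py",
        "--model_type", "bert",
        "--model_name_or_path", model_dir,
        "--do_eval",
        "--do_lower_case",
        "--predict_file", predict_file,
        "--per_gpu_train_batch_size", "16",
        "--per_gpu_eval_batch_size", "16",
        "--learning_rate", "3e-5",
        "--num_train_epochs", "3",
        "--max_seq_length", "384",
        "--doc_stride", "128",
        "--output_dir", output_dir,
        "--overwrite_output_dir",
    ]


# Prediction file -> output directory. The "og" entry is hypothetical.
RUNS = {
    "data/dialqa-dev-aug.json": "outputs/aug-mbert",
    "data/dialqa-dev-og.json": "outputs/og-mbert",
}

for predict_file, output_dir in RUNS.items():
    # Print each command; nothing is executed here.
    print(shlex.join(build_eval_command(predict_file, output_dir)))
```

To actually launch a run from the activated environment, the list can be passed to subprocess.run(cmd, check=True). Scoring both the original and the augmented dev files with the same model is one way to gauge how much the ASR noise costs.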

Baseline results

Language-Dialect               F1     Exact Match  Example Count
English-Nigeria (nga)          73.36  58.70        494
English-United States (usa)    74.35  59.31        494
English-South India (ind_s)    72.22  58.10        494
English-Australia (aus)        73.67  59.52        494
English-Philippines (phl)      73.76  59.11        494
English-Dialect (avg)          73.47  58.95        2470
Arabic-Algeria (dza)           71.72  56.17        324
Arabic-Egypt (egy)             72.39  56.79        324
Arabic-Jordan (jor)            73.27  57.41        324
Arabic-Tunisia (tun)           73.55  57.71        324
Arabic-Dialect (avg)           72.73  57.02        1296
Kiswahili-Kenya (ken)          72.12  63.1         1000
Kiswahili-Tanzania (tza)       70.74  61.7         1000
Kiswahili-Dialect (avg)        71.43  62.4         2000
All Languages (avg)            72.60  59.71        5766
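
The overall row appears to be an average weighted by example count rather than a plain mean of the three per-language averages (a plain mean of the F1 values would give about 72.54). The sketch below recomputes it from the per-language average rows of the table; the helper name weighted_mean is an assumption for illustration, and this is a consistency check on the reported numbers, not the evaluation code itself.

```python
# (F1, exact match, example count) copied from the per-language average rows.
ROWS = {
    "English": (73.47, 58.95, 2470),
    "Arabic": (72.73, 57.02, 1296),
    "Kiswahili": (71.43, 62.4, 2000),
}


def weighted_mean(values, weights):
    """Average of `values`, with each value weighted by its example count."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


counts = [count for _, _, count in ROWS.values()]
overall_f1 = weighted_mean([f1 for f1, _, _ in ROWS.values()], counts)
overall_em = weighted_mean([em for _, em, _ in ROWS.values()], counts)

print(f"{overall_f1:.2f} {overall_em:.2f} {sum(counts)}")  # 72.60 59.71 5766
```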

Citation

The audio files and the augmented dataset come from SD-QA, which is built on top of TyDi QA.

@inproceedings{faisal-etal-2021-sd-qa,
    title = "{SD}-{QA}: Spoken Dialectal Question Answering for the Real World",
    author = "Faisal, Fahim  and
      Keshava, Sharlina  and
      Alam, Md Mahfuz Ibn  and
      Anastasopoulos, Antonios",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.281",
    doi = "10.18653/v1/2021.findings-emnlp.281",
    pages = "3296--3315",
}
@article{clark-etal-2020-tydi,
    title = "{T}y{D}i {QA}: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages",
    author = "Clark, Jonathan H.  and
      Choi, Eunsol  and
      Collins, Michael  and
      Garrette, Dan  and
      Kwiatkowski, Tom  and
      Nikolaev, Vitaly  and
      Palomaki, Jennimaria",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "8",
    year = "2020",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2020.tacl-1.30",
    doi = "10.1162/tacl_a_00317",
    pages = "454--470",
}

License

The data is available under the Apache License 2.0.
