
SD-QA

Data File Structure

dev/
    -lang/
        -language.dev.csv
        -language.dev.txt
        -dialect/
            -language.dev.dialect-ASR.txt.jsonl.gz
            -language.dev.dialect-ASR.txt
            -metadata.csv
            -wav_lang/
                -ID.wav

test/
    -lang/
        -language.test.csv
        -language.test.txt
        -dialect/
            -language.test.dialect-ASR.txt.jsonl.gz
            -language.test.dialect-ASR.txt
            -metadata.csv
            -wav_lang/
                -ID.wav

asr_metadata/
    -dev/
        -asr_output_with_metadata_lang.csv
    -test/
        -asr_output_with_metadata_lang.csv
  • lang: language code, e.g., eng, ara
  • language.dev.csv: language-specific CSV file containing gold and ASR transcripts for all dialects
  • language.dev.txt: language-specific text file containing the gold data
  • language.dev.dialect-ASR.txt.jsonl.gz: language- and dialect-specific TyDi-QA-format data file (gold question replaced with the ASR transcript)
  • metadata.csv: metadata file (example-ID-to-user-ID mapping, with additional info for each dialect and language)
  • wav_lang: folder containing the audio files
  • asr_output_with_metadata_lang.csv: single language-specific CSV file containing all metadata and transcripts, with the word error rate for each example instance
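For orientation, here is a minimal sketch of reading one of the dialect-specific data files. The example_id and question_text fields follow the TyDi QA JSONL schema, but the concrete language/dialect codes in the path are assumptions:

import gzip
import json

# Hypothetical path: an English dev file for one dialect ("nga" is assumed).
path = "dev/eng/nga/eng.dev.nga-ASR.txt.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        # In SD-QA the gold question is replaced by the ASR transcript,
        # so question_text holds the transcript of the spoken question.
        print(example["example_id"], example["question_text"])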

WER-based evaluation on ASR outputs
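The asr_output_with_metadata_lang.csv files above report a per-example word error rate between the gold question and its ASR transcript. As a reference point, here is a minimal self-contained sketch of the standard WER computation (word-level edit distance divided by reference length); it illustrates the metric, not the repo's exact evaluation script:

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deletion ("the") and one substitution ("kenya" -> "kenia") over 6 words.
print(wer("what is the capital of kenya", "what is capital of kenia"))  # 0.333...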

Comparative minimal answer predictions for error analysis
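One way to set up such an analysis is to align each dialect's predicted minimal answers by example ID and inspect where they disagree. The prediction paths and the example_id/minimal_answer field names below are hypothetical placeholders, not the repo's actual output schema:

import json

def load_predictions(path):
    """Map example_id -> predicted minimal answer (hypothetical schema)."""
    preds = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            p = json.loads(line)
            preds[p["example_id"]] = p["minimal_answer"]
    return preds

usa = load_predictions("dev_predict/eng.dev.usa.jsonl")  # hypothetical paths
nga = load_predictions("dev_predict/eng.dev.nga.jsonl")

# Examples where the two dialects' predictions disagree.
diffs = {i: (usa[i], nga[i]) for i in usa.keys() & nga.keys() if usa[i] != nga[i]}
print(f"{len(diffs)} / {len(usa)} examples differ across dialects")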

Baseline: TyDi QA

We train a TyDi QA baseline model for the primary task evaluation. Instead of using the original training data, we use the discard_dev version, in which the SD-QA development questions are discarded from the training data.

Available model and training data for download:

Experimenting with a primary task baseline

Detailed steps to train a TyDi QA primary task baseline model are given here.

Prepare the training samples:
python3 baselines/tydiqa/baseline/prepare_tydi_data.py \
  --input_jsonl=tydiqa_data/tydiqa-v1.0-train-discard-dev.jsonl.gz \
  --output_tfrecord=tydiqa_data/train_tf/train_samples.tfrecord \
  --vocab_file=baselines/tydiqa/baseline/mbert_modified_vocab.txt \
  --record_count_file=tydiqa_data/train_tf/train_samples_record_count.txt \
  --include_unknowns=0.1 \
  --is_training=true
Prepare dev samples from all language- and dialect-specific ASR outputs:
./experiments/test_prep.sh tydiqa_data/dev tydiqa_data/dev_tf
Prepare test samples from all language- and dialect-specific ASR outputs:
./experiments/test_prep.sh tydiqa_data/test tydiqa_data/test_tf
Train:
python3 baselines/tydiqa/baseline/run_tydi.py \
  --bert_config_file=mbert_dir/bert_config.json \
  --vocab_file=baselines/tydiqa/baseline/mbert_modified_vocab.txt \
  --init_checkpoint=mbert_dir/bert_model.ckpt \
  --train_records_file=tydiqa_data/train_tf/train_samples.tfrecord \
  --record_count_file=tydiqa_data/train_tf/train_samples_record_count.txt \
  --do_train \
  --output_dir=trained_models/
Predict

Once the model is trained, we run inference on the dev/test set:

dev:

./experiments/test_predict.sh \
tydiqa_data/dev tydiqa_data/dev_predict tydiqa_data/dev_tf \
trained_models/model.ckpt discard_dev mbert_dir

test:

./experiments/test_predict.sh \
tydiqa_data/test tydiqa_data/test_predict tydiqa_data/test_tf \
trained_models/model.ckpt discard_dev mbert_dir
  • To point --init_checkpoint at the trained checkpoint, replace trained_models/model.ckpt with its actual location.
  • Replace mbert_dir with the location of the downloaded mBERT model.
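For example, if the trained checkpoint was saved to /path/to/trained_models/model.ckpt and mBERT was downloaded to /path/to/mbert (both placeholder locations), the dev command becomes:

./experiments/test_predict.sh \
tydiqa_data/dev tydiqa_data/dev_predict tydiqa_data/dev_tf \
/path/to/trained_models/model.ckpt discard_dev /path/to/mbert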
Evaluate
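A typical setup scores the prediction files with the official TyDi QA evaluation script (tydi_eval.py from the TyDi QA repository); the paths below are placeholders:

python3 tydi_eval.py \
  --gold_path=path/to/gold.jsonl.gz \
  --predictions_path=path/to/predictions.jsonl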

Citation

If you use SD-QA, please cite the paper "SD-QA: Spoken Dialectal Question Answering for the Real World". You can use the following BibTeX entry:

@inproceedings{faisal-etal-21-sdqa,
  title = {{SD-QA}: {S}poken {D}ialectal {Q}uestion {A}nswering for the {R}eal {W}orld},
  author = {Faisal, Fahim and Keshava, Sharlina and ibn Alam, Md Mahfuz and Anastasopoulos, Antonios},
  url = {https://arxiv.org/abs/2109.12072},
  year = {2021},
  booktitle = {Findings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings)},
  publisher = {Association for Computational Linguistics},
  month = {November},
}

We built our augmented dataset and baselines on top of TyDi QA. Please also make sure to cite the original TyDi QA paper:

@article{tydiqa,
  title   = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages},
  author  = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki},
  journal = {TACL},
  year    = {2020}
}

License

Both the code and data for SD-QA are available under the Apache License 2.0.
