This repository contains code and results for several experiments on the BoolQ dataset, a question-answering dataset of boolean (yes/no) questions in which each example is a (question, passage, answer) triplet.
The experiments are motivated by the original paper: "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions".
The models fall into three groups: baselines, classifiers on top of pretrained BERT embeddings, and a BERT model fine-tuned for this task.
Baselines
- Constant baseline: assign the majority class to all examples.
- FastText baseline: train unsupervised FastText embeddings, with early stopping based on the accuracy of a Logistic Regression classifier fitted on them.
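The constant baseline amounts to a few lines of code (a minimal sketch of the idea, not the exact code in baselines.py):

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test example
    and return (predicted_label, accuracy)."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return majority, correct / len(test_labels)

# BoolQ answers are booleans; toy labels stand in for the real data.
train = [True, True, False, True, False]
test = [True, False, True]
label, acc = majority_baseline(train, test)
```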
BERT
Separated
Use the concatenation of pretrained BERT embeddings for the question and the passage, [BERT(question), BERT(passage)], as feature vectors.
Fit Logistic Regression on top.
There are several sub-experiments in this approach:
- Pretrained BertModel from 🤗 Transformers (bert-base-uncased weights).
- BertModel with only question augmentations.
- BertModel with both question and passage augmentations.
- Pretrained DistilBertModel from 🤗 Transformers (distilbert-base-uncased weights).
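The feature construction above can be sketched as follows (a simplified illustration using random vectors in place of real pooled BERT outputs, which are 768-dimensional for bert-base-uncased):

```python
import numpy as np

HIDDEN = 768  # hidden size of bert-base-uncased

def make_features(q_embeddings, p_embeddings):
    """Concatenate question and passage embeddings into one
    [BERT(question), BERT(passage)] feature vector per example."""
    return np.concatenate([q_embeddings, p_embeddings], axis=1)

# Stand-ins for pooled BERT outputs of 4 examples.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, HIDDEN))
p = rng.normal(size=(4, HIDDEN))
X = make_features(q, p)  # shape (4, 1536); fit Logistic Regression on X
```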
Concat
Fine-tuning BertForSequenceClassification from 🤗 Transformers on unified question/passage sequences separated by the [SEP] token.
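Conceptually, the single input sequence BERT sees for a pair looks like this (a string sketch only; the real 🤗 tokenizer produces token ids when passed the question and passage together):

```python
def build_input(question, passage):
    """Illustrative string form of a paired BERT input; the actual
    tokenizer emits ids for these special tokens, not text."""
    return f"[CLS] {question} [SEP] {passage} [SEP]"

s = build_input("is the sky blue", "The sky appears blue because of Rayleigh scattering.")
```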
- We randomly divide dev.jsonl into two equal-size splits for validation (valid.csv) and testing (test.csv).
- We use the nlpaug library for augmentations:
- SynonymAug for questions: substitutes words with their synonyms.
- BackTranslationAug for passages: translates text to another language and back.
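The dev split described above can be sketched as follows (a minimal illustration with an assumed seed, not the repository's exact splitting code):

```python
import json
import random

def split_dev(dev_path, seed=42):
    """Shuffle the dev set and divide it into two equal-size halves,
    one for validation and one for testing."""
    with open(dev_path) as f:
        examples = [json.loads(line) for line in f]
    random.Random(seed).shuffle(examples)
    half = len(examples) // 2
    return examples[:half], examples[half:]
```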
- EDA.ipynb - notebook with exploratory data analysis.
- Baselines
- baselines.py - code for running constant and FastText baselines.
- BERT
- bert_concat.py - training loop for BertForSequenceClassification.
- bert_separated.py - functions for fitting classifier on top of concatenated question/passage BERT embeddings.
- dataset.py - Dataset preparation (loading, tokenization, augmentations).
- models.py - wrapper for BertModel.
- configs - config files for the models.
- data - data files.
- outputs - logs and results of conducted experiments.
- utils.py - auxiliary functions.
- run_baselines.py, run_bert_concat.py, run_bert_separated.py - entry points for running the pipelines.
We use Hydra as the config manager. Fill in the configs and run the corresponding run files from the root directory.
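A Hydra config for, say, the fine-tuning pipeline might look like the fragment below (the field names are illustrative, not the repository's actual schema):

```yaml
# configs/bert_concat.yaml (illustrative field names)
model_name: bert-base-uncased
max_seq_length: 256
batch_size: 16
lr: 2e-5
epochs: 3
```

It would then be picked up by running the corresponding entry point, e.g. `python run_bert_concat.py`, from the root directory.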