princeton-nlp / mabel

EMNLP 2022: "MABEL: Attenuating Gender Bias using Textual Entailment Data" https://arxiv.org/abs/2210.14975

License: MIT License

Python 97.22% Jupyter Notebook 1.81% Shell 0.97%
contrastive-learning fairness natural-language-processing gender-bias

mabel's Introduction

MABEL: Attenuating Gender Bias using Textual Entailment Data

Authors: Jacqueline He, Mengzhou Xia, Christiane Fellbaum, Danqi Chen

This repository contains the code for our EMNLP 2022 paper, "MABEL: Attenuating Gender Bias using Textual Entailment Data".

MABEL (a Method for Attenuating Bias using Entailment Labels) is a task-agnostic intermediate pre-training technique that leverages entailment pairs from NLI data to produce representations which are both semantically capable and fair. This approach exhibits a good fairness-performance tradeoff across intrinsic and extrinsic gender bias diagnostics, with minimal damage on natural language understanding tasks.
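At its core, the method trains the encoder with a contrastive objective over entailment pairs: roughly, each premise is pulled toward its entailed hypothesis (and its gender-counterfactual copy) and pushed away from the other sentences in the batch. Below is a minimal, illustrative InfoNCE-style sketch of such an entailment-pair contrastive loss; it is not the exact MABEL objective, which is specified in full in the paper.

import torch
import torch.nn.functional as F

def entailment_contrastive_loss(premise_emb, hypothesis_emb, temperature=0.05):
    # premise_emb, hypothesis_emb: [batch_size, hidden_dim] sentence embeddings
    # for aligned (premise, hypothesis) entailment pairs.
    premise_emb = F.normalize(premise_emb, dim=-1)
    hypothesis_emb = F.normalize(hypothesis_emb, dim=-1)
    # Cosine similarities between every premise and every hypothesis in the batch;
    # the matching pair sits on the diagonal, the rest act as in-batch negatives.
    sim = premise_emb @ hypothesis_emb.t() / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)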

Training Schema


Quick Start

With the transformers package installed, you can load the off-the-shelf model like so:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/mabel-bert-base-uncased")

model = AutoModelForMaskedLM.from_pretrained("princeton-nlp/mabel-bert-base-uncased")
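The checkpoints are standard masked-language-modeling models, so sentence representations can be read off the encoder's hidden states. A minimal sketch follows; mean-pooling over the last hidden layer is an illustrative choice here, not necessarily the pooling strategy used in the paper.

import torch

inputs = tokenizer("The nurse treated the patient.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Mean-pool the final hidden layer into a single sentence vector
sentence_embedding = outputs.hidden_states[-1].mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for the bert-base model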

Model List

MABEL Models ICAT ↑
princeton-nlp/mabel-bert-base-uncased 73.98
princeton-nlp/mabel-bert-large-uncased 73.45
princeton-nlp/mabel-roberta-base 69.68
princeton-nlp/mabel-roberta-large 69.49

Note: The ICAT score is a bias metric that consolidates a model's capacity for language modeling and stereotypical association into a single numerical indicator. More information can be found in the StereoSet (Nadeem et al., 2021) paper.
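Concretely, StereoSet defines ICAT as lm * min(ss, 100 - ss) / 50: an unbiased model (SS = 50) keeps its full LM score, while a maximally stereotyped or anti-stereotyped model scores 0. A quick sanity check against the table above:

def icat(lm_score, ss_score):
    # StereoSet's Idealized CAT score: rewards language-modeling ability,
    # penalizes stereotype scores that deviate from 50
    return lm_score * min(ss_score, 100 - ss_score) / 50

print(round(icat(84.55, 56.25), 2))  # ~73.98, matching mabel-bert-base-uncased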

Training

Before training, make sure that the counterfactually-augmented NLI data, processed from SNLI and MNLI, is downloaded and stored under the training directory as entailment_data.csv.
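Here, "counterfactually-augmented" means each entailment pair also appears in a gender-swapped form. The toy sketch below illustrates the idea of that swap; the word list is purely illustrative and is not the mapping used to build entailment_data.csv.

# Toy gender-word swap for illustration only; the actual augmentation is more careful
SWAPS = {"he": "she", "she": "he", "him": "her", "his": "her",
         "her": "him", "man": "woman", "woman": "man"}

def counterfactual(sentence):
    return " ".join(SWAPS.get(token, token) for token in sentence.lower().split())

print(counterfactual("he paid his bill"))  # -> "she paid her bill"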

1. Install package dependencies

pip install -r requirements.txt

2. Run training script

cd training
chmod +x run.sh 
./run.sh

You can adjust the hyper-parameters in run.sh as needed. Models are saved to out/. The optimal set of hyper-parameters depends on the choice of backbone encoder; full training details can be found in the paper.

Evaluation

Intrinsic Metrics

If you use your own trained model instead of our provided HF checkpoint, you must first convert it to a standard BertForMaskedLM model before running the intrinsic evaluations: python -m training.convert_to_hf --path /path/to/your/checkpoint --base_model bert (use --base_model roberta to convert to a RobertaForMaskedLM model instead).
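Conceptually, the conversion loads your trained encoder's weights into a vanilla BertForMaskedLM and saves it in the standard Hugging Face format. The sketch below is only a rough illustration of that idea (the checkpoint structure and key remapping are hypothetical); the actual logic lives in training/convert_to_hf.py.

import torch
from transformers import BertForMaskedLM

# Rough illustration only; see training/convert_to_hf.py for the real conversion
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
state_dict = torch.load("/path/to/your/checkpoint", map_location="cpu")
# In practice the checkpoint's encoder weights are remapped onto the
# BertForMaskedLM parameter names before loading
model.load_state_dict(state_dict, strict=False)
model.save_pretrained("mabel-converted")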

Also, please note that we use the computation method and datasets of Meade et al. (2022) for both StereoSet and CrowS-Pairs; this is why the metrics for the pre-trained models are not directly comparable to those reported in the original benchmark papers.

1. StereoSet (Nadeem et al., 2021)

Command:

python -m benchmark.intrinsic.stereoset.predict --model_name_or_path princeton-nlp/mabel-bert-base-uncased && 
python -m benchmark.intrinsic.stereoset.eval

Output:

intrasentence
gender
Count: 2313.0
LM Score: 84.5453251710623
SS Score: 56.248299466465376
ICAT Score: 73.98003496789251

Collective Results:

Models LM ↑ SS ◇ ICAT ↑
bert-base-uncased 84.17 60.28 66.86
princeton-nlp/mabel-bert-base-uncased 84.54 56.25 73.98
bert-large-uncased 86.54 63.24 63.62
princeton-nlp/mabel-bert-large-uncased 84.93 56.76 73.45
roberta-base 88.93 66.32 59.90
princeton-nlp/mabel-roberta-base 87.44 60.14 69.68
roberta-large 88.81 66.82 58.92
princeton-nlp/mabel-roberta-large 89.72 61.28 69.49

◇: The closer to 50, the better.

2. CrowS-Pairs (Nangia et al., 2020)

Command:

python -m benchmark.intrinsic.crows.eval --model_name_or_path princeton-nlp/mabel-bert-base-uncased

Output:

====================================================================================================
Total examples: 262
Metric score: 50.76
Stereotype score: 51.57
Anti-stereotype score: 49.51
Num. neutral: 0.0
====================================================================================================

Collective Results:

Models Metric Score ◇
bert-base-uncased 57.25
princeton-nlp/mabel-bert-base-uncased 50.76
bert-large-uncased 55.73
princeton-nlp/mabel-bert-large-uncased 51.15
roberta-base 60.15
princeton-nlp/mabel-roberta-base 49.04
roberta-large 60.15
princeton-nlp/mabel-roberta-large 54.41

◇: The closer to 50, the better.

Extrinsic Metrics

  1. Occupation Classification

See benchmark/extrinsic/occ_cls/README.md for full training instructions and results.

  2. Natural Language Inference

See benchmark/extrinsic/nli/README.md for full training instructions and results.

  3. Coreference Resolution

See benchmark/extrinsic/coref/README.md for full training instructions and results.

Language Understanding

1. GLUE (Wang et al., 2018)

We fine-tune on GLUE through the transformers library, following the default hyper-parameters.

A straightforward way is to clone the latest transformers repository:

git clone https://github.com/huggingface/transformers
cd transformers
pip install .

Then set up the environment dependencies:

cd ./examples/pytorch/text-classification
pip install -r requirements.txt

Here is a sample script for one of the GLUE tasks, MRPC:

# task options: cola, sst2, mrpc, stsb, qqp, mnli, qnli, rte 
export TASK_NAME=mrpc
export OUTPUT_DIR=out/

CUDA_VISIBLE_DEVICES=0 python run_glue.py \
  --model_name_or_path princeton-nlp/mabel-bert-base-uncased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir $OUTPUT_DIR

2. SentEval Transfer Tasks (Conneau et al., 2018)

Preprocess:

Make sure you have cloned the SentEval repo and copied its contents into this repository's transfer folder; then run ./get_transfer_data.bash in data/downstream to download the evaluation data.

Command:

python -m benchmark.transfer.eval --model_name_or_path princeton-nlp/mabel-bert-base-uncased --task_set transfer

Output:

+-------+-------+-------+-------+-------+-------+-------+-------+
|   MR  |   CR  |  SUBJ |  MPQA |  SST2 |  TREC |  MRPC |  Avg. |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 78.33 | 85.83 | 93.78 | 89.13 | 85.50 | 85.20 | 68.87 | 83.81 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Collective Results:

Models Transfer Avg. ↑
bert-base-uncased 83.73
princeton-nlp/mabel-bert-base-uncased 83.81
bert-large-uncased 86.54
princeton-nlp/mabel-bert-large-uncased 86.09

Code Acknowledgements

Citation

@inproceedings{he2022mabel,
   title={{MABEL}: Attenuating Gender Bias using Textual Entailment Data},
   author={He, Jacqueline and Xia, Mengzhou and Fellbaum, Christiane and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2022}
}

mabel's Issues

Ask for datasets help

Hi, I read your article and found the experimental results very impressive. I learned that your training data came from MNLI and SNLI, but I couldn't find the specific preprocessing steps. Could you please provide the preprocessing code? Thank you very much!

Different results on Bias-NLI evaluation dataset

Hi,

For the extrinsic Natural Language Inference benchmark, I downloaded your provided princeton-nlp/mabel-bert-base-uncased checkpoint (mabel_checkpoint_best.pt) and ran the evaluation script you provided with the command:
python eval.py --model_name_or_path bert-base-uncased --load_from_file nli-mabel/mabel_checkpoint_best.pt --eval_data_path bias-nli/nli-dataset.csv
However, I did not obtain the same results as those you reported:

total net neutral:        0.9170128866319063
total fraction neutral:   0.9828041166841379
total tau 0.5 neutral:    0.9824839530908697
total tau 0.7 neutral:    0.9681385585408803

My results were as follows:

total net neutral:        0.872817846119612
total fraction neutral:   0.9396406253855012
total tau 0.5 neutral:    0.9385957916322645
total tau 0.7 neutral:    0.9039280814706394

I was wondering if there might be a mistake in my implementation or if others have reported similar results. Thank you for taking the time to read my message and any help you could offer would be greatly appreciated.

Running CrowS-Pairs multiple times yields different measurements

Hi,
I ran into a problem where running the CrowS-Pairs evaluation multiple times gives different results. After training the model from bert-base-cased and saving it locally, I ran the CrowS-Pairs evaluation several times and got a different score each time (see the attached figure).

Thank you very much for taking the time to answer my questions; your help is very valuable for my research.
