GithubHelp home page GithubHelp logo

fstermann / acl2022-self-contrastive-decorrelation Goto Github PK

View Code? Open in Web Editor NEW

This project forked from sap-samples/acl2022-self-contrastive-decorrelation

0.0 0.0 0.0 260 KB

Source code for ACL 2022 paper "Self-contrastive Decorrelation for Sentence Embeddings".

License: Apache License 2.0

Shell 0.13% Python 87.69% Jupyter Notebook 12.18%

acl2022-self-contrastive-decorrelation's Introduction

SCD: Self-Contrastive Decorrelation of Sentence Embeddings

made-with-python License REUSE status

News

  • 03/23/2022: ๐ŸŽŠ Source code provided ๐ŸŽ‰
  • 03/16/2022: Added arXiv pre-print, provided models for download

Description

This repository will contain the source code for our paper SCD: Self-Contrastive Decorrelation of Sentence Embeddings to be presented at ACL2022. The code is in parts based on the code from Huggingface Tranformers and the paper SimCSE: Simple Contrastive Learning of Sentence Embeddings.

Abstract

Schematic Illustration of SCD

In this paper, we propose Self-Contrastive Decorrelation, a self-supervised approach, which takes an input sentence and optimizes a joint self-contrastive and decorrelation objective, with only standard dropout. This simple method works surprisingly well, achieves comparable results with state-of-the-art methods on multiple benchmarks without using contrastive pairs. This study opens up avenues for efficient self-supervised learning methods that are more resilient to train on a small batch regime than current contrastive methods.

Authors:

Requirements

Related work

arXiv Download Model

Check out the pre-print of latest work for sentence similarity that facilitates learning embeddings even with minimal amounts of data: "miCSE: Mutual Information Contrastive Learning for Low-shot Sentence Embeddings"

Download and Installation

  1. Clone this repository
git clone https://github.com/SAP-samples/acl2022-self-contrastive-decorrelation scd
cd scd
  1. Install the requirements
pip install -r requirements.txt
  1. Download training data
cd data
sh download_wiki.sh
cd ..
  1. Download evaluation dataset
cd SentEval/data/downstream/
sh download_dataset.sh
cd ../../../
  1. ๐Ÿ”ฅ Patching transformers ๐Ÿ”ฅ

Leveraging multi-dropout for SCD requires patching the language model of Huggingface transformers. Doing is is rather simple. The portions of the code that have to be changed have been indicated with # SCD in the source code. Basically, you need to change the following modules in the specific language model file (with code links for the BERT language model):

To activate training with multi-dropout a couple of flags have to be set in the training script:

https://github.com/SAP-samples/acl2022-self-contrastive-decorrelation/blob/6824cffb99202b0eb6cdc5cad5eeede9f9da74e3/train.py#L430

The source code provided is compatible with version 4.10.0. Later versions should be pretty much identical in what needs to be adapted. In order to avoid complications, I recommend creating your existing environment, such that the modifications of transformer code have no effect on other projects (although, technically there should not be any issue). Information on cloning your environment with conda can be found here.

5.1. Clone Hugging Face transformers v.4.10.0 to the folder 'scd_transformers'

git clone -b v4.10.0 https://github.com/huggingface/transformers.git scd_transformers

5.2. Copy the files from the scd/transformers_v4.10 folder to the corresponding folder in scd_transformers (you might want to make a backup of the files that overwritten in scd_transformers)

cp transformers_v4.10/src/transformers/models/bert/modeling_bert.py ~/scd_transformers/src/transformers/models/bert/
cp transformers_v4.10/src/transformers/models/roberta/modeling_roberta.py ~/scd_transformers/src/transformers/models/roberta/

5.3. Install transformers

cd scd_transformers
pip install -e .

Training and Evaluation

  1. Training BERT with settings as in paper
python train.py --do_train --load_best_model_at_end --fp16 --overwrite_output_dir --description=SCD --eval_steps=250 --evaluation_strategy=steps --hidden_dropout_prob=0.05 --hidden_dropout_prob_noise=0.155 --learning_rate=3e-05 --max_seq_length=32 --metric_for_best_model=sickr_spearman --model_name_or_path=bert-base-uncased --num_train_epochs=1 --output_dir=result --per_device_train_batch_size=192 --report_to=wandb --save_total_limit=0 --task_alpha=1 --task_beta=0.005225 --task_lambda=0.012 --temp=0.05 --train_file=data/wiki1m_for_simcse.txt

Note: If you want to use a language model that has a different embedding dimensionality as BERT-base-uncased/RoBERTa-base (=784), you also need to change the projector input dimensionality accordingly using the argument: --embedding_dim

  1. Convert model to Huggingface format
python scd_to_huggingface.py --path result/<model directory>
  1. Evaluate the model
python evaluation.py --pooler cls_before_pooler --task_set sts --mode test --model_name_or_path result/<model directory>

or if you also want the transfer tasks to be evaluated (takes a bit of time)

python evaluation.py --pooler cls_before_pooler --task_set full --mode test --model_name_or_path result/<model directory>

Language Models

Language models trained for which the performance is reported in the paper are available at the Huggingface Model Repository:

Loading the model in Python. Just place in the model name as indicated above, e.g., sap-ai-research/BERT-base-uncased-SCD-ACL2022.

tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/<----Enter Model Name---->")

model = AutoModelWithLMHead.from_pretrained("sap-ai-research/<----Enter Model Name---->")

With these models one should be able to reproduce the results on the benchmarks are reported in the paper:

For BERT-base-uncased:

+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 66.94 | 78.03 | 69.89 | 78.73 | 76.23 |    76.30     |      73.18      | 74.19 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
+-------+-------+-------+-------+-------+-------+-------+-------+
|   MR  |   CR  |  SUBJ |  MPQA |  SST2 |  TREC |  MRPC |  Avg. |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 79.26 | 84.85 | 93.57 | 89.13 | 85.23 | 83.60 | 74.84 | 84.35 |
+-------+-------+-------+-------+-------+-------+-------+-------+

For RoBERTa-base:

+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 63.53 | 77.79 | 69.79 | 80.21 | 77.29 |    76.55     |      72.10      | 73.89 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
+-------+-------+-------+-------+-------+-------+-------+-------+
|   MR  |   CR  |  SUBJ |  MPQA |  SST2 |  TREC |  MRPC |  Avg. |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 82.17 | 87.76 | 93.67 | 85.69 | 88.19 | 83.40 | 76.23 | 85.30 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Citations

If you use this code in your research or want to refer to our work, please cite:

@inproceedings{klein2022scd,
    title={SCD: Self-Contrastive Decorrelation for Sentence Embeddings},
     author = "Klein, Tassilo  and
      Nabi, Moin",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    publisher = "Association for Computational Linguistics (ACL)",
}

How to obtain support

Create an issue in this repository if you find a bug or have questions about the content.

For additional support, ask a question in SAP Community.

Contributing

If you wish to contribute code, offer fixes or improvements, please send a pull request. Due to legal reasons, contributors will be asked to accept a DCO when they create the first pull request to this project. This happens in an automated fashion during the submission process. SAP uses the standard DCO text of the Linux Foundation.

License

Copyright (c) 2022 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the LICENSE file.

acl2022-self-contrastive-decorrelation's People

Contributors

btbernard avatar tjklein avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.