This project forked from styfeng/dataaug4nlp

Collection of papers and resources for data augmentation for NLP.

Home Page: https://arxiv.org/abs/2105.03075

Data Augmentation Techniques for NLP

If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.

We group the papers by text classification, generation, translation, question answering, summarization, sequence tagging, parsing, grammatical error correction, dialogue, multimodal, mitigating bias, mitigating class imbalance, adversarial examples, and compositionality.

This repository is based on our paper, "A Survey of Data Augmentation Approaches for NLP" (Findings of ACL '21). You can cite it as follows:

@article{feng2021survey,
  title={A Survey of Data Augmentation Approaches for NLP},
  author={Feng, Steven Y and Gangal, Varun and Wei, Jason and Chandar, Sarath and Vosoughi, Soroush and Mitamura, Teruko and Hovy, Eduard},
  journal={Findings of ACL},
  year={2021}
}

Authors: Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy

Note: Inquiries should be sent to [email protected] or raised by opening an issue here.

Text Classification

| Paper | Datasets |
| --- | --- |
| Synonym Replacement (Character-Level Convolutional Networks for Text Classification, NeurIPS '15) | AG’s News, DBPedia, Yelp, Yahoo Answers, Amazon |
| That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets (EMNLP '15) | Twitter |
| Robust Training under Linguistic Adversity (EACL '17) code | Movie review, customer review, SUBJ, SST |
| Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations (NAACL '18) code | SST, SUBJ, MPQA, RT, TREC |
| Variational Pretraining for Semi-supervised Text Classification (ACL '19) code | IMDB, AG News, Yahoo, hatespeech |
| EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (EMNLP '19) code | SST, CR, SUBJ, TREC, PC |
| Nonlinear Mixup: Out-Of-Manifold Data Augmentation for Text Classification (AAAI '20) | TREC, SST, Subj, MR |
| MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification (ACL '20) code | AG News, DBpedia, Yahoo, IMDb |
| Unsupervised Data Augmentation for Consistency Training (NeurIPS '20) code | Yelp, IMDb, Amazon, DBpedia |
| Not Enough Data? Deep Learning to the Rescue! (AAAI '20) | ATIS, TREC, WVA |
| SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code | IWSLT '14 |
| Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation (EMNLP '20) | ICWSM '20 Data Challenge, SemEval '17 sentiment analysis, SemEval '18 irony |
| Textual Data Augmentation for Efficient Active Learning on Tiny Datasets (EMNLP '20) | SST2, TREC |
| Text Augmentation in a Multi-Task View (EACL '21) | SST2, TREC, SUBJ |
| Few-Shot Text Classification with Triplet Loss, Data Augmentation, and Curriculum Learning (NAACL '21) code | HUFF, COV-Q, AMZN, FEWREL |

Natural Language Generation

| Paper | Datasets |
| --- | --- |
| GenAug: Data Augmentation for Finetuning Text Generators (DeeLIO @ EMNLP '20) code | TODO |

Translation

| Paper | Datasets |
| --- | --- |
| Backtranslation (Improving Neural Machine Translation Models with Monolingual Data, ACL '16) | WMT '15 en-de, IWSLT '15 en-tr |
| SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) | IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de |
| Soft Contextual Data Augmentation for Neural Machine Translation (ACL '19) code | IWSLT '14 de/es/he-en, WMT '14 en-de |

Question Answering

| Paper | Datasets |
| --- | --- |
| An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering (EMNLP '19 Workshop) | MRQA |
| Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering (arxiv '19) | SQuAD, Trivia-QA, CMRC, DRCD |
| XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering (arxiv '19) | XNLI, SQuAD |
| Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering (arxiv '20) | MLQA, XQuAD, SQuAD-it, PIAF |
| Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (ACL '20) code | WIQA, QuaRel, HotpotQA |

Summarization

| Paper | Datasets |
| --- | --- |
| Transforming Wikipedia into Augmented Data for Query-Focused Summarization (arxiv '19) | DUC |
| Iterative Data Augmentation with Synthetic Data (Abstract Text Summarization: A Low Resource Challenge, EMNLP '19) | Swisstext, commoncrawl |
| Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation (NAACL '21) | CNN-DailyMail |

Sequence Tagging

| Paper | Datasets |
| --- | --- |
| Data Augmentation via Dependency Tree Morphing for Low-Resource Languages (EMNLP '18) code | Universal Dependencies project |

Parsing

TODO: https://www.aclweb.org/anthology/2020.emnlp-main.107/

Grammatical Error Correction

| Paper | Datasets |
| --- | --- |
| Using Wikipedia Edits in Low Resource Grammatical Error Correction (WNUT @ EMNLP '18) | Falko-MERLIN GEC Corpus |
| Sequence-to-sequence Pre-training with Data Augmentation for Sentence Rewriting (arxiv '19) | CoNLL-2014, JFLEG |
| SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) | IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de |
| Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data (BEA @ ACL '19) | FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task) |
| A Neural Grammatical Error Correction System Built on Better Pre-training and Sequential Transfer Learning (BEA @ ACL '19) | FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task), Gutenberg, Tatoeba, WikiText-103 (Pretraining) |
| Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation (COLING '20) | FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task) |
| Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction (NAACL '18) | Lang-8, CoNLL-2014, CoNLL-2013, JFLEG |
| Corpora Generation for Grammatical Error Correction (NAACL '19) | CoNLL-2014, JFLEG, Lang-8 |

Dialogue

Multimodal

Mitigating Bias

Mitigating Class Imbalance

Adversarial Examples

| Paper | Datasets |
| --- | --- |
| Adversarial Example Generation with Syntactically Controlled Paraphrase Networks (NAACL '18) | SST, SICK |

TODO: add TextAttack

Compositionality

| Paper | Datasets |
| --- | --- |
| Good-Enough Compositional Data Augmentation (ACL '20) code | SCAN |
| Sequence-Level Mixed Sample Data Augmentation (EMNLP '20) code | SCAN |

Popular Resources
