nyu-mll / jiant

jiant is an NLP toolkit

Home Page: https://jiant.info

License: MIT License

Languages: Python 98.74%, Shell 1.26%
Topics: nlp, sentence-representation, bert, multitask-learning, transformers, transfer-learning

jiant's Introduction

🚨Update🚨: As of 2021/10/17, the jiant project is no longer being actively maintained. This means there are no plans to add new models, tasks, or features, or to update support for newer libraries.

jiant is an NLP toolkit

The multitask and transfer learning toolkit for natural language processing research


Why should I use jiant?

jiant is a multitask and transfer learning toolkit that supports a wide range of natural language understanding tasks and benchmarks, including GLUE and SuperGLUE.

A few additional things you might want to know about jiant:

  • jiant is configuration file driven
  • jiant is built with PyTorch
  • jiant integrates with datasets to manage task data
  • jiant integrates with transformers to manage models and tokenizers.
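
For a sense of what those integrations mean, the snippet below shows the underlying Hugging Face calls that jiant wraps. This is an illustration of the datasets and transformers libraries themselves, not jiant's own API (the Quick Introduction below shows that):

# Illustration of the Hugging Face libraries jiant builds on (not jiant's API).
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

dataset = load_dataset("glue", "mrpc")                     # task data
tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # tokenizer
model = AutoModel.from_pretrained("roberta-base")          # encoder weights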

Getting Started

Installation

To import jiant from source (recommended for researchers):

git clone https://github.com/nyu-mll/jiant.git
cd jiant
pip install -r requirements.txt

# Add the following to your .bashrc or .bash_profile
export PYTHONPATH=/path/to/jiant:$PYTHONPATH

If you plan to contribute to jiant, install additional dependencies with pip install -r requirements-dev.txt.

To install jiant from source (alternative for researchers):

git clone https://github.com/nyu-mll/jiant.git
cd jiant
pip install -e .

To install jiant from pip (recommended if you just want to train/use a model):

pip install jiant

We recommend installing jiant in a virtual environment or a conda environment.

To check that jiant was installed correctly, run a simple example.
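
As a minimal sanity check (assuming jiant was installed via pip, or the repository root is on your PYTHONPATH), the import below should succeed:

# Minimal sanity check: this should run without errors.
import jiant
print("jiant imported successfully")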

Quick Introduction

The following example fine-tunes a RoBERTa model on the MRPC dataset.

Python version:

from jiant.proj.simple import runscript as run
import jiant.scripts.download_data.runscript as downloader

EXP_DIR = "/path/to/exp"

# Download the Data
downloader.download_data(["mrpc"], f"{EXP_DIR}/tasks")

# Set up the arguments for the Simple API
args = run.RunConfiguration(
   run_name="simple",
   exp_dir=EXP_DIR,
   data_dir=f"{EXP_DIR}/tasks",
   hf_pretrained_model_name_or_path="roberta-base",
   tasks="mrpc",
   train_batch_size=16,
   num_train_epochs=3
)

# Run!
run.run_simple(args)

Bash version:

EXP_DIR=/path/to/exp

python jiant/scripts/download_data/runscript.py \
    download \
    --tasks mrpc \
    --output_path ${EXP_DIR}/tasks
python jiant/proj/simple/runscript.py \
    run \
    --run_name simple \
    --exp_dir ${EXP_DIR}/ \
    --data_dir ${EXP_DIR}/tasks \
    --hf_pretrained_model_name_or_path roberta-base \
    --tasks mrpc \
    --train_batch_size 16 \
    --num_train_epochs 3

Examples of more complex training workflows are found here.

Contributing

The jiant project's contributing guidelines can be found here.

Looking for jiant v1.3.2?

jiant v1.3.2 has been moved to jiant-v1-legacy to support ongoing research with the library. jiant v2.x.x is more modular and scalable than jiant v1.3.2 and has been designed to reflect the needs of the current NLP research community. We strongly recommend that any new projects use jiant v2.x.x.

jiant 1.x has been used in several papers. For instructions on how to reproduce papers by jiant authors that refer readers to this site for documentation (including Tenney et al., Wang et al., Bowman et al., Kim et al., Warstadt et al.), refer to the jiant-v1-legacy README.

Citation

If you use jiant ≥ v2.0.0 in academic work, please cite it directly:

@misc{phang2020jiant,
    author = {Jason Phang and Phil Yeres and Jesse Swanson and Haokun Liu and Ian F. Tenney and Phu Mon Htut and Clara Vania and Alex Wang and Samuel R. Bowman},
    title = {\texttt{jiant} 2.0: A software toolkit for research on general-purpose text understanding models},
    howpublished = {\url{http://jiant.info/}},
    year = {2020}
}

If you use jiant ≤ v1.3.2 in academic work, please use the citation found here.

Acknowledgments

  • This work was made possible in part by a donation to NYU from Eric and Wendy Schmidt made by recommendation of the Schmidt Futures program, and by support from Intuit Inc.
  • We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan V GPU used at NYU in this work.
  • Developer Jesse Swanson is supported by the Moore-Sloan Data Science Environment as part of the NYU Data Science Services initiative.

License

jiant is released under the MIT License.

jiant's People

Contributors

anhad13, berlinchen7, bordias, dependabot[bot], edouardgrave, epavlick, haokunliu, hyinghui, iftenney, iwontbecreative, jeswan, kelina, najoungkim, narsil, njjiang, pappagari, phu-pmh, pitrack, pyeres, roma-patel, sheng-fu, shuningjin, sleepinyourhat, tommccoy1, w4ngatang, wh629, woollysocks, yukatherin, yzpang, zphang


jiant's Issues

Final reported number in eval phase may not match best reported number

The cola_mcc value in the bottom row shouldn't be lower than the best value reported during validation. This is a run trained on mnli-fiction and evaluated on cola.

/nfs/jsalt/exp/sam-gpu1b-4/tuning_fine_tuning/cola_st/log.log

***** VALIDATION RESULTS *****
cola_mcc, 17, cola_loss: 0.65820, macro_avg: 0.27127, micro_avg: 0.27127, cola_mcc: 0.27127, cola_accuracy: 0.66059
micro_avg, 17, cola_loss: 0.65820, macro_avg: 0.27127, micro_avg: 0.27127, cola_mcc: 0.27127, cola_accuracy: 0.66059
macro_avg, 17, cola_loss: 0.65820, macro_avg: 0.27127, micro_avg: 0.27127, cola_mcc: 0.27127, cola_accuracy: 0.66059
Loaded model state from /nfs/jsalt/exp/sam-gpu1b-4/tuning_fine_tuning/cola_st/model_state_eval_best.th
Evaluating...
micro_accuracy: 0.382, macro_accuracy: 0.351, mnli-fiction_accuracy: 0.452, cola_mcc: 0.250, cola_accuracy: 0.711

Parameters for sequence generation and LM tasks won't fine-tune.

This is not really an issue if we're only fine-tuning on GLUE tasks, but:

At fine-tuning time, we get the parameters to fine-tune by looking up the model attribute named "%s_mdl" % task.name, but for several types of tasks the attributes to be trained have different names, e.g. "%s_decoder".

Suggested solution: wrap the components in an nn.Module so their parameters are easy to collect (see the sketch below).
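
A minimal sketch of that suggestion in plain PyTorch; the class and attribute names here are hypothetical, not jiant's actual ones:

import torch.nn as nn

# Hypothetical sketch: group each task's trainable pieces in one nn.Module so
# their parameters can be collected uniformly, regardless of attribute naming.
class TaskComponents(nn.Module):
    def __init__(self, decoder: nn.Module, scorer: nn.Module):
        super().__init__()
        self.decoder = decoder
        self.scorer = scorer

# All trainable parameters are then reachable via components.parameters(),
# whether the task uses a "%s_mdl"-style head or a "%s_decoder".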

Test mode

Feature that could save us a lot of grief later today: Set up a flag to load no more than ~1000 (or some other set number) examples from each data file. This will let people make sure that it's possible to train the standard model on their tasks without waiting through potentially slow preprocessing first.
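
A minimal sketch of what such a flag could look like; the function and flag names are hypothetical:

MAX_TEST_MODE_EXAMPLES = 1000

# Hypothetical sketch: cap the number of examples read per data file when a
# quick "test mode" flag is enabled.
def load_examples(path, test_mode=False):
    examples = []
    with open(path) as f:
        for i, line in enumerate(f):
            if test_mode and i >= MAX_TEST_MODE_EXAMPLES:
                break
            examples.append(line.rstrip("\n"))
    return examples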

wikiedits pretraining task

A modified seq2seq task that feeds the sentence encoding plus a pointer to the insertion location into an MLP before decoding, in order to predict inserted spans at arbitrary points in the sentence.

find sensible transformer params

There are a few transformer parameters we've been avoiding (projection dimension, feedforward dimension) by setting them all to the same value, in addition to the training hyperparameters. We should probably go through what literature there is and find sensible defaults.

Write a shell script to extract a spreadsheet line from a log file

Desired output:
$ extract_results.sh /nfs/jsalt/exp/sam-/tuning_/*/log.log
Ellie 7/2/2018 mnli N Y N 0.2 58.8 59.0 2.8 82.7 71.1 76.4 53.1 51.1 62.2 77.3 56 43.7 81.5 79.5 gs://jsalt-models/ellie_runs/take2_base_transformer_lowlr
Ellie 7/2/2018 mnli N Y N 0.2 57.8 60.5 0 83.6 71.3 76.8 56.9 54.6 61.4 76.8 56 56.3 81.8 79.6 gs://jsalt-models/ellie_runs/take2_base_transformer_lowwarm
...

The results can then be pasted into the spreadsheet directly. Ideally it should also be able to extract results from runs that have finished some but not all GLUE tasks.
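
Sketched below in Python for concreteness (the real script could be plain shell as requested); the metric patterns and output format are assumptions and would need to match whatever the spreadsheet expects:

import re
import sys

# Hypothetical sketch: pull the final metric line out of each log file passed
# on the command line and print one tab-separated row per run.
for log_path in sys.argv[1:]:
    with open(log_path) as f:
        metric_lines = [line for line in f
                        if re.search(r"\w+_(accuracy|mcc|corr): [0-9.]+", line)]
    if not metric_lines:
        continue
    values = re.findall(r"[0-9.]+", metric_lines[-1])
    print(log_path, *values, sep="\t")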

Checkpoints are huge.

100MB for the demo setup. Are we saving anything we shouldn't, like the ELMo char layer?
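
One quick way to check, assuming the checkpoint is a plain state dict saved with torch.save (the file name below is only an example taken from the logs above):

import torch

# Hypothetical sketch: list the largest tensors in a saved state dict to see
# what dominates the checkpoint size (e.g. ELMo character layers).
state = torch.load("model_state_eval_best.th", map_location="cpu")  # assumed path
sizes = {k: v.numel() * v.element_size() for k, v in state.items() if torch.is_tensor(v)}
for name, nbytes in sorted(sizes.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{name}: {nbytes / 1e6:.1f} MB")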

bidirectional rnn language model

Using the standard PyTorch RNN won't work for multi-layer bidirectional language models because the directions are aggregated between layers. See the ELMo LSTM for an example of how to do this (and maybe for an easy workaround).
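
A minimal sketch of the workaround: keep the two directions in separate per-layer LSTMs so they never mix between layers. All names here are hypothetical and this is not the ELMo implementation itself:

import torch
import torch.nn as nn

class BiLMEncoder(nn.Module):
    """Hypothetical sketch: per-layer forward/backward LSTMs kept separate,
    unlike nn.LSTM(num_layers>1, bidirectional=True), which mixes directions."""

    def __init__(self, input_dim, hidden_dim, num_layers):
        super().__init__()
        self.fwd = nn.ModuleList()
        self.bwd = nn.ModuleList()
        for i in range(num_layers):
            in_dim = input_dim if i == 0 else hidden_dim
            self.fwd.append(nn.LSTM(in_dim, hidden_dim, batch_first=True))
            self.bwd.append(nn.LSTM(in_dim, hidden_dim, batch_first=True))

    def forward(self, x):
        fwd_out, bwd_out = x, torch.flip(x, dims=[1])  # reverse time for the backward LM
        for fwd_layer, bwd_layer in zip(self.fwd, self.bwd):
            fwd_out, _ = fwd_layer(fwd_out)
            bwd_out, _ = bwd_layer(bwd_out)
        return fwd_out, torch.flip(bwd_out, dims=[1])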

train_tasks = glue, error

The config file defaults2.conf is attached (delete the ".txt" suffix to use it):
defaults2.conf.txt

It works fine if train_tasks = "sst,mrpc,rte".
When train_tasks = glue, we get this error:

Traceback (most recent call last):
  File "src/main.py", line 207, in <module>
    main(sys.argv[1:])
  File "src/main.py", line 156, in main
    args.shared_optimizer, args.load_model, phase="main")
  File "/home/sa_112949933820817028848/jiant/src/trainer.py", line 333, in train
    output_dict = self._forward(batch, task=task, for_training=True)
  File "/home/sa_112949933820817028848/jiant/src/trainer.py", line 604, in _forward
    return self._model.forward(task, tensor_batch)  # , **tensor_batch)
  File "/home/sa_112949933820817028848/jiant/src/models.py", line 351, in forward
    out = self._single_sentence_forward(batch, task)
  File "/home/sa_112949933820817028848/jiant/src/models.py", line 385, in _single_sentence_forward
    task.scorer1(labels, preds.data.cpu().numpy())
TypeError: __call__() takes 2 positional arguments but 3 were given

Scheduler params

Two issues

  • when using the Transformer, we use the NoamScheduler, but that should take a step every update, not every validation check
  • when using ReduceLROnPlateau, we assume 'max', but when training some tasks on their own we want to minimize the metric (e.g. perplexity); see the sketch below
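
A minimal sketch of the second point, assuming we know per task whether its stopping metric should decrease; the helper name is hypothetical:

import torch

# Hypothetical sketch: pick the ReduceLROnPlateau mode from the task's metric
# direction instead of always assuming 'max'.
def build_plateau_scheduler(optimizer, metric_should_decrease):
    return torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode="min" if metric_should_decrease else "max",  # "min" for e.g. perplexity
        factor=0.5,
        patience=2,
    )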

Delete checkpoints when starting training from scratch

If LOAD_MODEL is 0, all checkpoints should be deleted. Otherwise, you can wind up in an odd setup (easy to hit with demo experiments):

  1. Train model for 10 epochs with settings S.
  2. Save model checkpoints for epochs 1–10.
  3. Train new model for 5 epochs with settings S', and LOAD_MODEL=0.
  4. Save model checkpoints for epochs 1–5, overwriting old checkpoints.
  5. Try to continue training the new model with LOAD_MODEL=1. Wind up using settings S' but loading the highest-numbered epoch checkpoint, which was created with settings S. Crash.
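
A minimal sketch of the proposed behavior; the checkpoint filename pattern and helper name are assumptions:

import glob
import os

# Hypothetical sketch: when LOAD_MODEL is 0, clear any stale checkpoints in the
# run directory before training starts.
def maybe_clear_checkpoints(run_dir, load_model):
    if not load_model:
        for path in glob.glob(os.path.join(run_dir, "*_state_*.th")):  # assumed pattern
            os.remove(path)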

AllenNLP vocabulary warning.

See if this is worth worrying about:

06/18 06:07:17 PM: Your label namespace was 'idxs'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary. See documentation for non_padded_namespaces parameter in Vocabulary.
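
If it does matter, the warning points at the fix: declare the namespace as non-padded when the vocabulary is built. A minimal sketch against the AllenNLP API of that era (the exact call site in jiant may differ):

from allennlp.data import Vocabulary

# Hypothetical sketch: mark 'idxs' as a non-padded namespace so AllenNLP does
# not add UNK/PAD tokens to it (keeping the default non-padded namespaces too).
vocab = Vocabulary(non_padded_namespaces=["*tags", "*labels", "idxs"])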

STS-B classifier is broken

06/27 10:18:45 AM: Beginning training. Stopping metric: sts-b_corr
Traceback (most recent call last):
  File "main.py", line 185, in <module>
    sys.exit(main(sys.argv[1:]))
  File "main.py", line 162, in main
    args.shared_optimizer, load_model=False, phase="eval")
  File "/Users/Bowman/Drive/JSALT/jiant/src/trainer.py", line 275, in train
    output_dict = self._forward(batch, task=task, for_training=True)
  File "/Users/Bowman/Drive/JSALT/jiant/src/trainer.py", line 521, in _forward
    return self._model.forward(task, tensor_batch)  # , **tensor_batch)
  File "/Users/Bowman/Drive/JSALT/jiant/src/models.py", line 355, in forward
    out = self._pair_regression_forward(batch, task)
  File "/Users/Bowman/Drive/JSALT/jiant/src/models.py", line 433, in _pair_regression_forward
    logits = classifier(s1, s2, s1_mask, s2_mask)
  File "/Users/Bowman/anaconda3/envs/jiant/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() takes 2 positional arguments but 5 were given

@hyinghui, @W4ngatang, anyone else who's seen this code—could you take a look?

TensorboardX support?

Will be useful to monitor / debug models while training. Not sure how much work this is.
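
A minimal sketch of what this could look like with tensorboardX; names and values here are illustrative only:

from tensorboardX import SummaryWriter

# Hypothetical sketch: write validation metrics to TensorBoard during training.
writer = SummaryWriter("runs/demo")
writer.add_scalar("val/cola_mcc", 0.271, global_step=17)  # illustrative values
writer.close()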

Runs can fail when started in quick succession with shared exp_name. Race condition?

One recent example:

07/02 05:17:40 PM: Fatal error in main():
Traceback (most recent call last):
  File "main.py", line 207, in <module>
    main(sys.argv[1:])
  File "main.py", line 105, in main
    train_tasks, eval_tasks, vocab, word_embs = build_tasks(args)
  File "/home/sbowman/jiant/src/preprocess.py", line 170, in build_tasks
    load_pkl=bool(not args.reload_tasks))
  File "/home/sbowman/jiant/src/preprocess.py", line 298, in get_tasks
    os.mkdir(task_scratch_path)
FileExistsError: [Errno 17] File exists: '/misc/vlgscratch4/BowmanGroup/sbowman/exp//main-random/SST-2/'

Solved by restarting the later runs.
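
A more robust fix would be to tolerate the directory already existing. A minimal sketch (the path is an example; the variable name comes from the traceback above):

import os

task_scratch_path = "/path/to/exp/main-random/SST-2"  # example path

# Hypothetical sketch: avoid the mkdir race by allowing the directory to
# already exist when two runs start at once.
os.makedirs(task_scratch_path, exist_ok=True)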

separate training parameters for train tasks and eval tasks

We want to allow more fine-grained control over the training parameters used for the main tasks we're training on versus the auxiliary tasks we're evaluating on.

If we're not training on multiple tasks, it would probably make sense to switch to a deterministic trainer that validates after passing through the entire training set (as opposed to after a fixed number of batches).

My model is too big.

350M+ parameters with ELMo and attention, many of them in places where they don't seem that useful.
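
A quick way to see where the parameters are going; the helper is hypothetical and takes the assembled multi-task model as input:

from collections import Counter

def parameter_counts_by_module(model):
    """Hypothetical sketch: parameters per top-level submodule, in millions."""
    counts = Counter()
    for name, param in model.named_parameters():
        counts[name.split(".")[0]] += param.numel()
    return {module: n / 1e6 for module, n in counts.most_common()}

# Usage (with the assembled multi-task model):
#   print(parameter_counts_by_module(model))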

Worry about storage

Preprocessed copies of some datasets (LM, MT) are very large. It's not a problem to store a few copies, but we'll use up our NFS volume very quickly if every run produces a ~70G copy of the data.

Proposal: implement a global preprocessing directory that stores a single copy, and have the trainer look there if a local copy is not found. exp_dir already provides a mechanism for this, but it's difficult to share one directory across multiple workers.

PyTorch clip warning

Fix this:

.../src/trainer.py:258: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
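
The fix is just the renamed in-place function. A minimal sketch (the max_norm value and the wrapper are placeholders):

from torch.nn.utils import clip_grad_norm_

# Replace the deprecated clip_grad_norm with clip_grad_norm_ (note the trailing
# underscore); usage is otherwise the same.
def clip_gradients(model, max_norm=5.0):
    clip_grad_norm_(model.parameters(), max_norm)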

Worry about speed

Not top priority, but the largest model gets about 150 steps per minute, so a large training run (500k steps) could take two or three days. If anyone has spare bandwidth, do some CPU profiling and make sure we're not wasting time on anything. If you're very bored, try some GPU profiling too, though I doubt there's much to optimize there.
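
A minimal sketch of CPU profiling around a short stretch of training; run_training_steps is a placeholder for whatever drives the loop:

import cProfile
import pstats

# Hypothetical sketch: profile ~100 training steps and print the top hotspots.
profiler = cProfile.Profile()
profiler.enable()
run_training_steps(100)  # placeholder for a short training loop
profiler.disable()
pstats.Stats(profiler).sort_stats("cumtime").print_stats(20)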

Crashing when trying to use only char embs in demo.sh

bash ./demo.sh -w none
[...]
Traceback (most recent call last):
  File "../src/main.py", line 236, in <module>
    sys.exit(main(sys.argv[1:]))
  File "../src/main.py", line 186, in main
    args.shared_optimizer, args.load_model)
  File "/Users/Bowman/Drive/JSALT/jiant/src/trainer.py", line 273, in train
    output_dict = self._forward(batch, task=task, for_training=True)
  File "/Users/Bowman/Drive/JSALT/jiant/src/trainer.py", line 504, in _forward
    return self._model.forward(task, tensor_batch)  # , **tensor_batch)
  File "/Users/Bowman/Drive/JSALT/jiant/src/models.py", line 291, in forward
    out = self._single_classification_forward(batch, task)
  File "/Users/Bowman/Drive/JSALT/jiant/src/models.py", line 309, in _single_classification_forward
    sent_embs, sent_mask = self.sent_encoder(batch['input1'])
  File "/Users/Bowman/anaconda3/envs/jiant/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/Bowman/Drive/JSALT/jiant/src/modules.py", line 75, in forward
    sent_embs = self._highway_layer(self._text_field_embedder(sent))
  File "/Users/Bowman/anaconda3/envs/jiant/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/Bowman/anaconda3/envs/jiant/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 63, in forward
    raise ConfigurationError(message)
allennlp.common.checks.ConfigurationError: "Mismatched token keys: dict_keys(['chars']) and dict_keys(['words', 'chars'])"

Clarify `epoch`

Epoch is used in a few places to mean the number of validations done so far, which is pretty confusing. Rename it and/or add some documentation?
