Code for the paper [Evaluating How Fine-tuning on Bimodal Data Effects Code Generation](https://arxiv.org/abs/2211.07842).
- Clone this repo with `git clone https://github.com/gabeorlanski/springresearch.git`
- Install the requirements with `pip install -r requirements.txt`
- Install these Python libraries from their repositories:
Configs for this project use the [Hydra](https://hydra.cc/) framework. The main configs are located in the `conf` directory. The two most important ones are `train_config.yaml` and `eval_config.yaml`, used by `train.py` and `evaluate.py` respectively. Finally, `training_args.yaml` holds the training args that correspond to HuggingFace's `Seq2SeqTrainingArguments`. This file is loaded automatically into `train_config.yaml`.
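
Hydra also lets you override any of these config values from the command line with `key=value` syntax. A minimal sketch (the script and argument names come from this README; `facebook/bart-base` is only a placeholder model name):

```bash
# Train with the defaults from conf/train_config.yaml,
# overriding the model (placeholder value):
python train.py model=facebook/bart-base

# Evaluate with the defaults from conf/eval_config.yaml:
python evaluate.py
```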
There are a few intricacies/things of note for how configs are parsed (`.` indicates hierarchy in the YAML configs):
- To set a batch size, set `training.batch_size` and it will set the corresponding HuggingFace training arguments: `per_device_train_batch_size` and `per_device_eval_batch_size`.
- The `model_type` argument for `train_config.yaml` is `seq2seq`. This selects HuggingFace's `AutoModelForSeq2SeqLM`. The model name passed to the `model` argument must be valid for the corresponding `model_type`. The currently supported `model_type` values are:
  - `seq2seq` --> `AutoModelForSeq2SeqLM`
  - `causal_lm` --> `AutoModelForCausalLM`

  NOTE: `causal_lm` will also add a postprocessor that removes the prompt/context from the generated output.
- To load a checkpoint into training (instead of starting from a HF checkpoint), set the argument `is_checkpoint=true` (see the example commands after this list).
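
To make the options above concrete, here is a hedged sketch of what these overrides could look like (the argument names come from this README; `gpt2` and the checkpoint path are hypothetical placeholders):

```bash
# Fine-tune a causal LM with an overridden batch size;
# training.batch_size sets both per_device_train_batch_size
# and per_device_eval_batch_size.
python train.py model_type=causal_lm model=gpt2 training.batch_size=16

# Load a previously saved checkpoint instead of starting from a
# HF checkpoint (the path is a hypothetical placeholder):
python train.py model=outputs/checkpoint-1000 is_checkpoint=true
```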
If you use this code, please cite:

```bibtex
@article{orlanski2022evaluating,
  title={Evaluating How Fine-tuning on Bimodal Data Effects Code Generation},
  author={Orlanski, Gabriel and Yang, Seonhye and Healy, Michael},
  journal={arXiv preprint arXiv:2211.07842},
  year={2022}
}
```