Code for the paper [Evaluating How Fine-tuning on Bimodal Data Effects Code Generation](https://arxiv.org/abs/2211.07842).
- Clone this repo with `git clone https://github.com/gabeorlanski/springresearch.git`
- Install the requirements with `pip install -r requirements.txt`
- Install these Python libraries from their repositories:
Configs for this project use the [Hydra](https://hydra.cc/) framework. The main configs are located in the `conf` directory. The two most important ones are `train_config.yaml` and `eval_config.yaml`, used by `train.py` and `evaluate.py` respectively. Finally, `training_args.yaml` holds the training args that correspond to HuggingFace's `Seq2SeqTrainingArguments`. This file is loaded automatically into `train_config.yaml`.
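
Hydra also lets you override any of these config values from the command line with `key=value` syntax. A minimal sketch (the script and argument names come from this README; `facebook/bart-base` is only a placeholder model name):

```bash
# Train with the defaults from conf/train_config.yaml,
# overriding the model (placeholder value):
python train.py model=facebook/bart-base

# Evaluate with the defaults from conf/eval_config.yaml:
python evaluate.py
```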
There are a few intricacies/things of note for how configs are parsed (`.` indicates hierarchy in the YAML configs):
- To set a batch size, set `training.batch_size` and it will set the corresponding HuggingFace training arguments: `per_device_train_batch_size` and `per_device_eval_batch_size`.
- The `model_type` argument for `train_config.yaml` is `seq2seq`. This selects HuggingFace's `AutoModelForSeq2SeqLM`. The model name passed to the `model` argument must be valid for the corresponding `model_type`. The currently supported `model_type` values are:
  - `seq2seq` --> `AutoModelForSeq2SeqLM`
  - `causal_lm` --> `AutoModelForCausalLM`

  NOTE: `causal_lm` will also add a postprocessor that removes the prompt/context from the generated output.
- To load a checkpoint into training (instead of starting from a HF checkpoint), set the argument `is_checkpoint=true` (see the example commands after this list).
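
To make the options above concrete, here is a hedged sketch of what these overrides could look like (the argument names come from this README; `gpt2` and the checkpoint path are hypothetical placeholders):

```bash
# Fine-tune a causal LM with an overridden batch size;
# training.batch_size sets both per_device_train_batch_size
# and per_device_eval_batch_size.
python train.py model_type=causal_lm model=gpt2 training.batch_size=16

# Load a previously saved checkpoint instead of starting from a
# HF checkpoint (the path is a hypothetical placeholder):
python train.py model=outputs/checkpoint-1000 is_checkpoint=true
```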
If you use this code, please cite:

```bibtex
@article{orlanski2022evaluating,
  title={Evaluating How Fine-tuning on Bimodal Data Effects Code Generation},
  author={Orlanski, Gabriel and Yang, Seonhye and Healy, Michael},
  journal={arXiv preprint arXiv:2211.07842},
  year={2022}
}
```