h4iku / t5apr

Repository for the paper "T5APR: Empowering Automated Program Repair across Languages through Checkpoint Ensemble."

Home Page: https://www.sciencedirect.com/science/article/abs/pii/S0164121224001286

License: MIT License

Python 100.00%
program-repair paper

t5apr's Introduction

T5APR: Empowering Automated Program Repair across Languages through Checkpoint Ensemble

arXiv | DOI: 10.1016/j.jss.2024.112083 | Hugging Face Collection

(Figure: T5APR overview)

T5APR is a novel approach to automated program repair (APR) that leverages a multilingual transformer model based on CodeT5 along with a checkpoint ensemble strategy to generate patches for programs written in Python, Java, C, and JavaScript.

Results Structure

The โ€results directory contains generated plausible patches for each benchmark. The outputs-multi directories hold patches generated by the multilingual model, while the outputs-{language} directories contain patches produced by models trained on a single language.

How to Use

Set Up

  1. Install Python 3.9 or higher and clone this repository with its submodules:

    git clone --recurse-submodules https://github.com/h4iku/T5APR.git
    cd T5APR
  2. Create a virtual environment and install the dependencies:

    python -m venv .venv
    source .venv/bin/activate
    
    python -m pip install -U pip setuptools
    pip install -r requirements.txt
  3. Prepare evaluation benchmarks and tree-sitter language grammars:

    Place the evaluation benchmarks in the benchmarks directory. Repositories for QuixBugs, Defects4J, Bears, and BugAID are already there. Defects4J requires additional steps to install:

    cd benchmarks/Defects4J
    cpanm --installdeps .
    ./init.sh

    For further information, follow the Defects4J setup instructions.

    For the Codeflaws benchmark, download the codeflaws.tar.gz archive and extract it into the benchmarks/Codeflaws directory. For the ManyBugs benchmark, the necessary files are in benchmarks/ManyBugs.7z, which you can extract directly into the benchmarks directory; alternatively, you can download the complete scenario tarballs and extract them into the benchmarks/ManyBugs/scenarios directory.

    Submodules for the tree-sitter language grammars are in the tools/tree-sitter-lib/vendor directory, and the compiled library will be placed in tools/tree-sitter-lib/build. If you didn't download the submodules, you can follow the tree-sitter instructions to clone the required language grammars into the same directory. A sketch of how such a library is typically compiled is shown after this list.

  4. To run each module, navigate to the root of the repository and execute the following command:

    python -m package.module

    For example, to run the src/bugline_finders/quixbugs_python.py module:

    python -m src.bugline_finders.quixbugs_python

    To run tests:

    python -m pytest tests
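
The grammar compilation mentioned in step 3 can be sketched as follows, assuming the py-tree-sitter bindings (older releases that still provide Language.build_library); the output filename languages.so is an assumption rather than the project's actual name:

    # Minimal sketch, not the project's build code: compile the vendored grammars
    # into a shared library with py-tree-sitter (Language.build_library is only
    # available in older py-tree-sitter releases). The output filename is assumed.
    from tree_sitter import Language

    Language.build_library(
        "tools/tree-sitter-lib/build/languages.so",
        [
            "tools/tree-sitter-lib/vendor/tree-sitter-python",
            "tools/tree-sitter-lib/vendor/tree-sitter-java",
            "tools/tree-sitter-lib/vendor/tree-sitter-c",
            "tools/tree-sitter-lib/vendor/tree-sitter-javascript",
        ],
    )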

Model Training

Use the train_model.py script to fine-tune the pre-trained model. It downloads the necessary model and datasets from the 🤗 Hugging Face Hub and caches them in the project's .cache directory. By default, it uses codet5-small as the base model and trains on all languages together in a multitask learning setting. You can fine-tune monolingual models by changing the model_type: ModelType = ModelType.MULTI variable.
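
For reference, the following is a minimal sketch, independent of T5APR's own scripts, of loading the codet5-small base model with Hugging Face Transformers and generating a few candidate outputs with beam search:

    # Minimal sketch (not the project's training or inference code): load the
    # codet5-small base model and beam-search a few candidates for a buggy line.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
    model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")

    buggy_line = "if (n = 0) return 1;"
    inputs = tokenizer(buggy_line, return_tensors="pt")
    candidates = model.generate(
        **inputs, max_length=64, num_beams=5, num_return_sequences=5
    )
    for seq in candidates:
        print(tokenizer.decode(seq, skip_special_tokens=True))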

After fine-tuning, you will have five checkpoints in the models/codet5-small-t5apr-{multi|java|python|c|javascript} directory, depending on the model_type you chose. The fine-tuned checkpoints used in the paper's experiments are uploaded to the 🤗 Hub, and you can use the download_checkpoints.py or extract_checkpoints.py scripts to get them. download_checkpoints.py is faster: it uses the huggingface_hub client to download checkpoints directly from the 🤗 Hub and doesn't need Git LFS installed. extract_checkpoints.py needs Git LFS to work (git lfs install); it first clones the model repository and then incrementally extracts checkpoints from its commits. In both scripts, you can change which checkpoints are downloaded by changing the repo_name variable. The default is the multilingual model codet5-small-t5apr-multi.
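
A rough equivalent of what download_checkpoints.py does is sketched below; the h4iku/ namespace for the model repository is an assumption, so adjust repo_name to whatever the script uses:

    # Sketch only: download the multilingual checkpoint repository from the Hub.
    # The repository id (h4iku/...) is assumed; use the repo_name from the script.
    from huggingface_hub import snapshot_download

    repo_name = "h4iku/codet5-small-t5apr-multi"  # or a monolingual variant
    local_path = snapshot_download(repo_id=repo_name, cache_dir=".cache")
    print(f"Checkpoints downloaded to {local_path}")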

The training data is also on the 🤗 Hub, in both raw and preprocessed versions. If you want to preprocess the data obtained from the CoCoNuT repository locally, you can download it, follow their instructions to put it in the data directory (uncompressing will take some time!), and use preprocess.py to preprocess it. The preprocessed data will be written to the same directory.
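
For example, one of the preprocessed datasets can be loaded directly from the Hub with the datasets library; the dataset id below is taken from the training script's error trace further down, and the other languages follow the same naming pattern (check train_model.py for the exact names):

    # Sketch: load one of the preprocessed training datasets from the Hub.
    # The exact dataset ids for each language are listed in train_model.py.
    from datasets import load_dataset

    ds = load_dataset(
        "h4iku/coconut_java2006_preprocessed", split="train", cache_dir=".cache"
    )
    print(ds)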

Finding Bug Lines & Patch Generation

Scripts in the bugline_finders directory extract buggy lines and other metadata from bugs in the evaluation benchmark programs by diffing the buggy and correct versions of each program. The outputs are saved in the generated_assets folder under the name of each benchmark.
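
As an illustration of the general idea (the project's finders are benchmark-specific and extract more metadata), changed lines can be located by diffing the buggy and fixed sources:

    # Illustrative only: locate the buggy line by diffing buggy and fixed sources.
    import difflib

    buggy_src = ["def fact(n):\n", "    return n * fact(n)\n"]
    fixed_src = ["def fact(n):\n", "    return 1 if n == 0 else n * fact(n - 1)\n"]

    for line in difflib.unified_diff(buggy_src, fixed_src, lineterm=""):
        if line.startswith("-") and not line.startswith("---"):
            print("buggy line:", line[1:].rstrip())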

To generate primary candidate patches for the previously extracted buggy lines, use generate_primary_candidates.py. You can change the dataset value to generate candidate patches for different benchmarks. The Boolean variable multi decides whether candidate patches are generated with the multilingual checkpoints or the language-specific monolingual ones. The output is saved in a file named sequences_{beam_size}.jsonl in the outputs-multi or outputs-{language} directory, depending on the model you choose.
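
The relevant knobs look roughly like this; the variable names come from the description above, and the values are only illustrative:

    # Illustrative values only; the variable names are those described above.
    dataset = "QuixBugs-Python"  # which benchmark to generate candidate patches for
    multi = True                 # True: multilingual checkpoints, False: monolingual
    beam_size = 100              # determines the sequences_{beam_size}.jsonl file name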

The sequences_{beam_size}.jsonl file contains the candidate patches generated by all the checkpoints. You can then use combine_checkpoints_results.py to combine the per-checkpoint patches into a single list of candidate patches, resulting in the file final_candidates_{beam_size}.jsonl.
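
A simplified illustration of such a combination step is shown below; the project's actual combination and ranking logic lives in combine_checkpoints_results.py and rerank_patches.py:

    # Simplified sketch: merge ranked candidate lists from several checkpoints into
    # one deduplicated list, rank group by rank group (not the project's real logic).
    from itertools import zip_longest

    def combine(checkpoint_candidates: list[list[str]]) -> list[str]:
        """Interleave candidates rank by rank, keeping only first occurrences."""
        seen: set[str] = set()
        combined: list[str] = []
        for rank_group in zip_longest(*checkpoint_candidates):
            for patch in rank_group:
                if patch is not None and patch not in seen:
                    seen.add(patch)
                    combined.append(patch)
        return combined

    print(combine([["p1", "p2"], ["p2", "p3"], ["p1", "p4"]]))  # ['p1', 'p2', 'p3', 'p4']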

Patch Validation

To validate candidate patches for each benchmark, use the scripts in the validators folder. Change the output_dir variable to choose whether validation runs on candidate patches from the multilingual or the monolingual checkpoints. The validation result of each bug is saved in the save-state folder under a filename based on its bug id. These modules are resumable: if they are interrupted mid-execution, they continue validating the remaining bugs the next time they run. The final aggregate results of plausible patches are saved in the plausible_candidates_{beam_size}.jsonl file.
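
The resumable pattern described above can be sketched as follows; the path and the run_tests helper are hypothetical stand-ins for applying each candidate and running the benchmark's test suite:

    # Sketch of the resumable validation pattern; path and run_tests are hypothetical.
    import json
    from pathlib import Path

    save_state = Path("save-state")
    save_state.mkdir(parents=True, exist_ok=True)

    def run_tests(bug_id: str, patch: str) -> bool:
        """Hypothetical stand-in for applying a patch and running the bug's tests."""
        return False

    def validate(bug_id: str, candidates: list[str]) -> None:
        result_file = save_state / f"{bug_id}.json"
        if result_file.exists():  # already validated in a previous, interrupted run
            return
        plausible = [patch for patch in candidates if run_tests(bug_id, patch)]
        result_file.write_text(json.dumps({"bug_id": bug_id, "plausible": plausible}))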

After validation, you can use the separate_d4j_versions.py script to generate separate folders for bugs in v1.2 and v2.0 of the Defects4J benchmark.

You can rerank the patches in the final results using the rerank_patches.py script, which produces reranked_candidates_{beam_size}.jsonl. You can then use generate_results.py to generate the plausible results in the results directory and start the manual patch assessment process. During assessment, make sure to set the assessment variable to True so that results are generated from the reranked file. After every change you make to the items in the results folder, you can rerun the rerank_patches.py script to write the assessments back to the reranked_candidates_{beam_size}.jsonl file. The validated_reranked_candidates_100.jsonl file is a copy of reranked_candidates_100.jsonl updated with our patch assessments; it is the file from which we generated the results directory.

Misc

The locations of the directories and paths that files are written to or read from are defined in configs.py.
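
A hypothetical sketch of what such path definitions can look like; the actual names and values are whatever configs.py defines:

    # Hypothetical shape of the path configuration; see configs.py for the real values.
    from pathlib import Path

    PROJECT_ROOT = Path(__file__).resolve().parent.parent
    BENCHMARKS_DIR = PROJECT_ROOT / "benchmarks"
    GENERATED_ASSETS_DIR = PROJECT_ROOT / "generated_assets"
    MODELS_DIR = PROJECT_ROOT / "models"
    RESULTS_DIR = PROJECT_ROOT / "results"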

The complete content of the generated_assets directory for all the benchmarks is available here.

Evaluation

TBA

Citation

If you use T5APR in your research, please cite the following paper:

@article{gharibiT5APREmpoweringAutomated2024,
  title = {T5APR: Empowering Automated Program Repair across Languages through Checkpoint Ensemble},
  shorttitle = {T5APR},
  author = {Gharibi, Reza and Sadreddini, Mohammad Hadi and Fakhrahmad, Seyed Mostafa},
  year = {2024},
  journal = {Journal of Systems and Software},
  volume = {214},
  pages = {112083},
  doi = {10.1016/j.jss.2024.112083},
}


t5apr's Issues

Unable to train the model, there is an issue with the dataset

After running "python train_model.py", I encountered the following issue. When I directly access this URL, it returns "Entry not found".

#python train_model.py 
CUDA available: True
CUDA version: 11.8
Traceback (most recent call last):
  File "train_model.py", line 31, in <module>
    raw_datasets = {
  File "train_model.py", line 32, in <dictcomp>
    prefix: load_dataset(data, split="train") for prefix, data in dataset_names.items()
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/load.py", line 1815, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/load.py", line 1512, in dataset_module_factory
    raise e1 from None
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/load.py", line 1489, in dataset_module_factory
    return HubDatasetModuleFactoryWithoutScript(
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/load.py", line 1031, in get_module
    dataset_readme_path = cached_path(
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 182, in cached_path
    output_path = get_from_cache(
  File "/root/miniconda3/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 601, in get_from_cache
    raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")
ConnectionError: Couldn't reach https://huggingface.co/datasets/h4iku/coconut_java2006_preprocessed/resolve/76a94c2ccfd933177a62d8d5a5eea4ce4b84af34/README.md (error 503)

Request for Missing 'tools' Folder

I noticed that the project seems to be missing the 'tools' folder, and when running the code, I encountered a FileNotFoundError for the file /root/T5APR-main/tools/tree-sitter-lib/vendor/tree-sitter-python/src/parser.c. It appears that the 'tools' folder is necessary for the correct functioning of the code.
Could you please share the missing 'tools' folder or provide instructions on how to obtain it?
