RACE: Retrieval-Augmented Commit Message Generation

Model Architecture

We propose RACE, a novel model that retrieves a similar commit message as an exemplar, guides the neural model to learn both the content of the code diff and the intent behind it, and generates a readable and informative commit message. The model has two modules: a retrieval module and a generation module. RACE first retrieves the most semantically similar code diff, paired with its commit message, from the large parallel training corpus. The semantic similarity between two code diffs is measured by the cosine similarity of the vectors produced by a code diff encoder. RACE then treats the retrieved commit message as an exemplar and uses it to guide the neural network in generating an understandable and concise commit message.
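The retrieval step can be illustrated with a brute-force nearest-neighbour search. This is only a minimal sketch, not the RACE implementation: the toy vectors stand in for the output of the code diff encoder, and the names `cosine_similarity` and `retrieve_exemplar` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def retrieve_exemplar(
    query_vec: Sequence[float],
    corpus_vecs: List[Sequence[float]],
    corpus_messages: List[str],
) -> Tuple[int, str, float]:
    """Return (index, commit message, similarity) of the most similar training diff."""
    if not corpus_vecs or len(corpus_vecs) != len(corpus_messages):
        raise ValueError("corpus_vecs and corpus_messages must be non-empty and aligned")
    scores = [cosine_similarity(query_vec, vec) for vec in corpus_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, corpus_messages[best], scores[best]


# Toy data (assumption): three "encoded" training diffs and their commit messages.
corpus_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
corpus_messages = ["Fix null check in parser", "Update README", "Add unit tests for parser"]
query_vec = [0.5, 0.9, 0.1]

index, exemplar, score = retrieve_exemplar(query_vec, corpus_vecs, corpus_messages)
print(exemplar)  # prints "Add unit tests for parser"
```

The retrieved message would then be passed to the generation module together with the query diff. A linear scan like this is only practical for small corpora; for a training set of this size an approximate nearest-neighbour index would be a natural replacement.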

Environment

conda create -n RACE python=3.6 -y
conda activate RACE
pip install torch==1.10 transformers==4.12.5 tqdm==4.64.1 prettytable==2.5.0 gdown==4.5.1 more-itertools==8.14.0 tensorboardX==2.5.1 setuptools==59.5.0 tensorboard==2.10.1

Dataset

The MCMD dataset covers five programming languages (PL): Java, C#, Cpp, Python, and JavaScript. The dataset can be downloaded here. More info about MCMD can be found here. We use the filtered version of the dataset in our work.

Dataset statistics

Language    Training  Validation  Test
Java        160,018   19,825      20,159
C#          149,907   18,688      18,702
Cpp         160,948   20,000      20,141
Python      206,777   25,912      25,837
JavaScript  197,529   24,899      24,773

Use the following commands to download and extract the dataset.

wget https://zenodo.org/record/7196966/files/dataset.tar.gz
tar zxvf dataset.tar.gz

This takes about 1 minute.

  • The original data is saved in dataset/java/.
  • The processed data is saved in dataset/java/contextual_medits/.
  • The retrieval data is saved in dataset/java/contextual_medits/codet5_retrieval_result.
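After extraction, the layout listed above can be verified with a short script. This is a minimal sketch and not part of the repository: the helper names `dataset_layout` and `missing_paths` are hypothetical, and applying the same layout to languages other than java is an assumption.

```python
from pathlib import Path
from typing import Dict, List


def dataset_layout(language: str = "java", root: str = "dataset") -> Dict[str, Path]:
    """Map each kind of data to the location described in the README."""
    base = Path(root) / language
    processed = base / "contextual_medits"
    return {
        "original": base,
        "processed": processed,
        # Assumption: other languages mirror the java layout.
        "retrieval": processed / "codet5_retrieval_result",
    }


def missing_paths(language: str = "java", root: str = "dataset") -> List[str]:
    """Return the names of expected locations that do not exist on disk."""
    layout = dataset_layout(language, root)
    return [name for name, path in layout.items() if not path.exists()]
```

Calling `missing_paths("java")` from the directory where the archive was extracted returns an empty list when all three locations are present.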

Training

language=java
bash run.sh $language 
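To train on several languages in turn, the command above can be wrapped in a small driver. This is a sketch under assumptions: it presumes that run.sh accepts the lowercase identifiers used for the directories under results/ (java, csharp, cpp, python, javascript), of which only java is confirmed by the command above, and the helper names are hypothetical.

```python
import subprocess
from typing import List, Sequence

# Assumption: identifiers follow the directory names under results/.
LANGUAGES = ["java", "csharp", "cpp", "python", "javascript"]


def training_command(language: str) -> List[str]:
    """Build the argument list equivalent to `bash run.sh $language`."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language}")
    return ["bash", "run.sh", language]


def train_all(languages: Sequence[str] = LANGUAGES, dry_run: bool = True) -> List[List[str]]:
    """Run training for each language in order; with dry_run, only collect the commands."""
    commands = []
    for language in languages:
        command = training_command(language)
        commands.append(command)
        if not dry_run:
            # check=True stops the loop as soon as one training run fails.
            subprocess.run(command, check=True)
    return commands
```

`train_all()` only returns the commands it would run; `train_all(dry_run=False)`, called from the repository root, actually starts the training runs one after another.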

Evaluation

python evaluate.py --refs_filename [path to the reference file] --preds_filename [path to the prediction file]

For example,

lang=javascript
python evaluate.py --refs_filename results/${lang}/test.gold --preds_filename results/${lang}/test.pred

Output

BLEU:    25.66
Meteor:  15.46
Rouge-L: 32.02
Cider:   1.76
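When comparing several runs, it can help to turn this output into a dictionary. The following is a minimal sketch that assumes evaluate.py prints one `Name: value` pair per line exactly as shown above; `parse_scores` is a hypothetical helper, not part of the repository.

```python
from typing import Dict


def parse_scores(output: str) -> Dict[str, float]:
    """Parse lines of the form 'Name: value' into a {name: float} mapping."""
    scores: Dict[str, float] = {}
    for line in output.splitlines():
        name, separator, value = line.partition(":")
        if not separator:
            continue  # skip lines without a colon
        try:
            scores[name.strip()] = float(value)
        except ValueError:
            continue  # skip lines whose value is not a number
    return scores


sample_output = """BLEU:    25.66
Meteor:  15.46
Rouge-L: 32.02
Cider:   1.76
"""

scores = parse_scores(sample_output)
print(scores["BLEU"])  # prints 25.66
```

The output text could be captured, for example, with `subprocess.run(..., capture_output=True, text=True).stdout` when calling evaluate.py from a script.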

Results

Language    Prediction file
Java        results/java/test.pred
C#          results/csharp/test.pred
Cpp         results/cpp/test.pred
Python      results/python/test.pred
JavaScript  results/javascript/test.pred
