CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

This is the official code for the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (accepted to NeurIPS 2022). Do check out our blog and poster.

Authors: Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi

Contents:

  • CodeRL Overview
  • Installation
  • Datasets
  • Models
  • Processes
  • Example Generated Programs
  • Citation
  • License

CodeRL Overview


An example program synthesis task (Right): each task includes a problem specification in natural language, which often contains example input and output pairs. The expected output is a program that is checked for functional correctness against some unit tests. A high-level overview of our CodeRL framework for program synthesis (Left): CodeRL treats the pretrained language model (LM) as a stochastic policy, token predictions as actions, and estimates rewards from the unit test results of output programs.

  • During training, we treat the code-generating language model as an actor network and introduce a critic network that is trained to predict the functional correctness of generated programs and to provide dense feedback signals to the actor (a simplified sketch of this training signal follows the list below).
  • During inference, we introduce a new generation procedure with a critic sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores.
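
For intuition only, the actor's training signal can be pictured as a per-token policy-gradient loss. The sketch below is a minimal illustration of that idea, not the repository's implementation; the tensor shapes and the way rewards are distributed over tokens are assumptions.

import torch.nn.functional as F

def rl_loss(actor_logits, sampled_tokens, token_rewards):
    # actor_logits:   (T, vocab) logits of the actor LM for the sampled program
    # sampled_tokens: (T,) token ids of the sampled program (the "actions")
    # token_rewards:  (T,) per-token return estimates, e.g. a sparse unit-test
    #                 reward spread over tokens using the critic's scores
    log_probs = F.log_softmax(actor_logits, dim=-1)
    chosen = log_probs.gather(1, sampled_tokens.unsqueeze(1)).squeeze(1)
    return -(token_rewards * chosen).sum()  # REINFORCE-style objective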

Installation

The code requires the dependencies specified in requirements.txt. Please install the relevant libraries, or run:

pip install -r requirements.txt

Install the transformers library from the source code included in this repository (the included source is developed from the original code of version 4.16.1):

cd transformers
pip install -e .

Datasets

For pretraining, apart from CodeSearchNet (CSN), we use the Python GitHub Code dataset (GCPY). We have compiled public, non-personal data from GitHub consisting of permissively licensed Python code (e.g. “mit”, “apache-2”, “bsd-3-clause”, “bsd-2-clause”, “cc0-1.0”, “unlicense”, “isc”). Please see the paper for more details on pretraining data preprocessing and pretraining.

After pretraining, we finetune/evaluate models on the following major program synthesis benchmarks:

  • APPS: Please follow the downloading and preprocessing instructions provided here.
  • MBPP: The dataset is available here.

On both benchmarks, we follow the same data preprocessing and input/output sequence construction as the original benchmark papers.

Download and unzip all files into the data folder.

Example Unit Tests

In addition to the original hidden unit tests on APPS, we also utilize the example tests that are often embedded in problem descriptions. After downloading and unzipping APPS, you can run the notebook extract_example_test.ipynb to extract and save the example unit tests of the APPS test samples into the corresponding sample folders, e.g. data/APPS/test/0000/. We release the example unit tests that we extracted with this notebook in the folder data/APPS_test_example_tests/. The average number of example unit tests per sample is 1.9764.
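
As a quick sanity check after extraction, a sketch like the following can count the example tests per sample; the input_output.json file name follows the APPS convention, and the exact directory layout is an assumption.

import glob
import json

# Inspect a few extracted example-test files (path layout is assumed)
for path in sorted(glob.glob('data/APPS_test_example_tests/*/input_output.json'))[:5]:
    with open(path) as f:
        tests = json.load(f)  # {"inputs": [...], "outputs": [...]}
    print(path, len(tests['inputs']), 'example test(s)')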

Models

We employ CodeT5 (a family of encoder-decoder language models for code introduced in the CodeT5 paper) as the foundation model in our work.

We pretrained CodeT5 with a larger dataset and improved learning objectives. We release two large-sized CodeT5 checkpoints on Hugging Face: Salesforce/codet5-large and Salesforce/codet5-large-ntp-py (a minimal loading sketch follows the list below).

  • CodeT5-large: a 770M-parameter CodeT5 model pretrained with the Masked Span Prediction objective on CSN; it achieved new SOTA results on several CodeXGLUE benchmarks. See Appendix A.1 of the paper for more details.
  • CodeT5-large-ntp-py: a 770M-parameter CodeT5 model first pretrained with the Masked Span Prediction objective on CSN and GCPY, followed by the Next Token Prediction objective on GCPY. This checkpoint is especially optimized for Python code generation tasks and is the one employed by CodeRL.
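
A minimal loading sketch for these checkpoints; the tokenizer call mirrors the RobertaTokenizer('Salesforce/codet5-base') usage found elsewhere in this repository, while the prompt and sampling settings below are purely illustrative.

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-large-ntp-py')

# Encode an illustrative problem description and sample one program
inputs = tokenizer("Write a Python program that prints the sum of the integers on one input line.",
                   return_tensors='pt')
outputs = model.generate(**inputs, max_length=256, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))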

For finetuning on downstream code generation tasks on APPS, we employed critic models for RL training. We released the following critic model checkpoints (on Google Cloud Storage):

  • CodeT5-finetuned_critic: a CodeT5 model initialized from CodeT5-base and trained as a classifier to predict unit test outcomes (one of Compile Error, Runtime Error, Failed Tests, and Passed Tests). This critic is used to estimate returns and facilitate RL finetuning.
  • CodeT5-finetuned_critic_binary: similar to the above model but trained with binary annotations (Passed Tests vs. not Passed Tests only). This critic is used to facilitate the generation procedure during inference.

We released the following finetuned code generation model checkpoints (on Google Cloud Storage):

  • CodeT5-finetuned_CodeRL: a CodeT5 model which was initialized from the prior pretrained CodeT5-large-ntp-py and then finetuned on APPS following our CodeRL training framework.

Download all files into the models folder.
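
Based on the paths used throughout the scripts below, a plausible layout of the data and models folders is sketched here; the exact sub-folder names depend on how the downloads are unzipped, so treat this only as a guide.

data/
    APPS/
        train/                     # training problems (incl. example generated programs)
        test/                      # test problems, e.g. data/APPS/test/0000/
    APPS_test_example_tests/       # extracted example unit tests
models/
    codet5_tokenizer/              # cached tokenizer files
    codet5_finetuned_codeRL/       # CodeT5-finetuned_CodeRL checkpoint
    codet5_finetuned_critic/       # critic checkpoint(s)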

Processes

Generating Programs

We created scripts/generate.sh to generate programs on the APPS benchmark. You can directly run this file by configuring the following parameters:

Parameters | Description | Example Values
model_path | Path to a trained CodeT5-style model | models/codet5_finetuned_codeRL
tokenizer_path | Path to the saved tokenizer for CodeT5 (or path to cache the tokenizer) | models/codet5_tokenizer/
test_path | Path to the original test samples | data/APPS/test/
start | Start index of test samples to be generated | 0
end | End index of test samples to be generated | 5000
num_seqs | Total number of output programs to be generated (for sampling generation) | 1000
num_seqs_per_iter | Number of output programs per generation round; depending on the GPU memory limit, generation can run over multiple rounds | 50
temp | Temperature for sampling generation | 0.6
output_path | Path to save generated programs | outputs/codes/

Other parameters are defined in the file utils/generate_configs.py.

Running the generation script will output the generated programs, saved into json files (one per test sample) with the data fields code (list of output programs) and prompt (the constructed input sequence to the LM).
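
For example, the saved outputs can be inspected with a short script like this sketch; the per-problem file naming under output_path is an assumption.

import glob
import json

for path in sorted(glob.glob('outputs/codes/*.json')):
    with open(path) as f:
        result = json.load(f)
    programs = result['code']   # list of generated programs
    prompt = result['prompt']   # input sequence given to the LM
    print(path, len(programs), 'program(s) generated')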

Running Unit Tests

Once the programs are generated, they are evaluated against the corresponding unseen unit tests in each problem.

To execute the unit tests and obtain test outcomes, we adapt our code from the official implementation of the APPS benchmark.

We created scripts/run_unit_tests.sh to run unit tests on generated programs on the APPS benchmark. You can directly run this file by configuring the following parameters:

Parameters | Description | Example Values
code_path | Path to the generated programs to be evaluated | outputs/codes/
output_path | Path to save unit test results | outputs/test_results/
test_path | Path to the original test samples | data/APPS/test/
example_tests | Whether to evaluate the programs on example unit tests (for filtering/refining programs) or hidden unit tests (for final evaluation) | 0: use hidden unit tests; 1: use example unit tests
start | Start index of test samples to be evaluated | 0
end | End index of test samples to be evaluated | 5000
threads | Number of threads to evaluate test samples in parallel (set according to available compute) to speed up execution | 30

Running the script will output test results for each program. For each test sample, the results are saved into a pickle file with the data fields results (list of test outcomes: -2 = compile error, -1 = runtime error, False = failed test case, True = passed test case), errors (the actual compile error trace with details such as error type and line numbers), and sols (the corresponding programs being evaluated).

Compared to the original APPS implementation, we adopt one trick: we exit the unit-testing loop as soon as a program fails a test case. This speeds up the testing process without affecting the final pass-rate measures. Refer to the run_test function in utils/testing_utils.py for more details.
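
A small sketch of summarizing these pickle files; the file naming under output_path and the exact nesting of the results field are assumptions, so adjust as needed.

import glob
import pickle

for path in sorted(glob.glob('outputs/test_results/*.pkl')):
    with open(path, 'rb') as f:
        data = pickle.load(f)
    n_passed = 0
    for res in data['results']:
        outcomes = res if isinstance(res, list) else [res]
        # -2 = compile error, -1 = runtime error, False = failed, True = passed
        if outcomes and all(o is True or o == 1 for o in outcomes):
            n_passed += 1
    print(path, f"{n_passed}/{len(data['results'])} program(s) passed all evaluated tests")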

Evaluating Programs

To compute the pass@k metrics, rather than using the APPS evaluation metrics, we follow the official implementation of the HumanEval benchmark (which better measures pass@k, normalized over the number of possible k-program subsets).
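
For reference, the unbiased pass@k estimator from the HumanEval codebase looks like this, where n programs are generated per problem and c of them pass all unit tests:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Computes 1 - C(n - c, k) / C(n, k) in a numerically stable way
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))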

Training Critic

We can train a critic model as a classifier that predicts the test outcomes of generated samples. For each training sample, we can follow the prior processes (generating programs and running unit tests) to obtain synthetic samples and their annotations of unit test outcomes. On average, we generate 20 programs per training sample (we provided some example generated programs in data/APPS/train/).

Once the programs are tested, we can use their test outcomes as annotations to train a critic model initialized from an LM pretrained on source code data (we used CodeT5-base in this case).
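
As a rough sketch of how such annotations can be derived from the unit-test outcomes described earlier (the actual label encoding used in this repository may differ):

def outcome_to_critic_label(results):
    # results: list of per-test outcomes for one program
    # (-2 = compile error, -1 = runtime error, False = failed, True = passed)
    if -2 in results:
        return 'compile error'
    if -1 in results:
        return 'runtime error'
    if results and all(r is True or r == 1 for r in results):
        return 'passed tests'
    return 'failed tests'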

We created scripts/train_critic.sh and scripts/train_critic_deepspeed.sh to train a critic using generated programs. You can directly run this file by configuring the following parameters:

Parameters | Description | Example Values
batch-size-per-replica | Number of training samples per GPU device | 8
grad-acc-steps | Gradient accumulation steps | 1
epochs | Number of training epochs | 10
lr | Learning rate | 2e-5
save-freq | Save model checkpoints after this number of training steps | 1000
log-freq | Log model training losses after this number of training steps | 10
save_total_limit | Total number of checkpoints to keep (only the latest ones are kept) | 5
fp16 | Enable this to train the model in 16-bit mode and reduce memory usage | N/A
deepspeed | If using DeepSpeed, set this parameter to the DeepSpeed training configuration file | configs/deepspeed_configs.json
db | Enable this to train in debugging mode, i.e. with a small dummy data split and only 1 data worker | N/A

Other parameters are defined in the file utils/train_configs.py.

Running the script will train a critic model as a classifier that receives a problem description + a generated program as input and returns one of 4 test outcomes: compile error, runtime error, failed tests, or passed tests. The model checkpoints are saved in a folder under exps/.

Generating Critic Scores

We created scripts/generate_critic_scores.sh to generate critic scores for synthetic programs. We use the same parameters as defined in the generating program process with the following additional parameters:

Parameters | Description | Example Values
critic_scores | Enable this to run inference on critic models and obtain critic scores | N/A
gt_solutions | Enable this to run inference on ground-truth programs; else, synthetic programs are used by default | N/A
binary_prediction | Enable this to predict in binary classification, i.e. passed tests or failed tests only | N/A

Other parameters are defined in the file utils/generate_configs.py.

Running the generation script will output the predictions of the critic model. For each data sample, the prediction is saved into a pkl (pickle) file with the data fields code (list of programs), prompt (constructed input sequence to the critic model), gt_error_type (ground-truth test outcomes), pred_error_type (test outcomes predicted by the critic), and error_hidden_states (hidden states returned by the critic).
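
A sketch of inspecting these predictions, e.g. to estimate the critic's accuracy; the output directory name below is an assumption.

import glob
import pickle

correct = total = 0
for path in glob.glob('outputs/critic_scores/*.pkl'):
    with open(path, 'rb') as f:
        pred = pickle.load(f)
    for gt, p in zip(pred['gt_error_type'], pred['pred_error_type']):
        correct += int(gt == p)
        total += 1
print(f'critic accuracy: {correct / max(total, 1):.3f}')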

Finetuning with Ground-truth Programs

We can finetune any pretrained language model as a program synthesis model that generates code from a problem description in natural language. In our approach, this finetuning stage is a warm-up stage using the ground-truth annotations (from APPS) before a further finetuning stage on synthetic/generated programs.

We created scripts/train_actor.sh and scripts/train_actor_deepspeed.sh which include the parameters as defined above in the critic training process.

Running the script will finetune a pretrained CodeT5-large model that receives a problem description as input and returns a corresponding solution program in Python. The model checkpoints are saved in a folder under exps/.

Finetuning with Generated Programs

We created scripts/train_actor_rl.sh and scripts/train_actor_rl_deepspeed.sh to train pretrained LMs with synthetic generated programs. We use the parameters as defined above in the critic training process with the following additional parameters:

Parameters | Description | Example Values
model_path | Path to a finetuned model checkpoint, e.g. from warm-up training | models/codet5_finetuned_codeRL
relative_returns | Enable this to use a baseline and compute relative return estimates rather than absolute return estimates in the RL loss | N/A

Other parameters are defined in the file utils/train_configs.py.

Running the script will load a finetuned CodeT5-large model and continue to train it with both generated programs and ground-truth programs in alternating training steps. The model checkpoints are saved in a folder under exps/.
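
To clarify the relative_returns flag described above: relative returns subtract a baseline program's return (e.g. from a greedy sample) so that the RL loss is weighted by an advantage rather than an absolute reward. The helper below is an illustrative sketch, not the repository's API.

def return_estimate(sample_reward, baseline_reward=None, relative_returns=False):
    # sample_reward:   scalar unit-test reward of the sampled program
    # baseline_reward: scalar reward of a baseline (e.g. greedy) program
    if relative_returns and baseline_reward is not None:
        return sample_reward - baseline_reward  # advantage-style (relative) return
    return sample_reward                        # absolute return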

Generating Programs with Critic Sampling

We will release the implementation details of our critic sampling procedure.

Example Generated Programs

The problem is from the APPS benchmark, and the solution programs are generated by CodeT5 and CodeRL.

Citation

If you find the paper or the source code useful to your projects, please cite the following bibtex:

@inproceedings{
	le2022coderl,
	title={Code{RL}: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
	author={Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven Hoi},
	booktitle={Advances in Neural Information Processing Systems},
	editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
	year={2022},
	url={https://openreview.net/forum?id=WaGvb7OzySA}
}

License

The code is released under BSD 3-Clause - see LICENSE.txt for details.

This code is developed from other open-source projects, including APPS, HumanEval, and transformers. We thank the original contributors of these works for open-sourcing their valuable source code.

coderl's Issues

problem about run_unit_tests.sh

Hi, thanks for the great work!
I sampled 100 candidates for each problem, but when I run run_unit_tests.sh, not every problem gets 100 test results.
Some candidates for a problem seem to break the testing, so I only get dozens of test results for that problem.
Do you know what is going wrong? Can you give me some advice?

Question about pre-training process

Hi, I am curious about which validation task you use during the pre-training process. Could you please share some information about this?

Sample Temperature

Hi, thanks for the great work!
I want to ask what sampling temperature was used for codet5_finetuned_codeRL.
Thanks for your reply.

Actor model finetuning code based on reward and policy gradient

Thanks for the great work! Would it be possible to share the code for the whole RL finetuning framework (actor and critic updates based on the reward defined in the paper) for better reproducibility? For example, the code for updating the actor network based on the reward and policy gradient is missing.

Finetuned model checkpoints

Hi,

Thanks for your amazing work! When do you plan to release finetuned model checkpoints?

Thank you very much!

Run generate.py with the CodeT5-large trained on the ground-truth programs

Hi @henryhungle ,
While waiting for the release of the final RL-finetuned code generator, I would like to run the generator trained on ground-truth programs as described here.
However, the training script scripts/train_actor.sh only saves checkpoints, and each checkpoint folder does not contain the same files as the pretrained model folders.
So the question is: how can I generate programs from those checkpoints?
Thank you.

Bugs for automated example input/output test case extraction

Hi there! CodeRL is a brilliant idea, thanks for the effort!
I have also worked with the APPS dataset, and I found it hard to extract example test cases from the problem descriptions. After checking your published data, I think your extraction script fails in some cases.
For example, no example test cases are extracted for tasks 4675, 4751, and 4752, because these problem descriptions have no --- tags. I believe this happens for all of the following similar tasks:

[2303, 2365, 2466, 2467, 2468, 2469, 2470, 2627, 2628, 2629, 2630, 2631, 2632, 2633, 2634, 2635, 2636, 2637, 2638, 2639, 2647, 2685, 2703, 2708, 2745, 2746, 2747, 2748, 2882, 2883, 2884, 2885, 2887, 4109, 4479, 4480, 4533, 4534, 4535, 4536, 4658, 4659, 4660, 4661, 4662, 4663, 4664, 4665, 4666, 4667, 4668, 4669, 4670, 4671, 4672, 4673, 4674, 4675, 4751, 4752]

Besides, some tasks have non-string input/output test cases, which are not handled by your script. For example, the extracted test cases for task 4658 are:

{"inputs": [" n = 00000010100101000001111010011100\n", " n = 11111111111111111111111111111101\n"], "outputs": ["    964176192 (00111001011110000010100101000000)\nExplanation: The input binary string 00000010100101000001111010011100 represents the unsigned integer 43261596, so return 964176192 which its binary representation is 00111001011110000010100101000000.\n", "   3221225471 (10111111111111111111111111111111)\nExplanation: The input binary string 11111111111111111111111111111101 represents the unsigned integer 4294967293, so return 3221225471 which its binary representation is 10111111111111111111111111111111. \n"]}

I believe the following tasks may have the same issue.

[2171, 2303, 2307, 2365, 2392, 2466, 2467, 2468, 2469, 2470, 2527, 2528, 2529, 2530, 2531, 2532, 2533, 2534, 2535, 2536, 2627, 2628, 2629, 2630, 2631, 2632, 2633, 2634, 2635, 2636, 2637, 2638, 2639, 2663, 2664, 2665, 2666, 2667, 2668, 2669, 2670, 2671, 2672, 2673, 2674, 2675, 2676, 2677, 2678, 2679, 2680, 2681, 2682, 2683, 2684, 2685, 2686, 2687, 2688, 2689, 2690, 2691, 2692, 2693, 2694, 2695, 2696, 2697, 2698, 2699, 2700, 2701, 2702, 2703, 2704, 2705, 2706, 2707, 2708, 2745, 2746, 2747, 2748, 2882, 2883, 2884, 2885, 2887, 2888, 4479, 4480, 4533, 4534, 4535, 4536, 4658, 4659, 4751, 4752]

Politically correct license description

Hi,

I noticed in the README that a handful of licenses are described as the ones that "at least permit academic use". This doesn't look correct to me; I think maybe some information may have been lost in translation. The licenses listed are ones that permit use in closed-source software. In my country, academic works are usually open source.

It would be polite to people who use copyleft licenses (which require works to be open source and are intended to permit academic use, such as the GPL), if the text were to be clarified.

Aside from that, this work is incredibly inspiring and groundbreaking, and thank you so much for publishing it.

Problems in reproducing the RL fine-tuned results

Hi, thanks for open-sourcing your amazing work!

I have been trying to reproduce the RL fine-tuned results reported in the paper, but unfortunately, I am encountering some issues. Here is a brief overview of the steps I followed:

  • Fine-tuned the actor model with CE loss for 10 epochs with train_actor.sh and the CodeT5-NTP model. This fine-tuned model gives similar results to the paper (2.86 pass@5 compared to 2.90 in the paper)

  • With some modifications to generate.py, generated 20 candidate samples per problem (following the sample files given in the repo) and greedy baseline codes for the training set with the CE fine-tuned model. The result key required for the corresponding gen_solutions.json and baseline_solutions.json was generated with this snippet.

  • Generated the token level hidden states/critic scores with the released critic model through generate_critic_scores.sh.

  • After RL finetuning with the default hyperparameters in train_actor_rl.sh, the RL-finetuned model gives very degraded results (0.84 pass@5).

I would greatly appreciate any suggestions you may have on hyperparameter choices or other settings that could help me reproduce the RL-finetuned results accurately.

Many thanks!

problems in the critic model results

Hello, I noticed that you have trained a four-class classification model (the critic).
What are the accuracy, recall, and F1 score of this classification model on the APPS test set? How do you determine whether the critic model is ready?

documentation request for test_one_solution.py

Hi authors, could you add some documentation for the test_one_solution.py file, please? I have two questions:

  • What does test_one_solution.py do? (What are its inputs and outputs?)
  • How do I use the file? Is there a bug, or do I need to run some other scripts before running test_one_solution.py?

Thanks in advance!

How to arrange the ‘models’ folder?

I have downloaded all the models mentioned in the README and created a separate folder for each model, e.g. "CodeT5-large/". But when I run the script generate.sh, I receive “ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.” for this line:
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base', cache_dir=args.tokenizer_path)

Can anyone show me how to arrange the models in the 'models' folder?

exception of run_unit_tests.sh

Some synthesized programs raise the following exception when they are evaluated by run_unit_tests.sh. Are there any suggestions? Thanks.

0% 0/20 [00:00<?, ?it/s]test framework exception = AttributeError("module 'tmp_sol' has no attribute 'zeroes '")module 'tmp_sol' has no attribute 'zeroes '

RL with execution-based reward

Hi, it seems that the RL trainer in the code uses the critic model to generate rewards; or is it using offline RL with preloaded rewards?

I wonder if there is a way to use online RL with execution-based rewards, or if it's too slow/unstable in practice. Thanks!

Critic Training pre-processing steps

Hello,

Thanks for making the code for this great project open source, this is really great!

We are using CodeRL as a really nice starting point for student projects, and there are some questions for understanding:
In the "Critic Training" section, you say the following:

We can train a critic model as a classifier that predicts the test outcomes of generated samples. For each training sample, we can follow the prior processes (generating programs and running unit tests) to obtain synthetic samples and their annotations of unit test outcomes. On average, we generate 20 programs per training sample (we provided some example generated programs in data/APPS/train/).

  • You don't explicitly say, but from context I think you are using the CodeT5-large-ntp-py model for this?
  • What do you mean by "on average" 20 programs per training sample? The generation code does not allow for an "average" number of generated solutions; it always produces the specified number of outputs per instance.
  • Related to that, when comparing with the provided example outputs in data/APPS/train/, we see that all of the solutions in the gen_solutions.json files look like "good" code, and sometimes there are fewer than n=20 of them. However, when we use the CodeT5-large-ntp-py model to generate solutions ourselves, there are always n solutions, where sometimes the model outputs code, but a lot of the time it produces no code at all and instead emits some other output, such as repeated natural-language descriptions, e.g.:
print(gen_data['0']['code'][0])
�� the number of words that played the game.


ANSWER:


"""

class Solution(object):
    def reverse(self, n):
        """
        :type n: int
        :rtype: int
        """
        if n == 0:
            return -1
        l = list(bin(n))
        l.reverse()
        return sum(l)

if __name__ == '__main__':
    print Solution().reverse(int(raw_input()))

[...]

print(gen_data['0']['code'][2])
�� the answer.

ANSWER:

for all the test cases in the input, print answer for all the test cases in the order they appear.

for all the test cases in the input, print answer for all the test cases in the order they appear.

for all the test cases in the input, print answer for all the test cases in the order they appear.

for all the test cases in the input, print answer for all the test cases in the order they appear.
[...]
  • Is there some post-processing going on that we are overlooking?

Critic training problem: Category imbalance in data

Hello, I noticed that you have trained a classification model (the critic). How did you overcome the problem of class imbalance in the data? As far as I know, accepted solutions should be few, while wrong-answer solutions should be many more.

Question about the max-pooling operation.

Hi, congratulations on completing this great work!
I have some questions about the critic model from reading your code:
In your paper, you mention that "The contextual hidden states of the program tokens (h1, . . . , hT) obtained from the critic model decoder are max-pooled along the sequence length dimension." However, in your code, you do not perform max-pooling along the sequence length dimension on h (the contextual hidden states, whose last dimension has size config.d_model), but on error_states (whose last dimension has size 4), by taking the maximum value along dimension 1.

Location: the problematic code is in CodeRL/transformers/src/transformers/models/t5/modeling_t5.py, class T5ForConditionalGeneration, method forward:

self.error_head = nn.Sequential(
    nn.Linear(config.d_model, 128),
    nn.ReLU(),
    nn.Linear(128, 4),
)

if error_types is not None:
    # sequence_output: (batch, seq_len, d_model) decoder hidden states
    error_states = self.error_head(sequence_output)  # (batch, seq_len, 4)
    error_logits, _ = error_states.max(1)            # max over dimension 1 (the sequence dimension)
    error_pred_loss_fct = CrossEntropyLoss()
    error_pred_loss = error_pred_loss_fct(error_logits.view(-1, error_logits.size(-1)), error_types.view(-1))
    _, error_preds = torch.max(error_logits, dim=-1)
    if return_error_hidden_states:
        return error_pred_loss, error_preds, error_states
    return error_pred_loss, error_preds

What are the hyperparameters for RL training?

Hi, thanks for the nice work. I am trying to reproduce the results reported in the paper. However, I did not find details about the training parameters (e.g. learning rate, number of epochs) of the second-stage finetuning (RL). I trained the RL stage with the same parameters as the first-stage finetuning (SL), but the performance degrades a lot. I think it is due to wrong hyperparameters. Could you share the details? Thanks in advance.

Question about the input of the critic model

Hi, as you mention in the paper, the input of the critic model should include 'Ws' and 'D', where 'Ws' is the sampled program and 'D' is the problem description. However, I did not find any operation that mixes 'Ws' into the input text in APPSBaseDataset or in the function generate_prompt. Is there anything I missed? If not, how would this difference affect the performance of the model?
Sincerely looking forward to your reply!

Datasets for train_actor_rl.sh

Dear @henryhungle,
According to the script train_actor_rl.sh and trainer_rl.py, we have to prepare several JSON/pickle files such as gen_solutions_critic_scores.pkl and/or baseline_solutions.json. I also found that gen_solutions.json is missing. How can I generate these files?
I found that we can generate gen_solutions_critic_scores.pkl by running generate_critic_scores.sh without the gt_solutions flag. However, that requires gen_solutions.json, and I cannot find any script that produces it. I am stuck at this point.
I think we can generate baseline_solutions or gen_solutions just like the normal program generation procedure, right?
Thanks.

Performance Results on HumanEval

I am reading your CodeRL paper. It uses the APPS benchmark to show the performance comparison with Codex. Do you have any comparison results using the HumanEval dataset?
