hendrycks / apps

APPS: Automated Programming Progress Standard (NeurIPS 2021)

License: MIT License

Python 98.50% Shell 1.50%

code-generation program-synthesis

apps's Introduction

I'm Dan, a PhD student in ML at UC Berkeley.

See my webpage for my research.

See below for my code.

apps's Issues

answer_type calculation is different for train/val and eval

Not necessarily an issue, but I noticed that for train/val, the answer_type is based on whether starter_code exists but that at eval time, it's based on fn_name. Is there a reason for this difference?
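The difference can be illustrated with a minimal sketch. The function names and the sample data are hypothetical, and the two prompt strings are assumptions about what the repository appends. Only the two selection rules (starter_code for train/val, fn_name for eval) come from the observation above.

```python
# Hypothetical prompt suffixes; the exact strings used by the repository are an assumption.
STANDARD_INPUT = "\nUse Standard Input format\n"
CALL_BASED = "\nUse Call-Based format\n"


def answer_type_train(starter_code):
    """Train/val rule as described above: decided by whether starter code exists."""
    return CALL_BASED if starter_code else STANDARD_INPUT


def answer_type_eval(input_output):
    """Eval rule as described above: decided by whether the test cases name a function."""
    return CALL_BASED if input_output.get("fn_name") else STANDARD_INPUT


# A made-up problem that ships starter code but whose test cases carry no fn_name.
starter_code = "def solve():\n    pass\n"
input_output = {"inputs": ["1\n"], "outputs": ["1\n"]}

# The two rules disagree for this problem, so this prints False.
print(answer_type_train(starter_code) == answer_type_eval(input_output))
```

The two rules agree whenever starter code and fn_name are both present or both absent. They diverge only for problems that have one without the other.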

Can this dataset be used to test ChatGPT (GPT-3.5)?

Hi there! I am currently writing my thesis on ChatGPT. The main aim is to evaluate ChatGPT's programming ability. I wonder whether this dataset can be used in conversational form, i.e. giving a prompt to GPT and evaluating the code it returns.

check5 in function "run_test" seems to produce wrong results

Hi, thanks for your work.
I don't quite understand the role of check5 in the evaluation process; it seems to produce some wrong results. Here is an example from test problem 4496.
The question is:

My program is:

When I pass 22 into the program, the ideal return result is “Christmas Eve Eve Eve”, but this program returns “Christmas Eve”. Obviously, this is a wrong answer, but check5 in the “run_test” function judges the result as correct.

Is it a bug? Looking forward to your reply.
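One way such a false positive can arise is if the expected and actual outputs are compared as unordered sets of tokens, because repeated words then collapse. Whether check5 actually does this is an assumption, and the logic in run_test may differ. The helper names below are hypothetical:

```python
def set_based_match(actual, expected):
    """Compare outputs as sets of whitespace-separated tokens (order and repeats ignored)."""
    return set(actual.split()) == set(expected.split())


def token_exact_match(actual, expected):
    """Compare outputs token by token, so order and repeats both matter."""
    return actual.split() == expected.split()


actual = "Christmas Eve"
expected = "Christmas Eve Eve Eve"

print(set_based_match(actual, expected))    # True: both reduce to {"Christmas", "Eve"}
print(token_exact_match(actual, expected))  # False: the token lists differ in length
```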

Request for scripts of fine-tuning

Hi, thanks for the amazing work! I really appreciate that you released the dataset, and I would now like to apply it to other models downloaded from Hugging Face. Could you share the fine-tuning scripts?

Computation of the accuracy scores when there are compilation and runtime errors

Hi, thank you for this great dataset! I have some questions about how you compute the accuracy scores in this file:

apps/eval/test_one_solution.py

Lines 22 to 42 in c55cce3

    def print_results(results, args):
        res = []
        per_prob_res = []
        all_correct = []
        for index in results:
            res.extend(results[index])
            per_prob_res.append(np.mean(results[index]))
            all_correct.append(np.all(results[index]))
        tmp_results = res
        compile_errors = len(tmp_results[tmp_results==-2])
        runtime_errors = len(tmp_results[tmp_results==-1])
        failures = len(tmp_results[tmp_results==False])
        successes = len(tmp_results[tmp_results==True])
        total_testcases = len(res)
        if args.debug:
            print(f"number of compile errors = {compile_errors} avg = {compile_errors / total_testcases }")
            print(f"number of runtime errors = {runtime_errors} avg = {runtime_errors / total_testcases}")
            print(f"number of test cases run = {total_testcases}")

        print(f"Test Case Average (average accuracy over problems) = {np.mean(per_prob_res)}")
        print(f"Strict Accuracy (all test cases passed / total problems) = {np.mean(all_correct)}")

I was curious why you use -2 and -1 for compilation and runtime errors and include them in the average accuracy computation, which can lead to a negative score. It seems more natural to give a False label to code with a syntax or runtime error, just as for code that simply doesn't pass the unit tests.

Also, the expression all_correct.append(np.all(results[index])) will treat -2 and -1 as True, since np.all evaluates nonzero numbers to True, which could give an inflated accuracy.

Below is an example:

print_results({0: [[-2]], 1: [[-2]], 2: [[-2]], 3: [[-2]]}, args)

number of compile errors = 1 avg = 0.25
number of runtime errors = 1 avg = 0.25
number of test cases run = 4
Test Case Average (average accuracy over problems) = -2.0
Strict Accuracy (all test cases passed / total problems) = 1.0

Another thing regarding the expressions:

 compile_errors = len(tmp_results[tmp_results==-2])
 runtime_errors = len(tmp_results[tmp_results==-1])

If I'm not mistaken, this doesn't work (at least on Python 3.9). Another implementation could be:

 compile_errors = len([e for e in tmp_results if -2 in e])
 runtime_errors = len([e for e in tmp_results if -1 in e])
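A minimal sketch of a scoring variant along these lines follows. It is not the repository's implementation: summarize_results and flatten are hypothetical names, and counting every compile or runtime error as a failed test case is the assumption proposed above. As background, when tmp_results is a plain Python list, tmp_results==-2 evaluates to the single value False, so tmp_results[False] is simply tmp_results[0]. That would explain the count of 1 reported in the output above.

```python
import numpy as np

COMPILE_ERROR = -2
RUNTIME_ERROR = -1


def flatten(outcomes):
    """Flatten arbitrarily nested lists of per-test outcomes into one flat list."""
    flat = []
    for item in outcomes:
        if isinstance(item, (list, tuple, np.ndarray)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def summarize_results(results):
    """Score results, counting compile and runtime errors as failed test cases.

    Assumes results is non-empty and every problem has at least one outcome.
    """
    all_outcomes = []
    per_problem_accuracy = []
    strictly_correct = []
    for index in results:
        outcomes = flatten(results[index])
        # Only a positive outcome (True) counts as a pass; False, -1 and -2 do not.
        passed = [bool(o > 0) for o in outcomes]
        all_outcomes.extend(outcomes)
        per_problem_accuracy.append(np.mean(passed))
        strictly_correct.append(all(passed))
    return {
        "compile_errors": sum(1 for o in all_outcomes if o == COMPILE_ERROR),
        "runtime_errors": sum(1 for o in all_outcomes if o == RUNTIME_ERROR),
        "total_testcases": len(all_outcomes),
        "test_case_average": float(np.mean(per_problem_accuracy)),
        "strict_accuracy": float(np.mean(strictly_correct)),
    }


print(summarize_results({0: [[-2]], 1: [[-2]], 2: [[-2]], 3: [[-2]]}))
```

For the four-problem example above, this reports 4 compile errors, a test case average of 0.0 and a strict accuracy of 0.0.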

Problem in ground-truth solutions

Hi, I'm encountering a problem in evaluating the solutions. For a preliminary pipeline in which I want to process all APPS benchmark with an LLM, I'm just taking one random solution among the available ones if present, otherwise using an empty solution. For the competition problems, test split, out of 1000 problems, only 311 have solutions, so in my case I should get a strict accuracy of 31.1% given that the solutions for the other 689 are left empty. However, I get the following results:

Test Case Average (average accuracy over problems) = 0.27318586602648753
Strict Accuracy (all test cases passed / total problems) = 0.263

Here's a screenshot of the last part of the evaluation script. Is it possible that certain solutions are only partially correct?

Thank you in advance for any help!

Show a data instance in the readme

It would be nice to see the data schema before downloading the full dataset, to the benefit of those who might have to write a parser.

About Solutions' validity

Thanks for creating this amazing dataset.
I want to ask whether every solution in Solutions.json is strictly correct (i.e. passes all unit tests in input_output.json).
Did you check this?
If not, how do you ensure the correctness of the solutions?
Thanks!

Wrong arXiv link in main read me file

Paper is easy to find anyway, but just in case you want to fix :)

Request for pretrained models

Hey there!
Congrats and thanks for the amazing work! The APPS dataset would benefit the community greatly.
I really appreciate that you released the GPT2-1.5B finetuned model, but I'm curious whether it would be possible to release the pretrained GPT2-1.5B model as well?

Thank you in advance and happy new year!

DeepSpeed config and TrainingArguments mismatch

Hi, I'm trying to run finetuning to replicate the results in the paper but am getting an error from a mismatch in hyperparameters between deepspeed_config.json and what's specified in tune_apps_gpt.py (e.g. an LR of 1e-4 in deepspeed_config.json, but 5e-5 in tune_apps_gpt.py).

Could you give any guidance on which to use?

The error I'm getting is:

Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
- ds train_batch_size=8 vs hf train_batch_size (calculated)=128
- ds optimizer.params.lr=0.0001 vs hf learning_rate=5e-05
- ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=5e-05
- ds scheduler.params.warmup_num_steps=500 vs hf warmup_steps=0
The easiest method is to set these DeepSpeed config values to 'auto'.

and the command is

USE_TF=NO deepspeed tune_apps_gpt.py  \
  --save-dir=${save_dir}  \
  --arch=EleutherAI/gpt-neo-2.7B \
  --apps-train-files ../data/train \
  --apps-dataroot ../data/train/ \
  --grad-acc-steps=8 \
  --epochs=10 \
  --fp16 \
  --deepspeed deepspeed_config.json \
  --batch-size-per-replica=2 \
  | tee ${save_dir}/log.out

Thanks!
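A minimal sketch of the workaround the error message itself suggests: set the mismatching DeepSpeed values to "auto" so that they are taken from the TrainingArguments side. This does not say which hyperparameters were used for the paper. use_auto_values is a hypothetical helper, and the example config is a made-up fragment containing only the keys named in the error.

```python
import copy
import json


def use_auto_values(ds_config):
    """Return a copy of a DeepSpeed config with the mismatching values set to "auto"."""
    config = copy.deepcopy(ds_config)
    if "train_batch_size" in config:
        config["train_batch_size"] = "auto"
    optimizer_params = config.get("optimizer", {}).get("params", {})
    if "lr" in optimizer_params:
        optimizer_params["lr"] = "auto"
    scheduler_params = config.get("scheduler", {}).get("params", {})
    for key in ("warmup_max_lr", "warmup_num_steps"):
        if key in scheduler_params:
            scheduler_params[key] = "auto"
    return config


# Made-up config fragment containing only the values named in the error message.
example = {
    "train_batch_size": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_max_lr": 1e-4, "warmup_num_steps": 500}},
}
print(json.dumps(use_auto_values(example), indent=2))
```

With "auto", the learning rate and warmup steps would follow the values passed to the training script (5e-05 and 0 in the error above), so this choice changes the effective hyperparameters relative to deepspeed_config.json.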

Steps About Generated Code Solutions Post-processing

Hi, thanks for the amazing work!
I want to ask about the detailed post-processing steps applied to generated code solutions when testing one solution.
(E.g. after a code solution was generated, did you truncate it at stop tokens such as "\nclass", "\ndef" or "\n#"?)
Thanks for your reply!
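For reference, the kind of truncation being asked about could look like the sketch below. Whether the authors applied any such step is not stated here. truncate_at_stop_sequences is a hypothetical helper, and the default stop sequences are simply the ones listed in the question.

```python
def truncate_at_stop_sequences(code, stop_sequences=("\nclass", "\ndef", "\n#")):
    """Cut generated code at the earliest occurrence of any stop sequence."""
    cut = len(code)
    for stop in stop_sequences:
        position = code.find(stop)
        if position != -1:
            cut = min(cut, position)
    return code[:cut]


generated = "x = int(input())\nprint(x * 2)\n# next problem\nprint('extra')"
print(truncate_at_stop_sequences(generated))
```

Note that stopping at "\ndef" would also cut off a legitimate helper function defined after the first statement, so the choice of stop sequences affects which solutions survive.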

Asking for scripts for pre-processing

Dear Hendrycks,

   Thanks for providing the amazing work!
   I would like to fine-tune the model on code data I have collected. Could you provide an example of the custom HTML parser (Paragraph **Dataset Construction** under Section 3) showing how you pre-process the HTML description files? I want to keep the description format consistent.
   Thanks in advance.

Zhenfang

evaluation on multiple solutions at once causes memory leak

Hi @xksteven , I have a question about why you advise to run the evaluation code for one solution at a time instead of doing it for all generations at once?
I have added the metric to the HuggingFace hub https://huggingface.co/spaces/codeparrot/apps_metric (I didn’t change the core script testing_util.py) with evaluation done for all solutions at once and I sometimes get a memory leak for which I can’t identify the source because when I do the evaluation on the same solutions separately this doesn’t happen.

Below is the code that causes memory saturation:

from evaluate import load

generations = [["s = input()\nn = len(s)\nm = 0\n\nfor i in range(n):\n\tc = s[i]\n\tif c == '|':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\telif c == '\\n':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\nif m < 2:\n\tprint(-1)\nelse:\n\tprint(m * 2 - 1)\n"], ["\nx = int(input())\n\nl = list(range(x+1))\n\nm = next(l)\n\ns = sum(list([int(i) for i in str(m)]))\n\nif s > sum(list([int(i) for i in str(m)])) :\n\tm = next(l)\n\t\nprint(m)\n"]]

metric = load("codeparrot/apps_metric")

results = metric.compute(predictions=generations, level="all", debug=False)

While this works fine:

generation_1 = generations[:1]
generation_2 = generations[1:2]
results_1 = metric.compute(predictions=generation_1, level="all", debug=False)
results_2 = metric.compute(predictions=generation_2, level="all", debug=False)
print(results_1)
print(results_2)

{'avg_accuracy': 0.23185840707964603, 'strict_accuracy': 0.0, 'pass_at_k': None}
{'avg_accuracy': 0.0, 'strict_accuracy': 0.0, 'pass_at_k': None}

Why not add sorted(problems) to pin the problem ordering?

Hi, thanks for your hard work.

Should we add the line problems = sorted(problems) here to pin the ordering? Just like:

apps/eval/generate_gpt_codes.py

Line 113 in 473a497

problems = sorted(problems) # Pin some ordering

If I am mistaken, please point it out.

Thanks

Running instructions

Reference: https://github.com/hendrycks/apps#how-to-use

Files train/README and eval/README are not present in the repository.
I would really appreciate it if instructions were added for training and evaluating the models.

GPT-2 0.1B weights

Hello, Could you provide weights of the fine-tuned GPT-2 0.1B mentioned in the paper?

Categorization of Problem Difficulty

Hi, thanks for providing the benchmark dataset. From the paper, problems are categorized into Introductory, Interview, Competition, etc. Could you provide the problem ids corresponding to the difficulty level? It would help a lot.

Test case average of solutions in real dataset

Hi, Have you checked the ground truth solutions in the original dataset to make sure that they pass the test cases?

Unable to run pre-trained (1.5B) model on test set

I'm trying to run the pre-trained 1.5B model linked in the README on the APPS test set. I downloaded the dataset and ran the script train/apps_create_split.py on it, then ran the model with

python generate_gpt_codes.py -t ~/Code/APPS/test.json --load ~/Code/APPS/models/1.5B --save ~/Code/APPS/output/15B

Note that I didn't do training beforehand, the directory models/1.5B is as it was when I downloaded it - I assume this is fine since the README says the models are fine-tuned.

When I look at the contents of all_codes.json, at first it looks okay, but pretty soon all I see are empty entries like this:

... "9": "", "10": "", "11": "", "12": "", "13": "", "14": "" ...

I see several messages in the script output that seem like potential errors:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Input length of input_ids is 1052, but `max_length` is set to 1023. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.

../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [207,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Many of those errors are printed over and over, and then the end of the logs are just this message thousands of times:

Unexpected exception in generating solution
Batch dimension of `input_ids` should be 5, but is 4.

Problems With APPS

Hi. When using the dataset to evaluate the fine-tuned model, I found that the test set has some problems without solution.json. Could you provide the complete set?

Broken dataset download link

Hi, The download link in readme for dataset is currently broken.

Too Long Problems

Some problems in APPS are long, so I truncated them after encoding. However, the model's output has the form "problem + answer", so the output is always longer than the input. The max_length for the output is set to 1024 - input_ids. For the output to fit within that limit, the input has to be much shorter than 1024 tokens; otherwise we won't get a complete answer, even if no error is reported. Is that right? Also, why is the output's max_length set to "1024 - inputs"?
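The arithmetic behind this question can be sketched as follows: if the context window holds 1024 tokens and generation returns prompt plus answer, a prompt of n tokens leaves 1024 - n tokens for the answer. The sketch assumes the question is trimmed before a fixed suffix (such as an answer marker) is appended. fit_prompt and the 256-token answer reserve are hypothetical choices, not values taken from the repository.

```python
def fit_prompt(question_ids, suffix_ids, context_window=1024, min_answer_tokens=256):
    """Trim the question so that prompt + answer fits in the context window.

    Returns the prompt token ids and the number of tokens left for the answer.
    """
    max_question_tokens = context_window - min_answer_tokens - len(suffix_ids)
    if max_question_tokens < 0:
        raise ValueError("suffix and answer budget exceed the context window")
    prompt_ids = list(question_ids[:max_question_tokens]) + list(suffix_ids)
    max_new_tokens = context_window - len(prompt_ids)
    return prompt_ids, max_new_tokens


# Made-up token ids: a 1052-token question and a 10-token suffix.
prompt_ids, max_new_tokens = fit_prompt(list(range(1052)), list(range(10)))
print(len(prompt_ids), max_new_tokens)  # 768 256
```

Short questions are left untouched and simply get a larger answer budget, while long ones are cut so that at least min_answer_tokens remain.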

Problems with fine-tuning

Thanks for the response to my last question. I have generated reasonable results using the provided models and code. I trained my model using the parameters given in tune_apps_gpt.py, but the results are not very good because the datasets it was trained on before are quite different from APPS. I also found that the parameters in tune_apps_gpt.py differ from those in deepspeed_config.json. The same problem has come up in another issue. Is there a single recommended set of values?

Do I need to still reindent if I'm using the APPS dataset hosted on HuggingFace?

I'm browsing through the APPSBaseDataset code and noticed that the code is reindented to match the format of the Github dataset. If I'm using the APPS dataset from HuggingFace, do I still need to do the reindentation before fine-tuning a model on it?

Missing apps-train-files json file?

Hi,

Thank you for releasing this amazing codebase! I found that the APPS data loader needs an apps-train-files JSON file as input, but I couldn't find one in the provided APPS dataset. I wonder whether I am missing something.

Thanks!

Nan test case average

Hello,
I am trying to evaluate my model's generated codes using scripts in eval. However, for a particular problem, results[index] turns out to be an empty array as a result of which calculating mean in print_results() gives nan. How should I handle this case?

    def print_results(results, args):
        res = []
        per_prob_res = []
        all_correct = []
        for index in results:
            res.extend(results[index])
            per_prob_res.append(np.mean(results[index]))
            all_correct.append(np.all(results[index]))
        tmp_results = res
        compile_errors = len(tmp_results[tmp_results==-2])
        runtime_errors = len(tmp_results[tmp_results==-1])
        failures = len(tmp_results[tmp_results==False])
        successes = len(tmp_results[tmp_results==True])
        total_testcases = len(res)
        if args.debug:
            print(f"number of compile errors = {compile_errors} avg = {compile_errors / total_testcases }")
            print(f"number of runtime errors = {runtime_errors} avg = {runtime_errors / total_testcases}")
            print(f"number of test cases run = {total_testcases}")

        print(f"Test Case Average (average accuracy over problems) = {np.mean(per_prob_res)}")
        print(f"Strict Accuracy (all test cases passed / total problems) = {np.mean(all_correct)}")
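One possible way to handle this case is to guard the per-problem computation. np.mean of an empty array is NaN and np.all of an empty array is True, so an empty result would otherwise skew both metrics. problem_scores is a hypothetical helper. Scoring an empty result as a failure (rather than skipping the problem) is an assumption, as is treating -1 and -2 as failed tests. The sketch also assumes the per-problem results form a rectangular (non-ragged) list.

```python
import numpy as np


def problem_scores(outcomes):
    """Return (average accuracy, all tests passed) for one problem.

    An empty outcome list is scored as 0.0 and False instead of NaN and True.
    """
    flat = np.asarray(outcomes, dtype=float).ravel()
    if flat.size == 0:
        return 0.0, False
    # Only positive outcomes (True) count as passes; False, -1 and -2 do not.
    passed = flat > 0
    return float(passed.mean()), bool(passed.all())


print(problem_scores([]))
print(problem_scores([[True, False]]))
```

Inside the loop, per_prob_res and all_correct could then be filled from the two returned values instead of calling np.mean and np.all directly.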
