hendrycks / apps

APPS: Automated Programming Progress Standard (NeurIPS 2021)

License: MIT License

Python 98.50% Shell 1.50%
code-generation program-synthesis

apps's Introduction

I'm Dan, a PhD student in ML at UC Berkeley.

See my webpage for my research.

See below for my code.


apps's Issues

Can this dataset be used to test ChatGPT (GPT-3.5)?

Hi there! I am currently writing my thesis on ChatGPT. The main aim is to evaluate ChatGPT's programming ability, and I wonder whether this dataset can be used in conversational form, i.e., giving a prompt to the model and evaluating the code it returns.

check5 in the "run_test" function seems to produce wrong results

Hi, thanks for your work.
I don't quite understand the role of check5 in the evaluation process; it seems to produce some wrong results. Here is an example from test problem 4496.
The question is:
[image: problem statement]
My program is:
[image: submitted program]
When I pass 22 into the program, the expected output is "Christmas Eve Eve Eve", but this program returns "Christmas Eve". Obviously, this is a wrong answer, but check5 in the "run_test" function judges the result as correct.
[image: evaluation result]
Is this a bug? Looking forward to your reply.
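
For what it's worth, a minimal illustration of how a lenient, set-based comparison of whitespace-split tokens can mark this wrong answer as correct (this is only an assumption about what a check like check5 might do, not a quote of the repository's code):

expected = "Christmas Eve Eve Eve"
produced = "Christmas Eve"

# A set comparison collapses the repeated "Eve" tokens, so the two outputs look equal.
print(set(expected.split()) == set(produced.split()))  # True
# An order- and duplicate-aware comparison correctly rejects the wrong answer.
print(expected.split() == produced.split())            # False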

Request for scripts of fine-tuning

Hi, thanks for the amazing work! I really appreciate that you released the dataset, but now I want to apply it to other models downloaded from Hugging Face. Could you share the fine-tuning scripts?

Computation of the accuracy scores when there are compilation and runtime errors

Hi, thank you for this great dataset! I have some questions about how you compute the accuracy scores in this function:

def print_results(results, args):
    res = []
    per_prob_res = []
    all_correct = []
    for index in results:
        res.extend(results[index])
        per_prob_res.append(np.mean(results[index]))
        all_correct.append(np.all(results[index]))
    tmp_results = res
    compile_errors = len(tmp_results[tmp_results==-2])
    runtime_errors = len(tmp_results[tmp_results==-1])
    failures = len(tmp_results[tmp_results==False])
    successes = len(tmp_results[tmp_results==True])
    total_testcases = len(res)
    if args.debug:
        print(f"number of compile errors = {compile_errors} avg = {compile_errors / total_testcases }")
        print(f"number of runtime errors = {runtime_errors} avg = {runtime_errors / total_testcases}")
        print(f"number of test cases run = {total_testcases}")
    print(f"Test Case Average (average accuracy over problems) = {np.mean(per_prob_res)}")
    print(f"Strict Accuracy (all test cases passed / total problems) = {np.mean(all_correct)}")
I was curious why you use -2 and -1 for compilation and runtime errors and include them in the computation of the average accuracy, which can lead to a negative score. It seems more natural to give a False label to code with a syntax/runtime error, similarly to code that simply doesn't pass the unit tests.

Also, the expression all_correct.append(np.all(results[index])) will treat -2 and -1 as True, since np.all evaluates non-zero numbers to True, which can inflate the reported accuracy.

Below is an example:

print_results({0: [[-2]], 1: [[-2]], 2: [[-2]], 3: [[-2]]}, args)
number of compile errors = 1 avg = 0.25
number of runtime errors = 1 avg = 0.25
number of test cases run = 4
Test Case Average (average accuracy over problems) = -2.0
Strict Accuracy (all test cases passed / total problems) = 1.0

Another thing regarding the expressions:

 compile_errors = len(tmp_results[tmp_results==-2])
 runtime_errors = len(tmp_results[tmp_results==-1])

If I'm not mistaken, this doesn't work as intended (at least on Python 3.9), since tmp_results is a plain Python list rather than a NumPy array; another implementation could be

 compile_errors = len([e for e in tmp_results if -2 in e])
 runtime_errors = len([e for e in tmp_results if -1 in e])
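
For reference, a possible rewrite of the whole computation (a sketch only, assuming one generation per problem as in the example above; not the repository's code) that flattens the results into NumPy arrays and scores compile/runtime errors as failures:

import numpy as np

def print_results_fixed(results):
    # Count -2/-1 separately, but treat them as failed test cases when scoring.
    per_prob_res, all_correct, flat = [], [], []
    for index in results:
        outcomes = np.asarray(results[index]).flatten()
        passed = outcomes == True  # only a literal True counts as a passed test case
        per_prob_res.append(np.mean(passed))
        all_correct.append(np.all(passed))
        flat.append(outcomes)
    flat = np.concatenate(flat)
    print(f"number of compile errors = {np.sum(flat == -2)}")
    print(f"number of runtime errors = {np.sum(flat == -1)}")
    print(f"number of test cases run = {flat.size}")
    print(f"Test Case Average (average accuracy over problems) = {np.mean(per_prob_res)}")
    print(f"Strict Accuracy (all test cases passed / total problems) = {np.mean(all_correct)}")

On the example above ({0: [[-2]], 1: [[-2]], 2: [[-2]], 3: [[-2]]}), this would report 4 compile errors, a Test Case Average of 0.0, and a Strict Accuracy of 0.0.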

Problem in ground-truth solutions

Hi, I'm encountering a problem when evaluating the solutions. For a preliminary pipeline in which I process the whole APPS benchmark with an LLM, I simply take one random solution among the available ones when present, and use an empty solution otherwise. For the competition problems of the test split, only 311 out of 1000 problems have solutions, so I should get a strict accuracy of 31.1%, given that the solutions for the other 689 are left empty. However, I get the following results:

Test Case Average (average accuracy over problems) = 0.27318586602648753
Strict Accuracy (all test cases passed / total problems) = 0.263

Here's a screenshot of the last part of the evaluation script's output. Is it possible that certain solutions are only partially correct?

Thank you in advance for any help!

[screenshot: final lines of the evaluation output]
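
For context, a minimal sketch of the selection step described above (hypothetical helper, assuming each problem's solutions.json holds a list of solution strings):

import json
import random

def pick_solution(problem_dir):
    # Use one random ground-truth solution if available, otherwise an empty solution.
    try:
        with open(f"{problem_dir}/solutions.json") as f:
            solutions = json.load(f)
    except FileNotFoundError:
        solutions = []
    return random.choice(solutions) if solutions else ""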

Show a data instance in the readme

It would be nice to see the data schema before downloading the full dataset, for the benefit of those who might have to write a parser.
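
As an illustration only (the field names are inferred from the files mentioned in other issues, such as input_output.json and solutions.json, not from an authoritative schema), a single problem could look roughly like this once loaded into Python:

# Hypothetical sketch of one APPS problem after loading its files; verify against the real data.
problem = {
    "question": "Given an integer D between 22 and 25, print ...",  # question.txt (illustrative text)
    "input_output": {                                               # input_output.json
        "inputs": ["22\n", "25\n"],
        "outputs": ["Christmas Eve Eve Eve\n", "Christmas\n"],
    },
    "solutions": [                                                  # solutions.json: reference programs
        "D = int(input())\nprint('Christmas' + ' Eve' * (25 - D))\n",
    ],
}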

About Solutions' validity

Thanks for creating this amazing dataset.
I want to ask whether every solution in solutions.json is strictly correct (i.e., passes all unit tests in input_output.json).
Did you check this?
If not, how do you ensure the correctness of the solutions?
Thanks!
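
One way to spot-check this yourself (a sketch only, for stdin/stdout-style problems; a hypothetical helper, not the repository's evaluation code):

import json
import subprocess

def passes_all_tests(solution_path, io_path, timeout=10):
    # Run the candidate solution on every stored input and compare stdout with the expected output.
    with open(io_path) as f:
        io = json.load(f)
    for inp, expected in zip(io["inputs"], io["outputs"]):
        proc = subprocess.run(["python", solution_path], input=inp,
                              capture_output=True, text=True, timeout=timeout)
        if proc.stdout.strip() != expected.strip():
            return False
    return True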

Request for pretrained models

Hey there!
Congrats and thanks for the amazing work! The APPS dataset would benefit the community greatly.
I really appreciate that you released the fine-tuned GPT-2 1.5B model, but just out of curiosity, would it be possible to release the pretrained GPT-2 1.5B model as well?

Thank you in advance and happy new year!

DeepSpeed config and TrainingArguments mismatch

Hi, I'm trying to run finetuning to replicate the results in the paper but am getting an error from a mismatch in hyperparameters between deepspeed_config.json and what's specified in tune_apps_gpt.py (e.g. an LR of 1e-4 in deepspeed_config.json, but 5e-5 in tune_apps_gpt.py).

Could you give any guidance on which to use?

The error I'm getting is:

Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
- ds train_batch_size=8 vs hf train_batch_size (calculated)=128
- ds optimizer.params.lr=0.0001 vs hf learning_rate=5e-05
- ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=5e-05
- ds scheduler.params.warmup_num_steps=500 vs hf warmup_steps=0
The easiest method is to set these DeepSpeed config values to 'auto'.  

and the command is

USE_TF=NO deepspeed tune_apps_gpt.py  \
  --save-dir=${save_dir}  \
  --arch=EleutherAI/gpt-neo-2.7B \
  --apps-train-files ../data/train \
  --apps-dataroot ../data/train/ \
  --grad-acc-steps=8 \
  --epochs=10 \
  --fp16 \
  --deepspeed deepspeed_config.json \
  --batch-size-per-replica=2 \
  | tee ${save_dir}/log.out

Thanks!
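
(For reference, a minimal sketch of the workaround the error message suggests: rewriting the mismatched fields in deepspeed_config.json to "auto" so that the Hugging Face Trainer fills them in. This is a hypothetical helper, not part of the repo, and which hyperparameters reproduce the paper is still the open question here.)

import json

# Let the HF Trainer supply these values instead of hard-coding them in the DeepSpeed config.
with open("deepspeed_config.json") as f:
    cfg = json.load(f)

cfg["train_batch_size"] = "auto"
cfg["optimizer"]["params"]["lr"] = "auto"
cfg["scheduler"]["params"]["warmup_max_lr"] = "auto"
cfg["scheduler"]["params"]["warmup_num_steps"] = "auto"

with open("deepspeed_config.json", "w") as f:
    json.dump(cfg, f, indent=2)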

Post-processing steps for generated code solutions

Hi, thanks for the amazing work!
I want to ask about the detailed post-processing steps applied to generated code solutions before testing them.
(E.g., after a code solution was generated, did you truncate it at stop tokens such as "\nclass", "\ndef", "\n#"?)
Thanks for your reply!
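
For concreteness, the kind of truncation being asked about usually looks roughly like this (a sketch using the stop tokens from the question; not a statement of what the authors actually did):

def truncate_at_stop_tokens(generation, stop_tokens=("\nclass", "\ndef", "\n#")):
    # Cut the generated text at the earliest occurrence of any stop token.
    cut = len(generation)
    for tok in stop_tokens:
        idx = generation.find(tok)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]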

Asking for scripts for pre-processing

Dear Hendrycks,

Thanks for providing this amazing work!
Recently, I have wanted to fine-tune the model on my own collected code data. Could you provide an exemplar custom HTML parser (paragraph **Dataset Construction** under Section 3) showing how you pre-process the HTML description files? I want to keep the description format consistent.
Thanks in advance.

Zhenfang

evaluation on multiple solutions at once causes memory leak

Hi @xksteven, I have a question about why you advise running the evaluation code for one solution at a time instead of doing it for all generations at once.
I have added the metric to the Hugging Face Hub at https://huggingface.co/spaces/codeparrot/apps_metric (I didn't change the core script testing_util.py), with evaluation done for all solutions at once, and I sometimes get a memory leak whose source I can't identify, because when I evaluate the same solutions separately this doesn't happen.

Below is the code that causes memory saturation:

from evaluate import load

generations = [["s = input()\nn = len(s)\nm = 0\n\nfor i in range(n):\n\tc = s[i]\n\tif c == '|':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\telif c == '\\n':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\nif m < 2:\n\tprint(-1)\nelse:\n\tprint(m * 2 - 1)\n"], ["\nx = int(input())\n\nl = list(range(x+1))\n\nm = next(l)\n\ns = sum(list([int(i) for i in str(m)]))\n\nif s > sum(list([int(i) for i in str(m)])) :\n\tm = next(l)\n\t\nprint(m)\n"]]

metric = load("codeparrot/apps_metric")

results = metric.compute(predictions=generations, level="all", debug=False)

While this works fine:

generation_1 = generations[:1]
generation_2 = generations[1:2]
results_1 = metric.compute(predictions=generation_1, level="all", debug=False)
results_2 = metric.compute(predictions=generation_2, level="all", debug=False)
print(results_1)
print(results_2)
{'avg_accuracy': 0.23185840707964603, 'strict_accuracy': 0.0, 'pass_at_k': None}
{'avg_accuracy': 0.0, 'strict_accuracy': 0.0, 'pass_at_k': None}

GPT-2 0.1B weights

Hello, could you provide the weights of the fine-tuned GPT-2 0.1B model mentioned in the paper?

Categorization of Problem Difficulty

Hi, thanks for providing the benchmark dataset. In the paper, problems are categorized into Introductory, Interview, Competition, etc. Could you provide the problem IDs corresponding to each difficulty level? It would help a lot.
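
In case it helps, a sketch of how such a mapping could be built, assuming each problem directory contains a metadata.json file with a difficulty field (an assumption worth verifying against the downloaded data):

import glob
import json
from collections import defaultdict

def problems_by_difficulty(dataroot):
    # Group problem directories by the difficulty recorded in their metadata.json.
    groups = defaultdict(list)
    for meta_path in sorted(glob.glob(f"{dataroot}/*/metadata.json")):
        with open(meta_path) as f:
            difficulty = json.load(f).get("difficulty", "unknown")
        groups[difficulty].append(meta_path.rsplit("/", 2)[-2])
    return groups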

Unable to run pre-trained (1.5B) model on test set

I'm trying to run the pre-trained 1.5B model linked in the README on the APPS test set. I downloaded the dataset, ran the script train/apps_create_split.py on it, and then ran the model with

python generate_gpt_codes.py -t ~/Code/APPS/test.json --load ~/Code/APPS/models/1.5B --save ~/Code/APPS/output/15B

Note that I didn't do any training beforehand; the directory models/1.5B is as it was when I downloaded it. I assume this is fine, since the README says the models are fine-tuned.

When I look at the contents of all_codes.json, at first it looks okay, but pretty soon all I see are empty entries like this:

... "9": "", "10": "", "11": "", "12": "", "13": "", "14": "" ...

I see several messages in the script output that seem like potential errors:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 1052, but `max_length` is set to 1023. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [207,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Many of those errors are printed over and over, and then the end of the log is just this message thousands of times:

Unexpected exception in generating solution
Batch dimension of `input_ids` should be 5, but is 4.
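
The padding warnings suggest one possible culprit. A minimal sketch of the tokenizer setup they point at (hypothetical, using the transformers GPT-2 tokenizer; not necessarily how generate_gpt_codes.py configures it):

from transformers import GPT2Tokenizer

# Left padding and an explicit pad token are needed for batched, decoder-only generation.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# Prompts longer than the model's 1024-token context must also be truncated,
# otherwise out-of-range position indices can trigger the CUDA assertion above.
enc = tokenizer(["def solve():"], return_tensors="pt", padding=True,
                truncation=True, max_length=1023)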

Problems With APPS

Hi. When using the dataset to evaluate the fine-tuned model, I found that some problems in the test set have no solutions.json. Could you provide the complete set?

Too Long Problems

There are some very long problems in APPS, so I truncate them after encoding. But the model's output takes the form "problem + answer", so the output is necessarily longer than the input. The generation max_length is set to 1024 minus the number of input tokens. In practice, for the output to be complete, the input has to be much shorter than 1024, otherwise we won't get a complete answer even if no error is reported. Is that right? Also, why is the output's max_length set to 1024 minus the input length?
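
A small sketch of the budgeting the question is about (illustrative only, with a hypothetical max_new_tokens; GPT-2-style models share a 1024-token context between prompt and completion):

CONTEXT_LENGTH = 1024

def truncate_prompt(input_ids, max_new_tokens=512):
    # Reserve room for the completion: prompt tokens + generated tokens must fit in the context.
    budget = CONTEXT_LENGTH - max_new_tokens
    # Keep the end of the prompt (closest to where generation starts) if it is too long.
    return input_ids[-budget:] if len(input_ids) > budget else input_ids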

Problems with fine-tuning

Thanks for the response to my last question. I have generated reasonable results using the provided models and code. I trained my own model following the parameters given in tune_apps_gpt.py, but the results are not as good, because the datasets I used beforehand are quite different from APPS. I also found that the parameters in tune_apps_gpt.py differ from those in deepspeed_config.json; the same problem has come up in another issue. Is there a definitive answer on which to use?

Missing apps-train-files json file?

Hi,

Thank you for releasing this amazing codebase! I found that the APPS data loader needs an apps-train-files json file as input, but I couldn't find such a file in the provided APPS dataset. I wonder if I am missing something.

Thanks!

NaN test case average

Hello,
I am trying to evaluate my model's generated code using the scripts in eval. However, for a particular problem, results[index] turns out to be an empty array, and as a result computing the mean in print_results() gives NaN. How should I handle this case?
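
One defensive workaround (a sketch, not the repository's fix) is to treat an empty result array as zero accuracy instead of propagating NaN:

import numpy as np

def safe_mean(values):
    # A problem with no recorded test results counts as 0 accuracy rather than NaN.
    arr = np.asarray(values, dtype=float).ravel()
    return float(arr.mean()) if arr.size else 0.0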
