I'm Dan, a PhD student in ML at UC Berkeley.
See my webpage for my research.
See below for my code.
APPS: Automated Programming Progress Standard (NeurIPS 2021)
License: MIT License
Not necessarily an issue, but I noticed that for train/val, the answer_type is based on whether starter_code exists but that at eval time, it's based on fn_name. Is there a reason for this difference?
Hi there! I am currently writing my thesis on ChatGPT. The main aim is to evaluate ChatGPT's programming ability. I wonder whether this dataset can be used in a conversational form, for example by giving a prompt to GPT and evaluating the code it returns.
Hi, thanks for your work.
I don't quite understand the role of check5 in the evaluation process; it seems to produce some wrong results. Here is an example from test problem 4496.
The question is:
My program is:
When I pass 22 into the program, the ideal return result is “Christmas Eve Eve Eve”, but this program returns “Christmas Eve”. Obviously, this is a wrong answer, but check5 in the “run_test” function judges the result as correct.
Is it a bug? Looking forward to your reply.
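One way a false positive like this can arise is if a fallback check compares outputs as sets of tokens, which discards repeated words. The sketch below assumes that behaviour purely for illustration: the helper names set_based_match and exact_match are hypothetical, and this is not a quote of check5 from the repository.

```python
def set_based_match(predicted: str, expected: str) -> bool:
    """Hypothetical lenient check: compare the sets of whitespace-separated tokens."""
    return frozenset(predicted.split()) == frozenset(expected.split())


def exact_match(predicted: str, expected: str) -> bool:
    """Stricter check: compare token sequences, so order and repeats matter."""
    return predicted.split() == expected.split()


predicted = "Christmas Eve"
expected = "Christmas Eve Eve Eve"

# The repeated "Eve" tokens collapse into one element of the set.
print(set_based_match(predicted, expected))  # True
print(exact_match(predicted, expected))      # False
```

If the real check works along these lines, any problem whose expected output repeats a word could be affected in the same way.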
Hi, thanks for the amazing work! I really appreciate that you released the dataset, and now I want to apply it to other models downloaded from Hugging Face. Could I get the fine-tuning scripts?
Hi, thank you for this great dataset! I have some questions about how you compute the accuracy scores in this code:
apps/eval/test_one_solution.py
Lines 22 to 42 in c55cce3
The results use -2 and -1 for compilation and runtime errors and include them in the average computation of the accuracy, which could lead to a negative score. It seems more natural to give a False label to code with a syntax/runtime error, similarly to code that just doesn't pass the unit tests.
Also, the expression all_correct.append(np.all(results[index])) will consider -2 and -1 as True, since np.all evaluates nonzero numbers to True, which could give a false accuracy.
Below is an example:
print_results({0: [[-2]], 1: [[-2]], 2: [[-2]], 3: [[-2]]}, args)
number of compile errors = 1 avg = 0.25
number of runtime errors = 1 avg = 0.25
number of test cases run = 4
Test Case Average (average accuracy over problems) = -2.0
Strict Accuracy (all test cases passed / total problems) = 1.0
Another thing regarding the expressions:
compile_errors = len(tmp_results[tmp_results==-2])
runtiome_errors = len(tmp_results[tmp_results==-1])
If I'm not mistaken, this doesn't work (at least on Python 3.9). Another implementation could be:
compile_errors = len([e for e in tmp_results if -2 in e])
runtiome_errors = len([e for e in tmp_results if -1 in e])
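To make the proposal concrete, here is a minimal sketch of an aggregation that counts -1 and -2 as failures instead of averaging them in. It assumes results[index] is a list of attempts, each a list of per-test outcomes (True/False, -1 for a runtime error, -2 for a compile error); the function name summarize and the returned keys are hypothetical, and this is not the repository's print_results.

```python
def summarize(results):
    """Aggregate outcomes, treating error codes (-1, -2) as failed tests."""
    test_case_acc = []
    all_correct = []
    compile_errors = 0
    runtime_errors = 0
    for index in sorted(results):
        # Flatten the attempts for this problem into one list of outcomes.
        outcomes = [o for attempt in results[index] for o in attempt]
        compile_errors += int(any(o == -2 for o in outcomes))
        runtime_errors += int(any(o == -1 for o in outcomes))
        # Only a positive outcome is a pass. Plain truthiness would be wrong
        # here, because all([-2]) is True just like np.all([-2]).
        passed = [bool(o > 0) for o in outcomes]
        test_case_acc.append(sum(passed) / len(passed) if passed else 0.0)
        all_correct.append(bool(passed) and all(passed))
    n = len(results)
    return {
        "compile_errors": compile_errors,
        "runtime_errors": runtime_errors,
        "test_case_average": sum(test_case_acc) / n if n else 0.0,
        "strict_accuracy": sum(all_correct) / n if n else 0.0,
    }


print(summarize({0: [[-2]], 1: [[-2]], 2: [[-2]], 3: [[-2]]}))
```

On the example input from this issue, both accuracies come out as 0.0 and all four problems are counted as compile errors.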
Hi, I'm encountering a problem in evaluating the solutions. For a preliminary pipeline in which I want to process the whole APPS benchmark with an LLM, I'm just taking one random solution among the available ones if present, and an empty solution otherwise. For the competition problems in the test split, only 311 of the 1000 problems have solutions, so I should get a strict accuracy of 31.1%, given that the solutions for the other 689 are left empty. However, I get the following results:
Test Case Average (average accuracy over problems) = 0.27318586602648753
Strict Accuracy (all test cases passed / total problems) = 0.263
Here's a screenshot of the last part of the evaluation script. Is it possible that certain solutions are only partially correct?
Thank you in advance for any help!
It would be nice to see the data schema before downloading the full dataset, to the benefit of those who might have to write a parser.
Thanks for creating this amazing dataset.
I want to ask whether every solution in Solutions.json is strictly correct, i.e. passes all unit tests in input_output.json.
Did you check this?
If not, how do you ensure the correctness of the solutions?
Thanks!
Paper is easy to find anyway, but just in case you want to fix :)
Hey there!
Congrats and thanks for the amazing work! The APPS dataset would benefit the community greatly.
I really appreciate that you released the fine-tuned GPT2-1.5B model. Just curious: would it be possible to release the pretrained GPT2-1.5B model as well?
Thank you in advance and happy new year!
Hi, I'm trying to run finetuning to replicate the results in the paper but am getting an error from a mismatch in hyperparameters between deepspeed_config.json and what's specified in tune_apps_gpt.py (e.g. an LR of 1e-4 in deepspeed_config.json, but 5e-5 in tune_apps_gpt.py).
Could you give any guidance on which to use?
The error I'm getting is:
Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
- ds train_batch_size=8 vs hf train_batch_size (calculated)=128
- ds optimizer.params.lr=0.0001 vs hf learning_rate=5e-05
- ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=5e-05
- ds scheduler.params.warmup_num_steps=500 vs hf warmup_steps=0
The easiest method is to set these DeepSpeed config values to 'auto'.
and the command is
USE_TF=NO deepspeed tune_apps_gpt.py \
--save-dir=${save_dir} \
--arch=EleutherAI/gpt-neo-2.7B \
--apps-train-files ../data/train \
--apps-dataroot ../data/train/ \
--grad-acc-steps=8 \
--epochs=10 \
--fp16 \
--deepspeed deepspeed_config.json \
--batch-size-per-replica=2 \
| tee ${save_dir}/log.out
Thanks!
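Following the hint in the error message, one option is to set the mismatching DeepSpeed values to "auto" so that the Hugging Face Trainer fills them in from its own arguments. The sketch below is only an illustration: the example dictionary contains just the four values reported in the error, not the repository's full deepspeed_config.json, and the helper name set_auto is hypothetical.

```python
import json

# Key paths named in the error message above.
AUTO_KEYS = [
    ("train_batch_size",),
    ("optimizer", "params", "lr"),
    ("scheduler", "params", "warmup_max_lr"),
    ("scheduler", "params", "warmup_num_steps"),
]


def set_auto(config: dict, keys=AUTO_KEYS) -> dict:
    """Return a copy of config with each listed key path set to "auto"."""
    patched = json.loads(json.dumps(config))  # simple deep copy for JSON data
    for path in keys:
        node = patched
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = "auto"
    return patched


# Illustrative config holding only the values reported in the error message.
example = {
    "train_batch_size": 8,
    "optimizer": {"params": {"lr": 1e-4}},
    "scheduler": {"params": {"warmup_max_lr": 1e-4, "warmup_num_steps": 500}},
}
print(json.dumps(set_auto(example), indent=2))
```

Note that this only removes the mismatch by deferring to the command-line values. It does not say which learning rate or warmup schedule was used for the results in the paper.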
Hi, thanks for the amazing work!
I want to ask about the detailed post-processing steps applied to generated code solutions when testing one solution.
For example, after a code solution was generated, did you truncate it at stop tokens such as “\nclass”, “\ndef”, “\n#”?
Thanks for your reply!
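For readers unfamiliar with the technique being asked about, stop-token truncation can be sketched as follows. This is an illustration of the general idea using the stop sequences listed in the question; whether the APPS evaluation does anything like it is exactly what the question is asking, and the function name truncate_at_stop is hypothetical.

```python
def truncate_at_stop(code: str, stop_sequences=("\nclass", "\ndef", "\n#")) -> str:
    """Cut the generated text at the earliest occurrence of any stop sequence."""
    cut = len(code)
    for stop in stop_sequences:
        pos = code.find(stop)
        if pos != -1:
            cut = min(cut, pos)
    return code[:cut]


generated = "n = int(input())\nprint(n * 2)\n# next problem\nprint('extra')"
print(truncate_at_stop(generated))
```

A generation that contains none of the stop sequences is returned unchanged.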
Dear Hendrycks,
Thanks for providing the amazing work!
Recently, I have wanted to fine-tune the model on code data I collected. Could you provide an example of the custom HTML parser (paragraph **Dataset Construction** under Section 3) showing how you pre-process the HTML description files? I want to keep the description format consistent.
Thanks in advance.
Zhenfang
Hi @xksteven, I have a question: why do you advise running the evaluation code for one solution at a time instead of for all generations at once?
I have added the metric to the HuggingFace hub https://huggingface.co/spaces/codeparrot/apps_metric (I didn’t change the core script testing_util.py) with evaluation done for all solutions at once and I sometimes get a memory leak for which I can’t identify the source because when I do the evaluation on the same solutions separately this doesn’t happen.
Below is the code that causes memory saturation:
from evaluate import load
generations = [["s = input()\nn = len(s)\nm = 0\n\nfor i in range(n):\n\tc = s[i]\n\tif c == '|':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\telif c == '\\n':\n\t\tif m < 2:\n\t\t\tm = 2\n\t\telse:\n\t\t\tm += 1\n\nif m < 2:\n\tprint(-1)\nelse:\n\tprint(m * 2 - 1)\n"], ["\nx = int(input())\n\nl = list(range(x+1))\n\nm = next(l)\n\ns = sum(list([int(i) for i in str(m)]))\n\nif s > sum(list([int(i) for i in str(m)])) :\n\tm = next(l)\n\t\nprint(m)\n"]]
metric = load("codeparrot/apps_metric")
results = metric.compute(predictions=generations, level="all", debug=False)
While this works fine:
generation_1 = generations[:1]
generation_2 = generations[1:2]
results_1 = metric.compute(predictions=generation_1, level="all", debug=False)
results_2 = metric.compute(predictions=generation_2, level="all", debug=False)
print(results_1)
print(results_2)
{'avg_accuracy': 0.23185840707964603, 'strict_accuracy': 0.0, 'pass_at_k': None}
{'avg_accuracy': 0.0, 'strict_accuracy': 0.0, 'pass_at_k': None}
Hi, thanks for your hard work.
Should we add a line of code, problems = sorted(problems), here to pin the ordering?
apps/eval/generate_gpt_codes.py
Line 113 in 473a497
If I have made a mistake, please point it out.
Thanks
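A small sketch of what the suggested line would guarantee, assuming the problems file is a JSON list of path strings. The paths and the helper name load_problem_paths below are made up for illustration.

```python
import json


def load_problem_paths(serialized: str) -> list:
    """Parse a JSON list of problem paths and return it in a fixed order."""
    problems = json.loads(serialized)
    return sorted(problems)


# The same set of paths written in two different orders.
a = load_problem_paths('["test/0002", "test/0010", "test/0001"]')
b = load_problem_paths('["test/0010", "test/0001", "test/0002"]')
print(a == b)  # True: indices now refer to the same problem in both runs
```

Sorting strings is lexicographic, so this matches numeric order only when the directory names are zero-padded to the same width.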
Reference: https://github.com/hendrycks/apps#how-to-use
Files train/README and eval/README are not present in the repository.
Would really appreciate if the instructions are added for training and evaluation of the models.
Hello, Could you provide weights of the fine-tuned GPT-2 0.1B mentioned in the paper?
Hi, thanks for providing the benchmark dataset. From the paper, problems are categorized into Introductory, Interview, Competition, etc. Could you provide the problem ids corresponding to the difficulty level? It would help a lot.
Hi, Have you checked the ground truth solutions in the original dataset to make sure that they pass the test cases?
I'm trying to run the pre-trained 1.5B model linked in the README on the APPS test set. I downloaded the dataset and ran the script train/apps_create_split.py on it, then ran the model with
python generate_gpt_codes.py -t ~/Code/APPS/test.json --load ~/Code/APPS/models/1.5B --save ~/Code/APPS/output/15B
Note that I didn't do training beforehand; the directory models/1.5B is as it was when I downloaded it. I assume this is fine since the README says the models are fine-tuned.
When I look at the contents of all_codes.json, at first it looks okay, but pretty soon all I see are empty entries like this:
... "9": "", "10": "", "11": "", "12": "", "13": "", "14": "" ...
I see several messages in the script output that seem like potential errors:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 1052, but `max_length` is set to 1023. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [207,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Many of those errors are printed over and over, and then the end of the log is just this message thousands of times:
Unexpected exception in generating solution
Batch dimension of `input_ids` should be 5, but is 4.
Hi, The download link in readme for dataset is currently broken.
There are some long problems in APPS, so I truncated them after encoding. But the output of the model has the form "problem + answer", so the output is always longer than the input. The max_length for the output is set to 1024 - input_ids. So for the output to fit, the input has to be much shorter than 1024; otherwise we won't get a complete answer, even if no error is reported. Is that right? Also, why is the max_length of the output set to "1024-inputs"?
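The budget in question can be written out as arithmetic. The sketch below assumes a 1024-token window shared by the prompt and the answer, trims the prompt so that some minimum number of tokens is left for the answer, and returns that remaining budget. The function name plan_generation, the min_new_tokens default of 256, and the choice to keep the start of the prompt are all assumptions made for illustration, not the repository's behaviour.

```python
def plan_generation(input_ids, context_window=1024, min_new_tokens=256):
    """Trim the prompt so that at least min_new_tokens fit in the window.

    Returns (trimmed_input_ids, max_new_tokens).
    """
    if not 0 < min_new_tokens < context_window:
        raise ValueError("min_new_tokens must be between 1 and context_window - 1")
    max_prompt_len = context_window - min_new_tokens
    # Keeping the first tokens is an arbitrary choice made for this sketch.
    trimmed = list(input_ids[:max_prompt_len])
    return trimmed, context_window - len(trimmed)


# A 1052-token prompt, like the one in the warning quoted in an earlier issue.
prompt = list(range(1052))
trimmed, budget = plan_generation(prompt)
print(len(trimmed), budget)  # 768 256
```

With an untrimmed 1052-token prompt, 1024 - 1052 is negative, so there is no room left for an answer at all.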
Thanks for the response to my last question. I have generated reasonable results using the provided models and code. I trained my model with reference to the parameters given in tune_apps_gpt.py, but the results are not so good because the datasets used before are quite different from APPS. I also find that the parameters in tune_apps_gpt.py differ from those in deepspeed_config.json. The same problem has come up in another issue. Is there a definitive answer on which to use?
I'm browsing through the APPSBaseDataset code and noticed that the code is reindented to match the format of the Github dataset. If I'm using the APPS dataset from HuggingFace, do I still need to do the reindentation before fine-tuning a model on it?
Hi,
Thank you for releasing this amazing codebase! I found that appsdata needs an apps-train-files JSON file as input, but I couldn't find one in the provided APPS dataset. I wonder if I am missing something.
Thanks!
Hello,
I am trying to evaluate my model's generated codes using scripts in eval. However, for a particular problem, results[index] turns out to be an empty array as a result of which calculating mean in print_results() gives nan. How should I handle this case?
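One possible way to avoid the nan is to guard the mean against empty inputs. The sketch below scores a problem with no recorded test results as 0.0; that default is an assumption (skipping such problems is another option), and the helper name mean_or_default is hypothetical.

```python
def mean_or_default(values, default=0.0):
    """Mean of values, or default when there is nothing to average."""
    values = [float(v) for v in values]
    if not values:
        return default
    return sum(values) / len(values)


# Problem 1 has no recorded results, as in the case described above.
results = {0: [True, False, True, True], 1: []}
per_problem = {index: mean_or_default(r) for index, r in results.items()}
print(per_problem)  # {0: 0.75, 1: 0.0}
```

Scoring empty results as 0.0 keeps the problem in the denominator, so the overall average stays comparable across runs.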