the-crypt-keeper / can-ai-code Goto Github PK

View Code? Open in Web Editor NEW

500.0 12.0 27.0 6.11 MB

Self-evaluating interview for AI coders

Home Page: https://huggingface.co/spaces/mike-ravkine/can-ai-code-results

License: MIT License

Python 98.84% Shell 0.69% JavaScript 0.07% Smarty 0.40%

ai ggml langchain llama-cpp llm humaneval transformers

can-ai-code's Introduction

Can AI Code?

A self-evaluating interview for AI coding models.

Key Ideas

Interview questions written by humans, test taken by AI
Inference scripts for all common API providers and CUDA-enabled quantization runtimes
Sandbox environment (Docker-based) for untrusted Python and NodeJS code validation
Evaluate effects of prompting techniques and sampling parameters on LLM coding performance
Evaluate LLM coding performance degradation due to quantization

News

8/11 Evaluate Llama-3.1-Instruct 8B HQQ.

8/10 Evaluate Llama-3.1-Instruct 8B and 70B EXL2 and some low-bit GGUFs.

8/1 Evaluate Llama-3-Instruct 8B and 70B with AQLM-2bit. Very slow. 8B is badly damaged.

8/1 Evaluate Mistral-Large-2402 with GPTQ and AWQ, both are excellent.

8/1 Evaluate Llama-3.1-Instruct 8B FP16, 70B GPTQ and AWQ with latest vLLM.

Test Suites

junior-v2 is a multi-language (Python, JavaScript) suite of 12 tests created for this project to test small LLM coding performance. This project provides all necessary components to execute this evaluation.

🚧 humaneval is a Python-only suite of 164 tests created by OpenAI. This project provides template scripts to prepare and execute the humaneval interview, as well as result extraction scripts to help their evaluator. See https://github.com/openai/human-eval for more information.

Click to see Leaderboard on HF Spaces

Click to see Comparisons on HF Spaces

Results data

All model answers and evaluation results are now included inside this repository! Install a recent release of streamlit pip install streamlit==1.23 then streamlit run app.py or streamlit run compare-app.py to run the above webapps locally.

Results HumanEval

🚧 humaneval/ development work is currently paused, there's other projects that are much further along.

See https://github.com/my-other-github-account/llm-humaneval-benchmarks and https://github.com/abacaj/code-eval for large lists of Humaneval LLM benchmark results.

Repository Structure

Interviews

junior-v2/*.yaml - junior coder interview questions (stable)
senior/*.yaml - senior coder interview questions (WIP)

Prepare

prompts/*.txt - LLM prompt templates for the various models
prepare.py - Applies templates to question turning them into language- and model-specific prompts suitable for interview

Prompts

See prompts/ for all prompts references in the leaderboard.

Interview

params/*.json - Sampling hyper-parameter sets (used by all interview scripts)
interview-*.py - Interview scripts

Parameters

See params/ for all params references in the leaderboard.

Evaluate

evaluate.py - Run tests for the generated code in a sandbox and grades each answer
app.py - Streamlit webapp to explore results, see https://huggingface.co/spaces/mike-ravkine/can-ai-code-results

Compare

compare.py - Performs comparisons between evaluations, optionally calling out to an LLM for analysis
compare-app.py - Streamlit webapp to explore comparisons, see https://huggingface.co/spaces/mike-ravkine/can-ai-code-compare
compare/*.yaml - Compare configurations
compare/*.json - Compare results

Interviewers: API

API Runtime	Script
LiteLLM (OpenAI, etc..)	`interview-litellm.py`
OobaBooga/KoboldCpp	`interview-oobabooga.py`
Huggingface Inference	`interview-hfinference.py`
Gradio (HF Spaces)	`interview-gradio.py`

Interviewers: CUDA (Local)

Quantization Type	Script	Dependency
GGUF	`interview-llamacpp.py`	llamacpp or ggml binary
GPTQ (AutoGptQ)	`interview-cuda.py`	auto-gptq==0.6.0
GPTQ (ExLlama)	`interview-cuda.py`	exllama @ 3b013cd53c7d413cf99ca04c7c28dd5c95117c0d
EXL2, GPTQ (ExLlama2)	`interview-cuda.py`	exllamav2 @ 0.0.12
HQQ	`interview-cuda.py`	hqq @ 0.1.1
AWQ, FP16 (vLLM)	`interview-cuda.py`	vllm==0.3.0
CTranslate2	`interview-cuda.py`	ctranslate2>=3.16.0
bitsandbytes	`interview-cuda.py`	bitsandbytes==0.41.3
FP16 (Transformers)	`interview-cuda.py`	transformers==4.37.2

Running on Modal

The recommended modal wrapper is interview_modal_cuda11.py which builds a CUDA11.8 based container with all the above dependencies working. An interview_modal_cuda12.py is also provided, but AutoGPTQ and CTranslate2 are not compatible.

Unfortunately the nature of Modal does not allow command-line selection of eitehr LLM model or runtime engine.

To select models, open the script and uncomment the .run_function(download...) line of choice. Note that only one model can be selected at a time. To add a new model, implement a new download... function.

To select runtime, open the script and uncomment one of the RUNTIME options. Note that for transformers you must also specify QUANT.

Question Format

A set of interview questions is a folder of .yaml files. Each Question is a top-level key:

SanityList:
    Signature: "things()"
    Input: "with no inputs"
    Output: "a list with three values: the number 5, the string 'foobar', the capital city of Spain"
    Fact: "the capital city of Spain is Madrid"
    Description: "List function, see if the model can combine input facts with internal knowledge."
    Checks:
        input_name:
            assert: "f.name"
            eq: "things"

In this example SanityList is the name of the interview question.

The first four fields are used by prepare.py to create the interview:

Signature is the desired function signature
Input describes the function inputs
Output describes the function outputs
Fact is optional and provides any context that is required to correctly perform the task

These 4 variables along with language (either python or javascript) are used to expand templates in prompts/.

The last two fields are used by evaluate.py to judge the results:

Description is a human-readable explanation of why this test is useful
Checks defines the expected behavior of the output.

Checks and the 'f' object

Each check has a name, some assert value (python code) and an expected eq value.

The f object represents the sandbox view of the function. Static analysis is performed on the function signature to extract the f.name and f.args fields, while f.call allows for function evaluation.

Output formats

All scripts output automatically named .ndjson files to the results/ directory.

Each stage outputs a super-set of fields from the stage before it, so its possible to feed eval/interview back to interview (to re-run the questions) or back to eval (to re-run the eval).

prepare

results/prepare_{interview}_{languages}_{template}.ndjson

Fields:

all Question fields (Signature, Input, Output, Fact, Description)
name
language
prompt

interview

results/interview_{interview}_{languages}_{template}_{templateout}_{params}_{model}_{timestamp}.ndjson

Fields:

all prepare fields
model
params
answer
runtime

eval

results/eval_{interview}_{languages}_{template}_{templateout}_{params}_{model}_{timestamp}.ndjson

Fields:

all eval fields
status
passed
total
checks

Roadmap / Future Work

Development of a Senior coder test suite
Open Model Request issues

can-ai-code's People

Contributors

Stargazers

Watchers

can-ai-code's Issues

Add Minotaur on the board

Hi guys love the work.

I have been testing TheBloke/minotaur-15B-GGML ans is pretty solid you can test TheBloke/minotaur-15B-GPTQ

Comparison: Vicuna-7B bitsandbytes INT8 vs NF4 vs FP4

Need to run the interview for nf4 still and then publish a comparison.

Evaluate sahil2801/replit-code-instruct-glaive

Finetuned replit-code 3B model with apparently very high performance.

Model: https://huggingface.co/sahil2801/replit-code-instruct-glaive

Space: https://huggingface.co/spaces/teknium/sahil2801-replit-code-instruct-glaive

Prompt format (from space):

Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
Write a program to perform the given task.

###Input:
{prompt}

### Response:

Evaluate tiiuae/falcon-40b-instruct

Original: https://huggingface.co/tiiuae/falcon-40b-instruct

BNB-NF4: https://modal.com/docs/guide/ex/falcon_bitsandbytes

AWQ: https://huggingface.co/abhinavkulkarni/tiiuae-falcon-40b-instruct-w4-g128-awq

GPTQ-3bit: https://huggingface.co/TheBloke/falcon-40b-instruct-3bit-GPTQ

GPTQ-4bit: https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ

GGCC (GGML fork): https://huggingface.co/TheBloke/falcon-40b-instruct-GGML

Comparison: Falcon 40B GGML 2_k vs 3_k vs 4_k vs 5_k

We have the data already, this should be a fun comparison.

Interview outputs should always be the model answer only

Opening this as a follow-up to one of the sub-issues identified by @Maykeye yesterday in #5

The interview-llamacpp.sh (and I imagine also interview-oobabooga.py ?) interviews currently output both the prompt and the answer, which will confuse the code extractor when those prompts have ``` in them.

Most robust solution here is to remove the prompt from the output at interview time; the code extractor's task is already fairly difficult, lets not make it harder then it needs to be.

Evaluate NousResearch/Nous-Hermes-Llama2

Evaluate LoupGarou/WizardCoder-Guanaco-15B-V1.1

WizardCoder further refined on a modified Guanaco dataset

https://huggingface.co/LoupGarou/WizardCoder-Guanaco-15B-V1.1

https://huggingface.co/TheBloke/WizardCoder-Guanaco-15B-V1.1-GPTQ

Does "openai-chatgpt" on huggingface leader board means ChatGPT3.5?

Does "openai-chatgpt" on huggingface leader board means ChatGPT3.5 or 4?

Thank you for this great project.

Investigate prompting techniques

https://twitter.com/s_tworkowski/status/1679132557906964480?t=L7BXiAl8V5TqW3el9PbFlg&s=19

This prompt is claimed to boost coding performance by 50%

Paper: https://arxiv.org/abs/2307.05337

Models that write multiple functions in return to misnamed prompts are treated unfairly

Considering the following response from WizardLM 13b:

Sure! Here's an example JavaScript function fofx that takes in two numbers as arguments (x and banana) and returns their greatest common divisor using Euclid's algorithm:

function gcd(a, b) {
  if (b === 0) return a; // base case for when b is zero
  return gcd(b, a % b); // recursive call until b becomes zero
}

function fofx(x, banana) {
  const result = gcd(x, banana);
  console.log(`The GCD of ${x} and ${banana} is ${result}.`);
  return result;
}

With the current extract_function_info we'd get two functions output, but sandbox.py only ever looks at the first one so it would see gcd(a,b) as the entrypoint and fail 3 of the tests because both the function and arguments are named incorrectly - but this is clearly unfair since it did generate the correct function, and it did name the arguments correctly.

Evaluate openchat/opencoderplus

fp16 starcoder-based model: https://huggingface.co/openchat/opencoderplus

GGML quants now available: https://huggingface.co/cooki3monster/openchat_opencoderplus-GGML

Evaluate Falcon-Instruct-7B @ AWQ-INT4

https://github.com/mit-han-lab/llm-awq

AWQ Weights: https://huggingface.co/datasets/mit-han-lab/awq-model-zoo/tree/main

It looks like the AWQ Weights need to be merged with the original model to produce the quantized pt: https://github.com/mit-han-lab/llm-awq#usage

GCD: "denominator" not "demoninator"

damnit demons.

airoboros-1.4 GPTQ writes JS functions that confuse the code extractor

The 33B in javascript in particular seems to have a really weird return style

var substrCount = (function () {
    'use strict';

    //substrCount :: String -> String -> Int
    return function substrCount(str, substr) {
    ....
    }
})();

The code_extractor does not currently handle this so it gets a 0 for the test.

[SanitySecretIdentityMap] Spider-Man vs Spiderman

Several models fail the secret identifies test by replacing Spiderman with Spider-Man, which upon further investigation makes sense because that is the correct spelling of this character's name. This prompt and the associated check should be fixed.

Test bench for code extractor

We have over 100 interviews now, more then enough data to put together a regression test for extract.py to make sure we can reliably handle anything these models throw at us. See also #29

[SanityList] Seperately check each element of the output

If the model didn't get the capital of spain right, it will currently fail the test even if it output a list and the other two elements are correct.

Split into 4 checks:

that the result is a list with 3 elements
value checks for each of those elements

Evaluate mrm8488/falcoder-7b and mrm8488/llama-2-coder-7b

fp16: https://huggingface.co/mike-ravkine/falcoder-7b-GGML

GGML: https://huggingface.co/mike-ravkine/falcoder-7b-GGML

This is an experimental model that requires a fork of ggml: https://github.com/jploski/ggml/tree/falcon40b

It failed my initial sanity checks, did not produce a single function when prompted as usual.

Koala outputs html <code>

Koala-13b seems to have been trained on a code dataset with html responces, it answers like this:

<pre><div><div>Copy code</div><div><code><span>function</span> <span>secretIdentities</span>() {
  <span>return</span> {
    Superman: 'Clark Kent',
    Batman: 'Bruce Wayne',
    SpiderA: 'Peter Parker'
  };
}
</code></div></div></pre>

Right now only interview-starchat.py handles this, it should be moved to extract_code

Evaluate interesting falcon-7b finetunes

Openassistant refined falcon: https://huggingface.co/OpenAssistant/falcon-7b-sft-top1-696

CodeAlpaca-20k refined falcon: https://huggingface.co/mrm8488/falcon-7b-ft-codeAlpaca_20k-v2

Models that generate python test code using input() cause timeouts

lets try to monkey-patch input() to return 0 instead of hanging and see how it goes

bad test code is an irritating problem

Evaluate Salesforce/codet5p family

Model: https://huggingface.co/Salesforce/codet5p-16b

this is the precursor to codegen

Links for AI coders

Would be helpful if (clickable) links for each of these tool is populated in this table.

https://huggingface.co/spaces/mike-ravkine/can-ai-code-results

Evaluate WizardCoder @ GPTQ-4bit

Model: https://huggingface.co/TheBloke/WizardCoder-15B-1.0-GPTQ

Merge interview-starchat and interview-wizardcoder into a generic interview-gradio

For each model we just need:

gradio url
parameter list and fn_index
to know if we have to do file i/o for input and outputs

Evaluate Salesforce/codegen25-7b-multi and mono

Regular function signature prompting: https://huggingface.co/Salesforce/codegen25-7b-mono#causal-sampling-code-autocompletion

Infill prompting: https://huggingface.co/Salesforce/codegen25-7b-mono#infill-sampling

def format(prefix, suffix):
  return prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

Evaluate.py: code extraction doesn't extract code marked by ```js

I tried to run on Guanaco-33b by modyfing interview-llamacpp.sh, and hit the issue with evaluator

Consider this

In [4]: gu=open("results/Guanaco-33b/FactorialZeroShot-javascript.txt").read()

In [5]: print(gu)
 When asked to write code, return only the final function enclosed in a ``` block.

### Instruction: Write a javascript function to compute the factorial of input n.

### Assistant: Here is an example implementation of a JavaScript function that computes the factorial of a non-negative integer `n`:
```js
function fact(n) {
  if (n === 0) {
    return 1;
  } else {
    return n * fact(n - 1);
  }
}
```
This function takes a single argument `n`, which represents the number for which you want to calculate the factorial. The function first checks if `n` is equal to 0. If it is, then the function returns 1, as the factorial of 0 is defined to be 1. Otherwise, the function calculates the factorial recursively by multiplying `n` with the result of calling the same function on `n - 1`.

If you call to extract code you will get

In [7]: print(extract_code(gu))
block.

### Instruction: Write a javascript function to compute the factorial of input n.

### Assistant: Here is an example implementation of a JavaScript function that computes the factorial of a non-negative integer `n`:

There are two issues at play:

``` is a part of the included prompt.

extract_code will always find it, because it always present.

So extract code firstly things that ''' in prompt is a start of the actual code block,
then it mistakes start of the real codeblock for the end of the codeblock

extract_code doesn't support '''js. Only '''javascript
E.g. in github Js alias is valid for markdown of javascript

So even if 1) were fixed, current extract_code wouldn't parse the answer correctly: it would return 'js\nfunction fact(n) {\n if (n === 0) {\n return 1;\n } else {\n return n * fact(n - 1);\n }\n}' without removing js tag

Maybe whole line with first ''' should be erased?

(On similar note, '''py seems to be valid language name for python, at least as github markdown engine is concerned)

(Also evaluate.py has no __name__ == __main__ guard), which makes importing it from ipython impossible without extracting the function first)

Evaluate openlm-research/open_llama_7b_v2

https://huggingface.co/openlm-research/open_llama_7b_v2

Trained on starcoder data so should be a contender

Evaluate TokenBender/llama2-7b-chat-hf-codeCherryPop-qLoRA-merged

This is a llama2 7B HF chat model fine-tuned on 122k code instructions. In my early experiments it seems to be doing very well.

Evaluate InCoder

https://github.com/dpfried/incoder

https://huggingface.co/facebook/incoder-6B

Evaluate replit/replit-code-v1-3b

With the impressive performance of #25 it probably makes sense to evaluate the original model as well, these <3B coders are very interesting.

Model: https://huggingface.co/replit/replit-code-v1-3b

Interesting note on decoding here from the model page:

# decoding, clean_up_tokenization_spaces=False to ensure syntactical correctness
generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_code)

* trust_remote_code=True is passed to the from_pretrained method because ReplitLM is not a class in the [Transformers](https://huggingface.co/docs/transformers/index) library.
* clean_up_tokenization_spaces=False is meant to avoid removing spaces in the output, because that would affect the syntactical correctness of the generated code.

May need to explore clean_up_tokenization_spaces more closely.

Also an interesting note on post-processing here: https://huggingface.co/replit/replit-code-v1-3b#post-processing for the most part our extractor handles this.

Evaluate nickrosh/Evol-Replit-v1

A replit/replit-code-v1-3b finetune on Evol-Instruct-Code-80k-v1 that achieves 31.1% on humaneval

https://huggingface.co/nickrosh/Evol-Replit-v1

https://github.com/nickrosh/evol-teacher

Evaluate upstage/llama-30b-instruct

This is currently the #1 model on https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard beating the 70b llama2.

[FibonacciZeroShot] Many models interpret this request as asking for the n-th element

The reason FibonacciListZeroShot was added is that several models couldn't handle the FibonacciZeroShot prompt and returned the n-th element instead. It probably makes more sense for FibonacciZeroShot to explicitly ask for the n-th element and be checked as such, then we get both cases covered robustly.

Open LLaMA open instruct 7b

Hi, I'm curious about the performance of the OpenLLaMA 7b (open instruct) model and it's quantization forms:

Any experiments on this end so far? I'll clone this repo today and give it a try to benchmark these, not sure how far I'll get though.

Starchat (various issues)

Python:

This model seems to output pre-indented code, which is not valid. Should we be lenient and auto-correct this? No other model exhibits this behavior.

JavaScript:

This model outputs arrow functions, which is prefectly valid but not currently parsed correctly by extract_function_info

New quantization: SqueezeLLM

Code: https://github.com/SqueezeAILab/SqueezeLLM

Benchmark.py is as close to an inference engine as this thing has but it's doesn't bother to run the token decoder.

Evaluate mosaicml/mpt-30b-instruct and its quants

FP16: https://huggingface.co/mosaicml/mpt-30b-instruct

INT8: https://huggingface.co/Abzu/mpt-30b-instruct-q8

GGML: https://huggingface.co/TheBloke/mpt-30B-instruct-GGML

AWQ: https://huggingface.co/abhinavkulkarni/mosaicml-mpt-30b-instruct-w4-g128-awq

GPTQ (note its the chat not instruct model and TheBloke couldnt get it working well): https://huggingface.co/TheBloke/MPT-30B-Chat-GPTQ-experimental

Leaderboard: Sorting by the 'size' column doesnt work

Text sort instead of numeric, not a very useful result.

extract_code fails when model forgets opening ``` but remembers the closing one

falcon-instruct-7B AWQ spit out:

> function fofx(x, banana) {
>  let gcd = x;
>  let result = 0;
>  while (gcd > 0) {
>    gcd = gcd * x;
>    result += x;
>  }
>  return result;
> }
>```

This fails code extraction as the trailing marker is interpreted as the opening one and no closing marker is found so it extract a blank string.

"fibonacci" not "fibbonaci"

llama2 chewed me out:

Hello! I'd be happy to help you with your question. However, I notice that the term "fibbonaci" might be a typo, and I assume you meant "Fibonacci."

This would change the prompts and require a total do-over of evaluations.

the-crypt-keeper / can-ai-code Goto Github PK

can-ai-code's Introduction

Can AI Code?

Key Ideas

News

Test Suites

Results data

Results HumanEval

Repository Structure

Interviews

Prepare

Prompts

Interview

Parameters

Evaluate

Compare

Interviewers: API

Interviewers: CUDA (Local)

Running on Modal

Question Format

Checks and the 'f' object

Output formats

prepare

interview

eval

Roadmap / Future Work

can-ai-code's People

Contributors

Stargazers

Watchers

Forkers

can-ai-code's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs