Comments (1)
I ran it and got the same result you did. I made a fix and was able to run it; see #352 for my proposed fixes to `docs/custom-eval.md`.

TL;DR: there are two issues in `docs/custom-eval.md`: (1) the training samples have no `problem` or `answer` keys, but `eval_sample()` assumes they do; and (2) `eval_sample()` indexes into a dict when each sample is actually a list.
Observe that the documentation states:

`docs/custom-eval.md`:

```sh
echo -e '[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]\n[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]' > /tmp/train.jsonl
echo -e '[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]\n[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]' > /tmp/test.jsonl
```
which leads to the creation of the training and test data:

`/tmp/train.jsonl`:

```json
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
```

`/tmp/test.jsonl`:

```json
[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]
[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]
```
Now observe `eval_sample()`:

```python
def eval_sample(self, test_sample, rng: random.Random):
    """
    ...
    """
    stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
    prompt = [
        {"role": "system", "content": "Solve the following math problems"},
    ]
    for i, sample in enumerate(stuffing + [test_sample]):
        if i < len(stuffing):
            prompt += [
                {"role": "system", "content": sample["problem"], "name": "example_user"},
                {"role": "system", "content": sample["answer"], "name": "example_assistant"},
            ]
        else:
            prompt += [{"role": "user", "content": sample["problem"]}]
    evals.check_sampled_text(self.model_spec, prompt, expected=sample["answer"])
```
Specifically, the loop `for i, sample in enumerate(stuffing + [test_sample]):` iterates over both the few-shot training samples and the test sample, and reads `sample["problem"]` and `sample["answer"]` from each. Notice that neither a `problem` key nor an `answer` key appears anywhere in the samples:

`/tmp/train.jsonl`:

```json
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
```

`/tmp/test.jsonl`:

```json
[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]
[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]
```
Hence, one solution is to make those keys exist in the samples. Specifically, change `content` to `problem` or `answer` in the intended places:

`docs/custom-eval.md`:

```sh
echo -e '[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]\n[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]' > /tmp/train.jsonl
echo -e '[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]\n[{"role": "system", "problem": "5*20=", "name": "example_user"}, {"role": "system", "answer": "100", "name": "example_assistant"}]' > /tmp/test.jsonl
```

`/tmp/train.jsonl`:

```json
[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]
[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]
```

`/tmp/test.jsonl`:

```json
[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]
[{"role": "system", "problem": "5*20=", "name": "example_user"}, {"role": "system", "answer": "100", "name": "example_assistant"}]
```
To solve the second issue, recall how the values are accessed: `sample["problem"]` and `sample["answer"]`. These expressions index into a dictionary; however, each JSON line is a list of two dictionaries, e.g.:

`/tmp/train.jsonl`:

```json
[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]
[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]
```

Hence, assuming we keep the file format as-is, we must index into the list first.
Thus, `eval_sample()` can be updated to:

```python
def eval_sample(self, test_sample, rng: random.Random):
    """
    Called by the `eval_all_samples` method to evaluate a single sample.

    ARGS
    ====
    `test_sample`: a line from the JSONL test file
    `rng`: should be used for any randomness that is needed during evaluation

    This method does the following:
    1. Generate a prompt that contains the task statement, a few examples, and the test question.
    2. Check if the model generates the correct answer.
    """
    stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
    prompt = [
        {"role": "system", "content": "Solve the following math problems"},
    ]
    for i, sample in enumerate(stuffing + [test_sample]):
        if i < len(stuffing):
            prompt += [
                {"role": "system", "content": sample[0]["problem"], "name": "example_user"},
                {"role": "system", "content": sample[1]["answer"], "name": "example_assistant"},
            ]
        else:
            prompt += [{"role": "user", "content": sample[0]["problem"]}]
    evals.check_sampled_text(self.model_spec, prompt, expected=sample[1]["answer"])
```

Note the added `[0]` and `[1]` indices, which access the list before the dictionary inside it. This makes the code consistent with the samples, where each line is a list whose first element holds the problem and whose second element holds the answer.
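To convince myself the fixed indexing assembles the intended prompt, here is a minimal standalone reproduction of the prompt-building loop, with hard-coded samples and without the `evals.check_sampled_text` call (which needs an API key):

```python
import json
import random

train_lines = [
    '[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]',
    '[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]',
]
test_line = '[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]'

train_samples = [json.loads(line) for line in train_lines]
test_sample = json.loads(test_line)

rng = random.Random(0)
stuffing = rng.sample(train_samples, 2)  # train_samples_per_prompt = 2

prompt = [{"role": "system", "content": "Solve the following math problems"}]
for i, sample in enumerate(stuffing + [test_sample]):
    if i < len(stuffing):
        prompt += [
            {"role": "system", "content": sample[0]["problem"], "name": "example_user"},
            {"role": "system", "content": sample[1]["answer"], "name": "example_assistant"},
        ]
    else:
        prompt += [{"role": "user", "content": sample[0]["problem"]}]

print(json.dumps(prompt, indent=2))
# The final message is the test question; sample[1]["answer"] ("50") is what
# evals.check_sampled_text would compare the model's reply against.
```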
I was able to use the debugger in VS Code by creating a folder called `.vscode` at the root of the project directory and putting a `launch.json` file in it with the following configuration:

`.vscode/launch.json`:

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Run custom arithmetic eval",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/evals/cli/oaieval.py",
            "args": [
                "gpt-3.5-turbo",
                "arithmetic"
            ],
            "console": "integratedTerminal"
        }
    ]
}
```
After running it:

```
user@user:~/evals$ cd /home/user/evals ; /usr/bin/env /home/user/evals/env/bin/python /home/user/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 50149 -- /home/user/evals/evals/cli/oaieval.py gpt-3.5-turbo arithmetic
[2023-03-19 14:11:00,322] [registry.py:145] Loading registry from /home/user/evals/evals/registry/evals
[2023-03-19 14:11:00,440] [registry.py:145] Loading registry from /home/user/.evals/evals
[2023-03-19 14:11:00,987] [oaieval.py:178] Run started: 230319181100THJXSEOF
[2023-03-19 14:11:00,990] [eval.py:30] Evaluating 2 samples
[2023-03-19 14:11:01,018] [eval.py:136] Running in threaded mode with 10 threads!
[2023-03-19 14:11:01,911] [record.py:320] Final report: {'accuracy': 1.0}. Logged to /tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl
[2023-03-19 14:11:01,912] [oaieval.py:209] Final report:
[2023-03-19 14:11:01,912] [oaieval.py:211] accuracy: 1.0
[2023-03-19 14:11:02,649] [record.py:309] Logged 6 rows of events to /tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl: insert_time=1.595ms
```
I got an `accuracy` metric of 1.0, with the following output (reformatted here as pretty-printed JSON rather than raw JSONL for readability):

`/tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl`:

```json
{
  "spec": {
    "model_name": "gpt-3.5-turbo",
    "model_names": {
      "completions": [
        "gpt-3.5-turbo"
      ]
    },
    "eval_name": "arithmetic.dev.match-v1",
    "base_eval": "arithmetic",
    "split": "dev",
    "run_config": {
      "model_specs": {
        "completions_": [
          {
            "name": "gpt-3.5-turbo",
            "model": "gpt-3.5-turbo",
            "is_chat": true,
            "encoding": null,
            "organization": null,
            "api_key": null,
            "extra_options": {},
            "headers": {},
            "strip_completion": true,
            "n_ctx": 4096,
            "format": null,
            "key": null,
            "group": null
          }
        ],
        "embedding_": null,
        "ranking_": null
      },
      "eval_spec": {
        "cls": "evals.elsuite.arithmetic:Arithmetic",
        "args": {
          "train_jsonl": "/tmp/train.jsonl",
          "test_jsonl": "/tmp/test.jsonl"
        },
        "key": "arithmetic.dev.match-v1",
        "group": "arithmetic"
      },
      "seed": 20220722,
      "max_samples": null,
      "command": "/home/user/evals/evals/cli/oaieval.py gpt-3.5-turbo arithmetic",
      "initial_settings": {
        "visible": true
      }
    },
    "created_by": "",
    "run_id": "230319181100THJXSEOF",
    "created_at": "2023-03-19 18:11:00.985636"
  }
}
{
  "final_report": {
    "accuracy": 1.0
  }
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 0,
  "sample_id": "arithmetic.dev.1",
  "type": "raw_sample",
  "data": [
    {
      "role": "system",
      "problem": "4*4=",
      "name": "example_user"
    },
    {
      "role": "system",
      "answer": "16",
      "name": "example_assistant"
    }
  ],
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.020114+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 1,
  "sample_id": "arithmetic.dev.0",
  "type": "raw_sample",
  "data": [
    {
      "role": "system",
      "problem": "2+2=",
      "name": "example_user"
    },
    {
      "role": "system",
      "answer": "4",
      "name": "example_assistant"
    }
  ],
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.025320+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 2,
  "sample_id": "arithmetic.dev.0",
  "type": "sampling",
  "data": {
    "prompt": [
      {
        "role": "system",
        "content": "Solve the following math problems"
      },
      {
        "role": "system",
        "content": "5*20=",
        "name": "example_user"
      },
      {
        "role": "system",
        "content": "100",
        "name": "example_assistant"
      },
      {
        "role": "system",
        "content": "48+2=",
        "name": "example_user"
      },
      {
        "role": "system",
        "content": "50",
        "name": "example_assistant"
      },
      {
        "role": "user",
        "content": "2+2="
      }
    ],
    "sampled": "4",
    "options": [
      "4"
    ],
    "picked": "4",
    "expected": [
      "4"
    ],
    "match": true,
    "metadata": {
      "completion_id": "chatcmpl-6vrmjgKmc1MtZhavJOpT6nwCwkaft",
      "model": "gpt-3.5-turbo-0301"
    }
  },
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.885894+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 3,
  "sample_id": "arithmetic.dev.0",
  "type": "match",
  "data": {
    "correct": true,
    "expected": "4",
    "picked": "4",
    "sampled": "4"
  },
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.885968+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 4,
  "sample_id": "arithmetic.dev.1",
  "type": "sampling",
  "data": {
    "prompt": [
      {
        "role": "system",
        "content": "Solve the following math problems"
      },
      {
        "role": "system",
        "content": "5*20=",
        "name": "example_user"
      },
      {
        "role": "system",
        "content": "100",
        "name": "example_assistant"
      },
      {
        "role": "system",
        "content": "48+2=",
        "name": "example_user"
      },
      {
        "role": "system",
        "content": "50",
        "name": "example_assistant"
      },
      {
        "role": "user",
        "content": "4*4="
      }
    ],
    "sampled": "16",
    "options": [
      "16"
    ],
    "picked": "16",
    "expected": [
      "16"
    ],
    "match": true,
    "metadata": {
      "completion_id": "chatcmpl-6vrmjWDVgCDKZNpKt6j9l7OrAjv56",
      "model": "gpt-3.5-turbo-0301"
    }
  },
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.903103+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 5,
  "sample_id": "arithmetic.dev.1",
  "type": "match",
  "data": {
    "correct": true,
    "expected": "16",
    "picked": "16",
    "sampled": "16"
  },
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.903153+00:00"
}
```
Thanks for bringing this up!