Comments (1)
I ran it and got the same result you did. I made a fix and was able to run it; see #352 for my proposed fixes to `docs/custom-eval.md`.

TL;DR: there are two issues in `docs/custom-eval.md`: (1) the training samples have no `problem` or `answer` keys, but `eval_sample()` assumes they do; and (2) `eval_sample()` indexes into a dict when each sample is actually a list.
Observe that the documentation states:

`docs/custom-eval.md`:

```sh
echo -e '[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]\n[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]' > /tmp/train.jsonl
echo -e '[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]\n[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]' > /tmp/test.jsonl
```
which leads to the creation of the training and test data:

`/tmp/train.jsonl`:

```json
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
```

`/tmp/test.jsonl`:

```json
[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]
[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]
```
Now observe `eval_sample()`:

```python
def eval_sample(self, test_sample, rng: random.Random):
    """
    ...
    """
    stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
    prompt = [
        {"role": "system", "content": "Solve the following math problems"},
    ]
    for i, sample in enumerate(stuffing + [test_sample]):
        if i < len(stuffing):
            prompt += [
                {"role": "system", "content": sample["problem"], "name": "example_user"},
                {"role": "system", "content": sample["answer"], "name": "example_assistant"},
            ]
        else:
            prompt += [{"role": "user", "content": sample["problem"]}]
    evals.check_sampled_text(self.model_spec, prompt, expected=sample["answer"])
```
Specifically, the loop `for i, sample in enumerate(stuffing + [test_sample]):` iterates over both the few-shot training samples and the test sample, and reads `sample["problem"]` and `sample["answer"]` from each. Notice that neither a `problem` key nor an `answer` key appears anywhere in the samples:

`/tmp/train.jsonl`:

```json
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
```

`/tmp/test.jsonl`:

```json
[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]
[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]
```
Hence, one solution is to make those keys exist in the samples. Specifically, change `content` to `problem` or `answer` in the intended places:

`docs/custom-eval.md`:

```sh
echo -e '[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]\n[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]' > /tmp/train.jsonl
echo -e '[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]\n[{"role": "system", "problem": "5*20=", "name": "example_user"}, {"role": "system", "answer": "100", "name": "example_assistant"}]' > /tmp/test.jsonl
```

`/tmp/train.jsonl`:

```json
[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]
[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]
```

`/tmp/test.jsonl`:

```json
[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]
[{"role": "system", "problem": "5*20=", "name": "example_user"}, {"role": "system", "answer": "100", "name": "example_assistant"}]
```
To solve the second issue, recall how the values are accessed: `sample["problem"]` and `sample["answer"]`. These expressions index into a dictionary; however, each JSON line is a list of two dictionaries, e.g.:

`/tmp/train.jsonl`:

```json
[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]
[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]
```

Hence, assuming we keep the file format as-is, we must index into the list first.
Thus, `eval_sample()` can be updated to:

```python
def eval_sample(self, test_sample, rng: random.Random):
    """
    Called by the `eval_all_samples` method to evaluate a single sample.

    ARGS
    ====
    `test_sample`: a line from the JSONL test file
    `rng`: should be used for any randomness that is needed during evaluation

    This method does the following:
    1. Generate a prompt that contains the task statement, a few examples, and the test question.
    2. Check if the model generates the correct answer.
    """
    stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
    prompt = [
        {"role": "system", "content": "Solve the following math problems"},
    ]
    for i, sample in enumerate(stuffing + [test_sample]):
        if i < len(stuffing):
            prompt += [
                {"role": "system", "content": sample[0]["problem"], "name": "example_user"},
                {"role": "system", "content": sample[1]["answer"], "name": "example_assistant"},
            ]
        else:
            prompt += [{"role": "user", "content": sample[0]["problem"]}]
    evals.check_sampled_text(self.model_spec, prompt, expected=sample[1]["answer"])
```

Note the added `[0]` and `[1]` indices, which access the list before the dictionary inside it. This makes the code consistent with the samples, where each line is a list whose first element holds the problem and whose second element holds the answer.
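To convince myself the fixed indexing assembles the intended prompt, here is a minimal standalone reproduction of the prompt-building loop, with hard-coded samples and without the `evals.check_sampled_text` call (which needs an API key):

```python
import json
import random

train_lines = [
    '[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]',
    '[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]',
]
test_line = '[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]'

train_samples = [json.loads(line) for line in train_lines]
test_sample = json.loads(test_line)

rng = random.Random(0)
stuffing = rng.sample(train_samples, 2)  # train_samples_per_prompt = 2

prompt = [{"role": "system", "content": "Solve the following math problems"}]
for i, sample in enumerate(stuffing + [test_sample]):
    if i < len(stuffing):
        prompt += [
            {"role": "system", "content": sample[0]["problem"], "name": "example_user"},
            {"role": "system", "content": sample[1]["answer"], "name": "example_assistant"},
        ]
    else:
        prompt += [{"role": "user", "content": sample[0]["problem"]}]

print(json.dumps(prompt, indent=2))
# The final message is the test question; sample[1]["answer"] ("50") is what
# evals.check_sampled_text would compare the model's reply against.
```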
I was able to use the debugger in VS Code by creating a folder called `.vscode` at the root of the project directory and putting a `launch.json` file in it with the following configuration:

`.vscode/launch.json`:

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Run custom arithmetic eval",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/evals/cli/oaieval.py",
            "args": [
                "gpt-3.5-turbo",
                "arithmetic"
            ],
            "console": "integratedTerminal"
        }
    ]
}
```
After running it:

```
user@user:~/evals$ cd /home/user/evals ; /usr/bin/env /home/user/evals/env/bin/python /home/user/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 50149 -- /home/user/evals/evals/cli/oaieval.py gpt-3.5-turbo arithmetic
[2023-03-19 14:11:00,322] [registry.py:145] Loading registry from /home/user/evals/evals/registry/evals
[2023-03-19 14:11:00,440] [registry.py:145] Loading registry from /home/user/.evals/evals
[2023-03-19 14:11:00,987] [oaieval.py:178] Run started: 230319181100THJXSEOF
[2023-03-19 14:11:00,990] [eval.py:30] Evaluating 2 samples
[2023-03-19 14:11:01,018] [eval.py:136] Running in threaded mode with 10 threads!
[2023-03-19 14:11:01,911] [record.py:320] Final report: {'accuracy': 1.0}. Logged to /tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl
[2023-03-19 14:11:01,912] [oaieval.py:209] Final report:
[2023-03-19 14:11:01,912] [oaieval.py:211] accuracy: 1.0
[2023-03-19 14:11:02,649] [record.py:309] Logged 6 rows of events to /tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl: insert_time=1.595ms
```
I got an `accuracy` metric of 1.0, with the following output (reformatted here as pretty-printed JSON rather than raw JSONL for readability):

`/tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl`:

```json
{
  "spec": {
    "model_name": "gpt-3.5-turbo",
    "model_names": {
      "completions": [
        "gpt-3.5-turbo"
      ]
    },
    "eval_name": "arithmetic.dev.match-v1",
    "base_eval": "arithmetic",
    "split": "dev",
    "run_config": {
      "model_specs": {
        "completions_": [
          {
            "name": "gpt-3.5-turbo",
            "model": "gpt-3.5-turbo",
            "is_chat": true,
            "encoding": null,
            "organization": null,
            "api_key": null,
            "extra_options": {},
            "headers": {},
            "strip_completion": true,
            "n_ctx": 4096,
            "format": null,
            "key": null,
            "group": null
          }
        ],
        "embedding_": null,
        "ranking_": null
      },
      "eval_spec": {
        "cls": "evals.elsuite.arithmetic:Arithmetic",
        "args": {
          "train_jsonl": "/tmp/train.jsonl",
          "test_jsonl": "/tmp/test.jsonl"
        },
        "key": "arithmetic.dev.match-v1",
        "group": "arithmetic"
      },
      "seed": 20220722,
      "max_samples": null,
      "command": "/home/user/evals/evals/cli/oaieval.py gpt-3.5-turbo arithmetic",
      "initial_settings": {
        "visible": true
      }
    },
    "created_by": "",
    "run_id": "230319181100THJXSEOF",
    "created_at": "2023-03-19 18:11:00.985636"
  }
}
{
  "final_report": {
    "accuracy": 1.0
  }
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 0,
  "sample_id": "arithmetic.dev.1",
  "type": "raw_sample",
  "data": [
    {
      "role": "system",
      "problem": "4*4=",
      "name": "example_user"
    },
    {
      "role": "system",
      "answer": "16",
      "name": "example_assistant"
    }
  ],
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.020114+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 1,
  "sample_id": "arithmetic.dev.0",
  "type": "raw_sample",
  "data": [
    {
      "role": "system",
      "problem": "2+2=",
      "name": "example_user"
    },
    {
      "role": "system",
      "answer": "4",
      "name": "example_assistant"
    }
  ],
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.025320+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 2,
  "sample_id": "arithmetic.dev.0",
  "type": "sampling",
  "data": {
    "prompt": [
      {
        "role": "system",
        "content": "Solve the following math problems"
      },
      {
        "role": "system",
        "content": "5*20=",
        "name": "example_user"
      },
      {
        "role": "system",
        "content": "100",
        "name": "example_assistant"
      },
      {
        "role": "system",
        "content": "48+2=",
        "name": "example_user"
      },
      {
        "role": "system",
        "content": "50",
        "name": "example_assistant"
      },
      {
        "role": "user",
        "content": "2+2="
      }
    ],
    "sampled": "4",
    "options": [
      "4"
    ],
    "picked": "4",
    "expected": [
      "4"
    ],
    "match": true,
    "metadata": {
      "completion_id": "chatcmpl-6vrmjgKmc1MtZhavJOpT6nwCwkaft",
      "model": "gpt-3.5-turbo-0301"
    }
  },
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.885894+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 3,
  "sample_id": "arithmetic.dev.0",
  "type": "match",
  "data": {
    "correct": true,
    "expected": "4",
    "picked": "4",
    "sampled": "4"
  },
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.885968+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 4,
  "sample_id": "arithmetic.dev.1",
  "type": "sampling",
  "data": {
    "prompt": [
      {
        "role": "system",
        "content": "Solve the following math problems"
      },
      {
        "role": "system",
        "content": "5*20=",
        "name": "example_user"
      },
      {
        "role": "system",
        "content": "100",
        "name": "example_assistant"
      },
      {
        "role": "system",
        "content": "48+2=",
        "name": "example_user"
      },
      {
        "role": "system",
        "content": "50",
        "name": "example_assistant"
      },
      {
        "role": "user",
        "content": "4*4="
      }
    ],
    "sampled": "16",
    "options": [
      "16"
    ],
    "picked": "16",
    "expected": [
      "16"
    ],
    "match": true,
    "metadata": {
      "completion_id": "chatcmpl-6vrmjWDVgCDKZNpKt6j9l7OrAjv56",
      "model": "gpt-3.5-turbo-0301"
    }
  },
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.903103+00:00"
}
{
  "run_id": "230319181100THJXSEOF",
  "event_id": 5,
  "sample_id": "arithmetic.dev.1",
  "type": "match",
  "data": {
    "correct": true,
    "expected": "16",
    "picked": "16",
    "sampled": "16"
  },
  "created_by": "",
  "created_at": "2023-03-19 18:11:01.903153+00:00"
}
```
Thanks for bringing this up!