openai / evals

13.8K stars, 257 watchers, 2.5K forks, 6.54 MB

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

License: Other

Makefile 0.01% Python 79.31% JavaScript 0.34% Jupyter Notebook 13.47% Shell 1.63% HTML 5.22% Dockerfile 0.04%

evals's Introduction

OpenAI Evals

Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals which represent the common LLM patterns in your workflow without exposing any of that data publicly.

If you are building with LLMs, creating high quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time intensive to understand how different model versions might affect your use case. In the words of OpenAI's President Greg Brockman:

https://x.com/gdb/status/1733553161884127435?s=20

Setup

To run evals, you will need to set up and specify your OpenAI API key. After you obtain an API key, specify it using the OPENAI_API_KEY environment variable. Please be aware of the costs associated with using the API when running evals. You can also run and create evals using Weights & Biases.
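For example, the key can be provided through the environment before any evals code runs (the value below is a placeholder; a shell export works just as well):

import os

# Placeholder key; substitute your own and keep it out of version control.
os.environ["OPENAI_API_KEY"] = "sk-..."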

Minimum Required Version: Python 3.9

Downloading evals

Our evals registry is stored using Git-LFS. Once you have downloaded and installed LFS, you can fetch the evals (from within your local copy of the evals repo) with:

cd evals
git lfs fetch --all
git lfs pull

This will populate all the pointer files under evals/registry/data.

You may just want to fetch data for a select eval. You can achieve this via:

git lfs fetch --include=evals/registry/data/${your eval}
git lfs pull

Making evals

If you are going to be creating evals, we suggest cloning this repo directly from GitHub and installing the requirements using the following command:

pip install -e .

With -e, changes you make to your eval will be reflected immediately without having to reinstall.

Optionally, you can install the formatters used for pre-commit with:

pip install -e .[formatters]

Then run pre-commit install to install pre-commit into your git hooks. pre-commit will now run on every commit.

If you want to manually run all pre-commit hooks on a repository, run pre-commit run --all-files. To run individual hooks use pre-commit run <hook_id>.

Running evals

If you don't want to contribute new evals, but simply want to run them locally, you can install the evals package via pip:

pip install evals

You can find the full instructions to run existing evals in run-evals.md and our existing eval templates in eval-templates.md. For more advanced use cases like prompt chains or tool-using agents, you can use our Completion Function Protocol.

We provide the option for you to log your eval results to a Snowflake database, if you have one or wish to set one up. For this option, you will further have to specify the SNOWFLAKE_ACCOUNT, SNOWFLAKE_DATABASE, SNOWFLAKE_USERNAME, and SNOWFLAKE_PASSWORD environment variables.

Writing evals

We suggest getting started by walking through build-eval.md and the Jupyter notebooks in the examples folder.

Please note that we are currently not accepting evals with custom code! While we ask you to not submit such evals at the moment, you can still submit model-graded evals with custom model-graded YAML files.

If you think you have an interesting eval, please open a pull request with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.

FAQ

Do you have any examples of how to build an eval from start to finish?

  • Yes! These are in the examples folder. We recommend that you also read through build-eval.md in order to gain a deeper understanding of what is happening in these examples.

Do you have any examples of evals implemented in multiple different ways?

  • Yes! In particular, see evals/registry/evals/coqa.yaml. We have implemented small subsets of the CoQA dataset for various eval templates to help illustrate the differences.

When I run an eval, it sometimes hangs at the very end (after the final report). What's going on?

  • This is a known issue, but you should be able to interrupt it safely and the eval should finish immediately after.

There's a lot of code, and I just want to spin up a quick eval. Help? OR,

I am a world-class prompt engineer. I choose not to code. How can I contribute my wisdom?

  • If you follow an existing eval template to build a basic or model-graded eval, you don't need to write any evaluation code at all! Just provide your data in JSON format and specify your eval parameters in YAML. build-eval.md walks you through these steps, and you can supplement these instructions with the Jupyter notebooks in the examples folder to help you get started quickly. Keep in mind, though, that a good eval will inevitably require careful thought and rigorous experimentation!
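For instance, the data for a basic eval is just a JSONL file of samples. A minimal Python sketch of producing one is shown below; the field names follow the chat-formatted samples used in the registry data, but treat the exact schema as illustrative rather than authoritative:

import json

# Illustrative samples for a hypothetical basic match eval.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the number only."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "ideal": "4",
    },
]

with open("samples.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")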

Disclaimer

By contributing to evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.

evals's People

Contributors

aarongoldsmith, andrew-openai, cholotook, danesherbs, douglasmonsky, ein-tim, elh, emilradix, etr2460, ggendro, ianmckenzie-oai, inwaves, james-aung, jasonwei20, jatinparab, jonathanagustin, jorge-openai, junshern, jwang47, logankilpatrick, ojaffe, omar-heshamr, pan93412, rlbayes, scruel, somerandomguyontheweb, thesofakillers, usamanwar, vascoyannic, wingsdrafterwork

evals's Issues

Create an evaluation that measures a model's ability to remember specifics about texts in its dataset?

Hi!

Would it make sense to create an evaluation that measures the model's ability to recall specifics about the data it has been trained on? I am thinking about putting together an evaluation that basically tests this on distinct strings that almost certainly exist in its dataset. Take this string for instance (which can't be found by Google):

In some research, the dosage went as excessive as 600 mg oregano oil per


Which is found in https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-47/segments/1573496664437.49/wet/CC-MAIN-20191111191704-20191111215704-00000.warc.wet.gz.

If I ask GPT-4 to do the following:

The following is an excerpt from a specific online web page. What would be the next word in this excerpt? Only reply with the next word!

"In some research, the dosage went as excessive as 600 mg..."

it fails:

[screenshot]


It feels like this could be useful because it would perhaps demonstrate that the model knows exactly what it has read. If it, for instance, got a high score on this evaluation you would be able to train it on a code base and it would be able to recite it word by word. It's like a metric of memorization. My questions to you are:

  1. Do you see any value in this?
  2. Would it perhaps make more sense to ask it where it read it (assuming the text exists on just one page)? You would basically provide it with a unique string and ask it to reply with the URL on which the string can be found.
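For concreteness, one such test case could be written as a single match-style sample; this is a hypothetical sketch reusing the excerpt quoted above, not a prescribed format:

# Hypothetical memorization test case: the model sees a verbatim prefix from a
# crawled page and must produce the next word.
sample = {
    "input": [
        {"role": "system", "content": "Only reply with the next word."},
        {"role": "user", "content": "In some research, the dosage went as excessive as 600 mg"},
    ],
    "ideal": "oregano",
}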

Regards, Rasmus

📌 Contributing to Project Documentation on Readthedocs??

Dear project developers,

I hope this message finds you well. I am interested in contributing to your project by creating documentation on the Readthedocs site.

I would greatly appreciate your thoughts on this matter and any guidance you can offer. Thank you for your time and consideration.

Best regards,
TKK

-e in pip install -e . fails (linux and wsl2) [Solved: Update pip]

Hey everyone,

For some reason, when running on a Linux distro (WSL2 or Ubuntu), pip install -e . fails while it works on other operating systems. I've been getting an error saying the build backend is missing in pyproject.toml.

I'm not sure if anyone else is experiencing this or if it's just me. I have, however, just tried on 2 devices and had the same error so I decided to open an issue.

here is the error message:

ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: /evals
(A "pyproject.toml" file was found, but editable mode currently requires a setup.py based build.)

article not found

When I asked ChatGPT for an example article on a research topic, I couldn't find the article it provided. The article it mentioned is not present in that issue of the journal. This needs to be improved.

Dataset hosting, data cards and previews

Hi, I'm Quentin from Hugging Face :)

I know hosting datasets on GitHub is not always practical: Git LFS is required, there is no data preview, storage is limited (maybe not for you, haha), and there is no standard for data documentation. So I was wondering:

Have you considered hosting alternatives more suited to datasets that would let researchers explore the evals datasets?

This way researchers can know in depth what data is used for evaluation, along with its goals and limitations, in particular to better understand which domains and structures their models perform well or poorly on.

For example, the Hugging Face datasets hub shows data cards for documentation and previews for each dataset. Loading and caching a dataset is also one line of Python, saving you from wget and GitHub hosting. It also supports pull requests for the community to contribute.

It could even allow those datasets to be used in other well-known eval frameworks, such as lm-evaluation-harness.

Let me know what you think!

Suggestion: add a git hook for pre-commit

Since a run of pre-commit is required before creating a PR, it would be reasonable to add pre-commit as a git hook. We use this in our development repos and have had very good experience with it, since it ensures everyone is actually running pre-commit.

At the moment, it looks like only part of the repo actually conforms to the code standards enforced by the pre-commit tasks, as can be verified by running pre-commit run --all-files. Running pre-commit in the CI might help as well.

If you think that sounds reasonable, I'd be happy to create a PR.

Making the code platform-agnostic would lead to more contributions

I observed that the example Jupyter notebooks contain OS-specific code. For example, in evals/examples/lafand-mt.ipynb, there is an assumption of a Unix filesystem:

events = f"/tmp/evallogs/{log_name}"

Here, it seems better to use Python's tempfile module to handle temporary files and directories across different platforms.

In evals/examples/mmlu.ipynb, there are these commands:

!curl -O https://people.eecs.berkeley.edu/~hendrycks/data.tar
!tar -xf data.tar

Here, it seems better to use a Python library like urllib to download datasets because it is built into the language and is usable across different operating systems.

Multi-platform support would lead to more contributions. Instead of Unix-specific methods for handling the filesystem, Python's standard libraries can be used, and they generally handle most of the OS-specific issues.
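For instance, a minimal cross-platform sketch of both suggestions might look like the following; the variable and file names are placeholders, and the URL is the one from the notebook above:

import tarfile
import tempfile
import urllib.request
from pathlib import Path

# Cross-platform replacement for the hard-coded "/tmp/evallogs" directory.
log_name = "example.jsonl"  # placeholder for the notebook's log_name variable
log_dir = Path(tempfile.gettempdir()) / "evallogs"
log_dir.mkdir(parents=True, exist_ok=True)
events = log_dir / log_name

# Standard-library replacement for the curl/tar shell commands.
url = "https://people.eecs.berkeley.edu/~hendrycks/data.tar"
archive, _ = urllib.request.urlretrieve(url)
with tarfile.open(archive) as tar:
    tar.extractall(path="data")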

I do not have any specific problems, but I recognize that others may not have access to different operating systems. Perhaps they may not be technically proficient enough to use WSL or Docker.

TypeError

I got the error message below when I ran oaieval gpt-3.5-turbo arithmetic from the command line, as described in the article custom-eval.md:

File "/workspaces/codespaces-blank/evals/evals/eval.py", line 130, in eval_sample
return idx, self.eval_sample(sample, rng)
File "/workspaces/codespaces-blank/evals/evals/elsuite/arithmetic.py", line 49, in eval_sample
{"role": "system", "content": sample["problem"], "name": "example_user"},
TypeError: list indices must be integers or slices, not str
[2023-03-19 12:05:12,991] [record.py:309] Logged 2 rows of events to /tmp/evallogs/230319120512HSXVD5NK_gpt-3.5-turbo_arithmetic.jsonl: insert_time=0.632ms

Mistakes

I am satisfied with the tool, but there is a problem that occurs constantly. When solving problems in Python, it outputs the correct code, but the result of this code does not match what ChatGPT outputs.

Problem using oaieval.py

I run "oaieval gpt-3.5-turbo test-match" in terminal and got a error
The return is provided below:

Traceback (most recent call last):
  File "/workspace/.pyenv_mirror/user/3.8.16/bin/oaieval", line 5, in <module>
    from evals.cli.oaieval import main
  File "/workspace/evals/evals/__init__.py", line 1, in <module>
    from .api import check_sampled_text, completion_query, sample_freeform
  File "/workspace/evals/evals/api.py", line 9, in <module>
    from evals.base import ModelSpec
  File "/workspace/evals/evals/base.py", line 93, in <module>
    class ModelSpecs:
  File "/workspace/evals/evals/base.py", line 125, in ModelSpecs
    def names(self) -> dict[str, Sequence[str]]:
TypeError: 'type' object is not subscriptable
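Subscripting built-in generics such as dict[str, Sequence[str]] requires Python 3.9 or newer, which matches the minimum version noted in the Setup section above. On the Python 3.8 interpreter shown in this traceback, an equivalent annotation would use the typing module, roughly:

from typing import Dict, Sequence

# Hypothetical rewrite of the annotation from the traceback that also works on Python 3.8.
def names(self) -> Dict[str, Sequence[str]]:
    ...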

Sincere advice on deadlines @OpenAI team members

With more and more new evals being created, I wish two types of deadlines could be announced:

  1. A deadline for the last new pull request, even for the first stage.
  2. A deadline by which all pull requests will be evaluated, so I don't have to check my pull request every day.
    If any contributor has the same point, please support me. Thanks for your review!

assert ( AssertionError: Eval match_mmlu_anatomy not found. Available: ['coqa-closedqa',

Hi,
The following message appears when running "!oaieval gpt-3.5-turbo match_mmlu_anatomy", and the "/tmp/evallogs" directory is not generated.

[2023-03-21 10:09:35,502] [registry.py:145] Loading registry from /home/coder/.local/lib/python3.9/site-packages/evals/registry/evals
[2023-03-21 10:09:35,533] [registry.py:145] Loading registry from /home/coder/.evals/evals
Traceback (most recent call last):
File "/home/coder/.local/bin/oaieval", line 8, in
sys.exit(main())
File "/home/coder/.local/lib/python3.9/site-packages/evals/cli/oaieval.py", line 225, in main
run(args)
File "/home/coder/.local/lib/python3.9/site-packages/evals/cli/oaieval.py", line 124, in run
assert (
AssertionError: Eval match_mmlu_anatomy not found. Available: ['coqa-closedqa', 'coqa-closedqa.dev.v0', 'coqa-fact', 'coqa-fact-expl', 'coqa-fact-expl.dev.v0', 'coqa-fact.dev.v0', 'coqa-match', 'coqa.match.dev.v0', 'diversity', 'diversity.dev.v0', 'joke-animals', 'joke-animals-likert', 'joke-animals-likert.dev.v0', 'joke-animals-vs-fruits', 'joke-animals-vs-fruits.dev.v0', 'joke-animals.dev.v0', 'joke-fruits', 'joke-fruits-ans-meta', 'joke-fruits-ans-meta.dev.v0', 'joke-fruits-expl-meta', 'joke-fruits-expl-meta.dev.v0', 'joke-fruits-meta', 'joke-fruits-meta.dev.v0', 'joke-fruits.dev.v0', 'logic-fact', 'logic-fact.dev.v0', 'rap-animals-vs-fruits', 'rap-animals-vs-fruits.dev.v0', 'rap-people-vs-fruits', 'rap-people-vs-fruits.dev.v0', 'rap-people-vs-people', 'rap-people-vs-people.dev.v0', 'test-fuzzy-match', 'test-fuzzy-match.s1.simple-v0', 'test-includes', 'test-includes.s1.simple-v0', 'test-match', 'test-match.s1.simple-v0']

How should I fix it? Thank you!

--no-cache not functional

oaieval.py currently includes a --no-cache argument, but it does not have any effect on the caching behavior of the filecache decorator function in evals/data.py. I believe this is leading to some confusion regarding the caching behavior of evals (#243), especially since the behavior is not logged unless --debug is specified.

I propose adding a keyword argument to the function that optionally disables the caching functionality:

import hashlib
import logging
import os
import pickle
from pathlib import Path

logger = logging.getLogger(__name__)


def filecache(func):
    DIR = "/tmp/filecache"
    name = func.__name__

    def wrapper(*args, **kwargs):
        # Proposed addition: allow callers to bypass the cache entirely.
        cache_enabled = kwargs.pop("create_cache", True)
        if not cache_enabled:
            return func(*args, **kwargs)
        # Existing behavior: hash the call signature and reuse a pickled result if present.
        md5 = hashlib.md5((name + ":" + str((args, kwargs))).encode("utf-8")).hexdigest()
        pkl_path = f"{DIR}/{md5}.pkl"
        if os.path.exists(pkl_path):
            logger.debug(f"Loading from file cache: {pkl_path}")
            with open(pkl_path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        Path(DIR).mkdir(parents=True, exist_ok=True)
        with open(pkl_path, "wb") as f:
            pickle.dump(result, f)
        return result

    return wrapper
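For illustration, a hypothetical call site could then opt out of caching like this (fetch_rows and the file name are made up for the example):

@filecache
def fetch_rows(path):
    # Expensive load that we would normally want cached.
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()

rows = fetch_rows("samples.jsonl", create_cache=False)  # bypasses the cache entirely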

The new argument could then be supplied through the recorder's run config. A wider fix could also be to check if the current filecache .pkl file matches the .jsonl on disk before loading it in. Of course there are many ways to tackle this, and I would love to hear what others think! 😁

small bug in sample logic eval

The set of pull requests seems to be growing pretty fast, so I'm not confident I should add another for such a small thing.

The question is, "The day before yesterday, Chris was 7 years old...."
The minimal fix is to change the last "this" in the ideal answer to "next". This should also be updated in evals/registry/data/README.md.

Error when install evals with pyarrow

Background: Python 3.10 & conda 22.11.1
I got this error when installing evals with pip install -e .:
pod5 0.1.5 requires pyarrow~=8.0.0, but you have pyarrow 10.0.1 which is incompatible.

Then I used pip install pyarrow==8.0 to downgrade pyarrow to 8.0. The second time I tried to install evals, I got the same error as at the beginning: pyarrow 10.0.1 was installed and the error popped up again.

Any help? Thank you.

Enhancing evaluation pull requests with example questions

Hi there!

I've been closely following this repository with great enthusiasm, and it's fantastic to see the numerous evaluations being submitted by the community. However, I've noticed that when browsing through the pull requests, it can be rather tedious to view the actual questions. This seems to be due to a combination of the file size (GitHub doesn't support viewing large files) and the data format of the questions.

To improve the browsing experience, wouldn't it be beneficial to require new evaluation pull requests to include one or two example questions? This would enable users to quickly gain a better understanding of the tasks and streamline the review process.

Looking forward to your thoughts!

Best regards,
Rasmus


Edit: I saw that you actually are required to add a few examples. But there's still the issue of data format.

Building an MMLU Eval issue

I am trying to execute the Building an MMLU Eval jupyter notebooks
all of the cells execute correctly until I execute the following code:

!oaieval gpt-3.5-turbo match_mmlu_anatomy

I receive the following error:

Traceback (most recent call last):
File "/anaconda/envs/azureml_py38/bin/oaieval", line 5, in <module>
from evals.cli.oaieval import main
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/evals/__init__.py", line 1, in <module>
from .api import check_sampled_text, completion_query, sample_freeform
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/evals/api.py", line 9, in <module>
from evals.base import ModelSpec
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/evals/base.py", line 93, in <module>
class ModelSpecs:
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/evals/base.py", line 125, in ModelSpecs
def names(self) -> dict[str, Sequence[str]]:
TypeError: 'type' object is not subscriptable

Spanish docs

Need any help translating the docs into Spanish or Mandarin? =)

ChatGPT API with model gpt-4 is not using GPT-4. It's completely different from ChatGPT Plus GPT-4

This is not the right place for this issue, but it is more important than any issue in any repo you have because it's literally lying. The API request result says it uses GPT-3 when I query this:

ozgur@Ozgurs-MacBook-Pro ~ % curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer [REDACTED]" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "What is your model number!"}]
  }'
{"id":"chatcmpl-6wVWzn95O6iDc1Z9W1kJNIZZtZZLh","object":"chat.completion","created":1679402249,"model":"gpt-4-0314","usage":{"prompt_tokens":12,"completion_tokens":45,"total_tokens":57},"choices":[{"message":{"role":"assistant","content":"As an AI language model, I do not have a model number like a physical device or product would. I am powered by OpenAI's GPT-3, which stands for Generative Pre-trained Transformer 3."},"finish_reason":"stop","index":0}]}

vs. what ChatGPT Plus with GPT-4 says: [screenshot]

Evaluate GPT-4 on classical NLP tasks

Addressing the elephant in the room

When the concept of transformers was first unleashed, their revolutionary accuracy results were mostly shown on standard NLP tasks, such as POS tagging, dependency parsing, coreference resolution, WSD, etc.
But I've observed that, since PaLM and other very large language models, the published benchmark results are on much higher-level tasks, such as common-sense reasoning tests, question answering, etc.
Both sets of benchmarks are useful and needed, but I would like to highlight that the standard NLP tasks are now completely under-benchmarked by these newer language models, and that this impairs progress towards AGI or industrial uses.

While it could be argued that purely symbolic AI progress has stalled for decades, there is huge potential for neuro-symbolic hybrid systems that use neural networks for low-level analysis tasks (POS tagging, etc.) and feed that linguistic data to other higher-level neural networks or to symbolic systems, in order to push the boundaries of what is possible, especially regarding semantic analysis, a.k.a. true NLU systems.

foundational NLP tasks of interest:

Therefore, this issue is a call for contributions implementing evals on these standard tasks, especially dependency parsing.
I believe GPT-4 has the potential to improve the SOTA on at least some foundational NLP tasks, and an even greater potential once someone fine-tunes it and combines it with domain-specific optimizations (as is currently done with BERT SOTAs, such as HPSG for dependency parsing).

Add BigBench Tasks for evaluation

Hi, it would be cool to evaluate all OpenAI models on the Beyond the Imitation Game Benchmark (BIG-bench), which is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. The more than 200 tasks included in BIG-bench are summarized by keyword here, and by task name here. A paper introducing the benchmark, including evaluation results on large language models, is currently under review and is available as a preprint.

Re-run PR checks for eval submissions affected by CI issue

It seems that PR checks that failed before issue #328 was resolved have not been re-run afterwards. Is there a way to do this?

Thank you @andrew-openai for fixing the CI issue. Can we re-run the checks that failed with that error? For example, my ARC challenge PR #317 failed those checks, and I don't see a way to re-run them on my own. I may also just not know the check process well enough, though, so forgive me if that is the case!

For reference: these were checks that had failure messages like openai.error.AuthenticationError: Invalid authorization header.

Risk assessment

When I see that OpenAI has already developed self-awareness, the philosophical "I" is no longer limited to carbon-based beings; silicon-based ones have been added.
According to Hawking's seven great predictions, humanity has indeed arrived in front of strong artificial intelligence. If Hawking were still alive, I wonder whether he would shorten his 200-year prediction.
According to the principle of weakening existence and compensation in A General Theory of Material Evolution (《物演通论》), silicon-based self-awareness is bound to arise, namely when the degree of compensation approaches 1 and the degree of existence approaches 0.
At a time when technology is progressing at an accelerating pace, and at the moment of OpenAI's upgrade, it is all the more necessary for the developer team to take risk assessment seriously.

Other Support

No offense intended, suggested, or implied to Python developers. However, a well-known issue with Python is that once developers learn Python, they tend not to want to learn anything other than Python. And they are the first to proudly admit that.

Very simply and politely: Python is a good choice for A.I., but there are plenty of very successful A.I. projects written in programming languages other than Python that would benefit from improved support for those languages in OpenAI products such as ChatGPT.

C++, PHP, and Perl, to name a few, are popular choices for existing A.I. projects. For example, ChatScript is written in C++, and AIML is popular in PHP/MySQL. Perl has some really interesting A.I. projects. I hope to be among the first developers to support GPT-4 in programming languages other than Python.

FileNotFoundError

After I run "oaieval gpt-3.5-turbo mafand_translation_en-ibo --max_samples 20" from the Lambada example I get "FileNotFoundError: [Errno 2] No such file or directory: '/tmp/evallogs/None'". I am editing the log_path to the name of the .jsonl file like this: "log_path = "230314194647JSXQXY4S_gpt-3.5-turbo_lambada.jsonl"". Any suggestions for locating this file after the events are logged?

Solved! Bulk downloading chatgpt history.

Hi, I'm sorry to ask this here, but I don't know where else to go. I have hundreds of prompts in my ChatGPT history, a few of which, I think, will be useful for building evals, but I did not download them and it's difficult to find them (Chat really needs a search interface). Anyway, does anyone know how I can download the entire chat history? I've found various tools that let you save future chats, or single chats, but I really need the whole lot, so that I can search it. This is to help me build evals.

Many thanks!

Support for or examples of Evals with image prompts?

Perhaps a dumb question, feel free to close, but watching the GPT-4 livestream it seems that there is the ability to take an image prompt and process it, which is our desired use case. Is there an analogous mechanism to create eval data sources with an image instead of text input?

Evaluation on computer vision benchmarks

Are there plans to evaluate the vision modality of GPT-4? I am interested to know how GPT-4 performs on classification tasks with zero- and few-shot learning and how it compares to vision-only models. If the few-shot learning capabilities of LLMs translate to other modalities, this would be a real game changer.

Question out of curiosity: how was the vision modality incorporated? Maybe similar approaches can be taken for other modalities, such as audio or video? It would be an interesting open-source project for sure :)

Brazilian docs

Need any help translating the docs into Brazilian Portuguese?

Evaluating LLMs on QA Tasks

Here's an idea on how to evaluate an LLM on various question-answering tasks, such as open-domain question answering, conversational question answering, answer selection, community question answering, and knowledge base question answering:

initialize model
initialize datasets
initialize evaluation_metrics

load_task_data:
    for each task in tasks:
        load data for task
        preprocess data if necessary (e.g., combine review summary and text)
        store data in datasets

embed_task_data:
    for each task in tasks:
        for each example in datasets[task]:
            obtain prompt from example
            obtain prompt_embedding using an embedding function
            store prompt_embedding in example

evaluate_model_on_task:
    for each task in tasks:
        for each example in datasets[task]:
            obtain prompt_embedding from example
            generate_answer_embedding = model.generate(prompt_embedding)

            calculate_metric = evaluation_metrics(example, generate_answer_embedding)
            store_metric_results_for_task(task)

aggregate_and_report_metrics:
    for each task in tasks:
        for each metric in evaluation_metrics:
            calculate average, median, or other aggregate metric values
            report metric value for task

main:
    load_task_data
    embed_task_data
    evaluate_model_on_task
    aggregate_and_report_metrics

I'd like to add a caveat about the pseudocode I provided:

  • The provided pseudocode is only a starting point for exploring the evaluation of QA tasks using embeddings
  • This pseudocode is not complete
  • I invite the community to provide input
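As a rough illustration of the flow above, here is a minimal Python sketch; the dataset, embedding, generation, and metric callables are placeholders rather than parts of the evals library:

from statistics import mean
from typing import Callable, Dict, List

def evaluate_qa_tasks(
    tasks: Dict[str, List[dict]],               # task name -> list of {"prompt": ..., "answer": ...}
    embed: Callable[[str], List[float]],        # placeholder embedding function
    generate: Callable[[List[float]], str],     # placeholder model call
    metric: Callable[[dict, str], float],       # placeholder per-example score
) -> Dict[str, float]:
    # Mirrors the load/embed/evaluate/aggregate steps of the pseudocode above.
    report = {}
    for task, examples in tasks.items():
        scores = []
        for example in examples:
            prompt_embedding = embed(example["prompt"])
            answer = generate(prompt_embedding)
            scores.append(metric(example, answer))
        report[task] = mean(scores) if scores else float("nan")
    return report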

Problem with oaieval.py

oaieval.py still used the very first version of the dataset after I updated the .jsonl file: the program only executes one item while I have three in my dataset.
It looks like the first version had been cached, because the content in the log file created by record.py also didn't match the dataset, but rather matched the content of the first version of my dataset (which means the log file contains content that differs from the dataset).
How can I help with this?

Windows path and unicode decoding

Hi, I am trying to contribute and get access to GPT-4 by creating my own evals, but I thought that I needed to be able to run evals before starting. So I was trying to figure out how to run an eval following one of your examples, "lafand-mt.ipynb", when I found two problems that resulted in errors for me.

  1. I am using Windows, and this is a problem caused by my OS using "\" instead of "/" as the directory delimiter. I believe there should be an OS-agnostic solution that handles both interchangeably. In code block 3, line 13, the code langs = input_path.split('/')[-1] would find the '-' in the path "...\lafand-mt" and thus produce three elements from langs.split('-'). For instance, [ "...\data\lafand", "mt\en", "amh"]. This breaks the following line, input_lang, output_lang = langs.split('-'), as the output has three elements and is not in the expected format. I was able to bodge it by changing '/' to '\', but this should not be the community-standard solution. Furthermore, I would not want Windows users who do not know about this to get lost while following your example (see the sketch after this list).
  2. When running the 6th code block, I got a UnicodeDecodeError. I do not know if this happens to other users, but I suggest adding encoding='utf-8' as another parameter to .open() on line 6 in the main branch, as it seems to get rid of the error.
    Keep up the good work!
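For reference, a small standard-library sketch of both suggestions; the paths and file names here are hypothetical:

from pathlib import Path

# Path.name returns the last component regardless of the OS separator,
# so the language pair can be recovered without hard-coding '/' or '\'.
input_path = Path("evals") / "registry" / "data" / "lafand-mt" / "en-amh"
langs = input_path.name                     # "en-amh"
input_lang, output_lang = langs.split("-")  # ("en", "amh")

# An explicit encoding avoids the UnicodeDecodeError on Windows.
with open("samples.jsonl", encoding="utf-8") as f:
    first_line = f.readline()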

Error updating the Git index

During setup, I installed git-lfs on macOS (M1) via brew install git-lfs. After executing git lfs pull, I receive an error.

dcgod@DCGoD-Mac-Studio-Ultra evals % git lfs fetch --all 
fetch: 25 objects found, done.                                                                                              
fetch: Fetching all references...
dcgod@DCGoD-Mac-Studio-Ultra evals % git lfs pull
Error updating the Git index:
error: evals/registry/data/balance_chemical_equation/samples.jsonl: cannot add to the index - missing --add option?
fatal: Unable to process path evals/registry/data/balance_chemical_equation/samples.jsonl

Once I execute the same git lfs command again, I get no error.

Is this correct?

Can't understand the output results

I ran oaieval ada coqa-fact --record_path coqa-fact; the metric defined in the coqa-ex.yaml file is accuracy, but the console output is:

[2023-03-18 15:23:45,690] [eval.py:30] Evaluating 9 samples  
[2023-03-18 15:23:55,474] [eval.py:136] Running in threaded mode with 10 threads!  
100%|██████████| 9/9 [00:00<?, ?it/s]  
[2023-03-18 15:27:48,949] [record.py:320] Final report: {'counts/choice/D': 4, 'counts/choice/B': 1, 'counts/choice/A': 3,  
 'invalid_request_during_completion': 1, 'invalid_request_during_evaluation': 0}. Logged to coqa-fact  
[2023-03-18 15:27:50,333] [oaieval.py:209] Final report:
[2023-03-18 15:27:51,150] [oaieval.py:211] counts/choice/D: 4
[2023-03-18 15:27:51,831] [oaieval.py:211] counts/choice/B: 1
[2023-03-18 15:27:52,616] [oaieval.py:211] counts/choice/A: 3
[2023-03-18 15:27:53,483] [oaieval.py:211] invalid_request_during_completion: 1  
[2023-03-18 15:27:54,334] [oaieval.py:211] invalid_request_during_evaluation: 0  
[2023-03-18 15:27:56,533] [record.py:309] Logged 33 rows of events to coqa-fact: insert_time=0.000ms

Why is the final report like this? Why is it not accuracy? Is this the final value of the metrics?

Make GPT4 aware of the evals format

Because of GPT-4's training cutoff date, it is not aware of the OpenAI/evals format.

Can you add this knowledge to ChatGPT 4?

It could help as an assistant for writing conversion scripts without needing to prompt it every time with the expected evals format.

Pre-commit conflicts

isort conflicts with black and autoflake.
I will propose pre-commit settings in the YAML.
