
AGIEval

This repository contains the data, code, and baseline-system outputs for the AGIEval benchmark.

Introduction

AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. For a full description of the benchmark, please refer to our paper: AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.

Tasks and Data

We have updated the dataset to version 1.1. The new version updates the Chinese Gaokao (chemistry, biology, physics) datasets with questions from 2023 and addresses annotation issues. To facilitate evaluation, all multiple-choice question (MCQ) tasks now have a single answer (Gaokao-Physics and JEC-QA used to have multi-label answers). The AGIEval-en datasets remain the same as in version 1.0. The new version's statistics are as follows:

AGIEval v1.1 contains 20 tasks, including 18 MCQ tasks and two cloze tasks (Gaokao-Math-Cloze and MATH). You can find the full list of tasks in the table below.

[Table: The datasets used in AGIEval]

You can download all post-processed data in the data/v1_1 folder. All usage of the data should follow the license of the original datasets.

The data format for all datasets is as follows:

{
    "passage": null,
    "question": "设集合 $A=\\{x \\mid x \\geq 1\\}, B=\\{x \\mid-1<x<2\\}$, 则 $A \\cap B=$ ($\\quad$)\\\\\n",
    "options": ["(A)$\\{x \\mid x>-1\\}$", 
        "(B)$\\{x \\mid x \\geq 1\\}$", 
        "(C)$\\{x \\mid-1<x<1\\}$", 
        "(D)$\\{x \\mid 1 \\leq x<2\\}$"
        ],
    "label": "D",
    "answer": null
}

The passage field is available for gaokao-chinese, gaokao-english, both LogiQA tasks, all LSAT tasks, and the SAT tasks. The answer for multiple-choice tasks is saved in the label field; the answer for cloze tasks is saved in the answer field.
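For example, here is a minimal sketch of reading one of the post-processed files, assuming the JSON Lines layout shown above (one JSON object per line); the file name is illustrative:

import json

# Read one post-processed file; each line is a JSON object with the schema shown above.
with open("data/v1_1/gaokao-mathqa.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

first = examples[0]
print(first["passage"])   # None for tasks without a passage
print(first["question"])
print(first["options"])   # answer choices for MCQ tasks
print(first["label"])     # MCQ answer; cloze answers are stored in first["answer"]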

We provide the prompts for few-shot learning in the data/few_shot_prompts file.

Baseline Systems

We evaluate the performance of the baseline systems (gpt-3.5-turbo and GPT-4o) on AGIEval v1.1; the results are reported in the Leaderboard section below.

You can replicate the results by following the steps below:

  1. Update your OpenAI API credentials in the openai_api.py file.
  2. Run the run_prediction.py script to get the results.

Evaluation

You can run the post_process_and_evaluation.py file to get the evaluation results.

Leaderboard

We report the leaderboard on AGIEval v1.1. The leaderboard contains two subsets: AGIEval-en and AGIEval-zh. Both subset leaderboards contain only MCQ tasks. The leaderboards are as follows:

AGIEval-en few-shot

Model            Source   Average
GPT-4o           Link     71.4
Llama 3 400B+    Link     69.9
Llama 3 70B      Link     63
Mixtral 8x22B    Link     61.2
GPT-3.5-Turbo    Link     52.7
Llama 3 8B       Link     45.9
Gemma 7B         Link     44.9
Mistral 7B       Link     44

AGIEval-zh few-shot

Model            Source   Average
GPT-4o           Link     71.9
GPT-3.5-Turbo    Link     49.5

AGIEval-all few-shot

Model            Source   Average
GPT-4o           Link     69.0
GPT-3.5-Turbo    Link     47.2

AGIEval-en zero-shot

Model            Source   Average
GPT-4o           Link     65.2
GPT-3.5-Turbo    Link     54.1

AGIEval-zh zero-shot

Model            Source   Average
GPT-4o           Link     63.3
GPT-3.5-Turbo    Link     45.0

AGIEval-all zero-shot

(An asterisk indicates results reported on AGIEval v1.0.)

Model                    Source   Average
GPT-4o                   Link     62.3
InternLM2-20B*           Link     53.0
Qwen-14B*                Link     52.0
Phi-3-medium 14b*        Link     50.2
InternLM2-Chat-7B-SFT*   Link     49.0
GPT-3.5-Turbo            Link     46.0
Qwen-7B*                 Link     45.6
Mixtral 8x7b*            Link     45.2
Phi-3-small 7b*          Link     45.1
Gemma 7b*                Link     42.1
Llama-3-In*              Link     42.0
Phi-3-mini 3.8b*         Link     37.5
Mistral 7b*              Link     35.1
Phi-2 2.7b*              Link     29.8

Citation

If you use the AGIEval benchmark or the code in your research, please cite our paper:

@misc{zhong2023agieval,
      title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models}, 
      author={Wanjun Zhong and Ruixiang Cui and Yiduo Guo and Yaobo Liang and Shuai Lu and Yanlin Wang and Amin Saied and Weizhu Chen and Nan Duan},
      year={2023},
      eprint={2304.06364},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


agieval's Issues

parse_result error in gaokao-physics-zero-shot

The model output is:

"model_output": "选项 (B) $3.3\mathrm{MeV}$。", "parse_result": ["B", "M", "V"], "label": "B", "is_correct": false

Because gaokao-physics has multi-label answers, the parser takes every uppercase letter, which turns a correct answer into an error.

def parse_qa_multiple_answer(string, setting_name):
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)
    # captures every uppercase letter, optionally wrapped in parentheses
    pattern = r"\(*([A-Z])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []

Maybe we can restrict matches to a candidate answer list like ["A", "B", "C", "D", "E", "F"] to reduce the probability of this error?

def parse_qa_multiple_answer(string, setting_name):
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)
    # restrict matches to the plausible option letters A-F
    pattern = r"\(*([A-F])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []
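For illustration, the two patterns can be compared directly on the model output quoted above (a minimal, self-contained check):

import re

output = r"选项 (B) $3.3\mathrm{MeV}$。"  # the model output quoted in this issue

print(re.findall(r"\(*([A-Z])\)*", output))  # ['B', 'M', 'V']: also captures the unit "MeV"
print(re.findall(r"\(*([A-F])\)*", output))  # ['B']: restricted to plausible option letters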

Error in gaokao-chemistry dataset

The options are wrong in this data entry:
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L108

{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是(  )", "options": ["錀   第七周期", "镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}

It should be

{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是(  )", "options": ["(A)錀   第七周期", "(B)镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}

Unicode escape sequences in the json data

If you inspect aqua-rat.jsonl (and other datasets), there are unicode escape sequences throughout the data.

{"passage": null, "question": "A car is being driven, in a straight line and at a uniform speed, towards the base of a vertical tower. The top of the tower is observed from the car and, in the process, it takes 10 minutes for the angle of elevation to change from 45\u00b0 to 60\u00b0. After how much more time will this car reach the base of the tower?", "options": ["(A)5(\u221a3 + 1)", "(B)6(\u221a3 + \u221a2)", "(C)7(\u221a3 \u2013 1)", "(D)8(\u221a3 \u2013 2)", "(E)None of these"], 

This can be prevented by going back to the original script used to write out the data, passing ensure_ascii=False to json.dumps, and encoding the line as UTF-8 before writing, like so:

f.write((json.dumps(row, ensure_ascii=False) + '\n').encode('utf-8'))
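As a minimal sketch of the same fix (the row content and file name here are illustrative), opening the file in text mode as UTF-8 avoids the manual encoding step:

import json

rows = [{"question": "angle of elevation from 45° to 60°", "options": ["(A)5(√3 + 1)"]}]

with open("aqua-rat.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        # ensure_ascii=False keeps characters such as ° and √ readable instead of \uXXXX escapes
        f.write(json.dumps(row, ensure_ascii=False) + "\n")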

Several problems in logiqa-zh

There are several problems in logiqa-zh, e.g.

[ "A 没有党参", "B 没有首乌", "C 有白术", "D 没有白术" ]

and it should be

[ "(A)没有党参", "(B)没有首乌", "(C)有白术", "(D)没有白术" ]

About API_dic

How do I get the custum_api_name? Why am I getting these errors?

multi-thread n = 3
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1

Details about the data collection

Thanks for your awesome work! I notice that Gaokao is an important part of your dataset, but most Gaokao papers are not freely available online. Could you please explain how you collected the Gaokao data? Thanks in advance :)

SAT-Math corpus includes incomplete data

The sat-math corpus contains incomplete questions, which may make them impossible to answer. For example:

{"passage": "", "question": "Which of the following is equivalent to the expression above?" ...

Multiple choice in gaokao-mathqa dataset

There are about 7 questions with multiple correct answers in the gaokao-mathqa dataset, e.g.
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-mathqa.jsonl#L149

{"passage": null, "question": "函数 $f(x)=\\sin (2 x+\\varphi)(0<\\varphi<\\pi)$ 的图象以 $\\left(\\frac{2 \\pi}{3}, 0\\right)$ 中心对称, 则 ($\\quad$)\\\\\n", "options": ["(A)$y=f(x)$ 在 $\\left(0, \\frac{5 \\pi}{12}\\right)$ 单调递减", "(B)$y=f(x)$ 在 $\\left( -\\frac{\\pi}{12}, \\frac{11 \\pi}{12}\\right)$ 有 $2$ 个极值点", "(C)直线 $x= \\frac{7 \\pi}{6} $ 是一条对称轴", "(D)直线 $y= \\frac{\\sqrt{3}}{2} - x $ 是一条切线"], "label": "AD", "answer": null, "other": {"source": "2022年全国新高考II卷数学"}}

which doesn't match the multi-answer label format used in gaokao-physics, i.e. ["A", "D"].
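Such entries can be located with a quick scan (a hedged sketch; the path mirrors the repository layout referenced above):

import json

# Flag entries whose label contains more than one answer letter, e.g. "AD".
with open("data/v1/gaokao-mathqa.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        label = json.loads(line).get("label") or ""
        if isinstance(label, str) and len(label) > 1:
            print(line_no, label)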

gaokao-english dirty data

The gaokao-english dataset has a dirty entry.

The question is

The engineer Camillo Oliver was 40 years old when he started the company in 1908. At his factory in Ivrea, he designed and produced the first Italian typewriter. Today the company's head office s still in Ivrea, near Turin, but the company is much larger than it was in those days and there are offices all around the world.By 1930 there was a staff of 700 and the company turned out 13,000 machines a year. Some went to customers in Italy, but Olivetti exported more typewriters to other countries.Camillo's son, Adriano, started working for the company in 1924 and later he became the boss. He introduced a standard speed for the production line and he employed technology and design specialists. The company developed new and better typewriters and then calculators(计算机). In 1959 it produced the ELEA computer system. This was the first mainframe(主机)computer designed and made in Italy.After Adriano died in 1960, the company had a period of financial problems. Other companies, especially the Japanese, made faster progress in electronic technology than the Italian company. In 1978, Carlo de Benedetti became the new boss. Olivetti increased its marking and service networks and made agreements with other companies to design and produce more advanced office equipment. Soon it became one of the world's leading companies in information technology and communications. There are now five independent companies in the Olivetti group—one for personal computers, one for Systems and services, and two for telecommunications.

The options are:

['(A)It produced the best typewriter in the world.     ', '(B)It designed the world’s firs![](data:image/png;base64,...)t mainframe computer.', '(C)It exported more typewriters than other companies.', '(D)It has five independent companies with its head office in Ivrea.']

Option B contains a dirty string (an embedded base64 image).

There is a format error in the data, and an error may be raised when parsing the JSON. In addition, it is strongly recommended to clean the data to provide users with higher-quality evaluation data.

https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L75
{"passage": null, "question": "水溶液呈酸性的是( $)$", "options": ["(A)$\\mathrm{NaCl}$", "(B)$\\mathrm{NaHSO}_{4}$", "(C)HCOONa", "(D)$\mathrm{NaHCO}_{3}"], "label": "B", "answer": null, "other": {"source": "2020年浙江省高考化学【7月】"}}
Option D is missing a backslash \
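A minimal sketch for catching such lines before they break downstream parsing (path as referenced above; the helper name is illustrative):

import json

def report_unparseable_lines(path):
    # An unescaped backslash such as "\mathrm" is an invalid JSON escape and fails to parse.
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                print(f"{path}:{line_no}: {e}")

report_unparseable_lines("data/v1/gaokao-chemistry.jsonl")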

Will human evaluation results be public?

I am interested in the human evaluation results, but there are only 4 pictures. I would like to know whether the results (detailed or overall numeric results) will be made public.

Bug in Dataset Loader for Few-Shot Multiple Choice Questions

I've noticed that the current code uses the expression demo + question. However, I believe the correct expression should be demo + question_input. By using demo + question, the previously defined question_input is not being utilized and some multiple-choice questions may lack options in the prompt. Please consider updating the code to reflect this change for proper functionality. Thank you!

https://github.com/microsoft/AGIEval/blob/main/src/dataset_loader.py#L215
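A hedged sketch of the suggested change (variable names follow the issue; the surrounding loader code is assumed, not quoted from the repository):

def build_prompt(demo, question, question_input):
    # question_input is assumed to be the question text already combined with its options;
    # previously the loader used demo + question, which drops the options for some MCQ items.
    return demo + question_input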

Dirty data in the dataset.

Hi, when I parsed the dataset's options, I found abnormal behavior: the length of the options list differs from others in the same subcategory.

  1. In gaokao-chemistry.jsonl, line 190's options include invalid options (which are actually the question's analysis). The length of options is actually 7, not 4; after option "D", there is a fifth option.

  2. Missing options.

  • In sat-en-without-passage.jsonl, line 17's options miss option D which should be "They may increase in value as those same resources become rare on Earth." reference

  • In sat-en-without-passage.jsonl, line 57's options miss option D which should be "No, because the data do not indicate whether the honeybees had been infected with mites." while the label is "D". reference

  • In sat-en-without-passage.jsonl, line 98's options miss option D which should be "Published theories of scientists who developed earlier models of the Venus flytrap". You can refer to question 11 in reference.

The same goes for sat-en.jsonl in lines 17, 57, and 98.

  3. In jec-qa-kd.jsonl, line 212's label is empty, and the content is also dirty.
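A small consistency check along these lines (a sketch only; the paths and the expected option count of 4 are assumptions based on the cases above):

import json

def check_subset(path, expected_options=4):
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            item = json.loads(line)
            options = item.get("options") or []
            if len(options) != expected_options:
                print(f"{path}:{line_no} has {len(options)} options")
            if not item.get("label") and not item.get("answer"):
                print(f"{path}:{line_no} has an empty label")

check_subset("data/v1/sat-en-without-passage.jsonl")
check_subset("data/v1/jec-qa-kd.jsonl")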

The few-shot prompt format is different in the gaokao-geography dataset

The few-shot prompts in the gaokao-geography dataset look like this:

{'passage': None, 'question': '在某城市中心,一种创新型绿色建筑一垂直森林高层住宅落成面世。它是在建筑的垂直方向上,覆盖满本地乔木、灌木和草本等植物,为每层住户营造“空中花园”,形成具有森林效应的生态居住群落。与传统设计相比,“垂直森林”在居住空间设计上变化最大的地方是( )', 'options': ['A. 阳台\tB. 客厅\tC. 卧室\tD. 厨房'], 'label': 'A', 'answer': None, 'other': {'source': '2022年湖北省高考地理试题'}}

It should be

'options': ['(A)阳台', '(B)客厅', '(C)卧室', '(D)厨房']
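A sketch of one way to split such a tab-joined option string into the expected per-option format (the regex is illustrative):

import re

def split_options(joined):
    # 'A. 阳台\tB. 客厅\t...' -> ['(A)阳台', '(B)客厅', ...]
    parts = [p.strip() for p in joined.split("\t") if p.strip()]
    return [re.sub(r"^([A-F])[.\s]*", r"(\1)", p) for p in parts]

print(split_options("A. 阳台\tB. 客厅\tC. 卧室\tD. 厨房"))
# ['(A)阳台', '(B)客厅', '(C)卧室', '(D)厨房']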
