
AGIEval

This repository contains the data, code, and baseline-system outputs for the AGIEval benchmark.

Introduction

AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. For a full description of the benchmark, please refer to our paper: AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.

Tasks and Data

We have updated the dataset to version 1.1. The new version updates the Chinese Gaokao (chemistry, biology, physics) datasets with questions from 2023 and addresses annotation issues. To facilitate evaluation, all multiple-choice question (MCQ) tasks now have a single answer (Gaokao-Physics and JEC-QA used to have multi-label answers). The AGIEval-en datasets remain the same as in version 1.0. The new version's statistics are as follows:

AGIEval v1.1 contains 20 tasks, including 18 MCQ tasks and two cloze tasks (Gaokao-Math-Cloze and MATH). You can find the full list of tasks in the table below.

[Table: The datasets used in AGIEval]

You can download all post-processed data in the data/v1_1 folder. All usage of the data should follow the license of the original datasets.

The data format for all datasets is as follows:

{
    "passage": null,
    "question": "设集合 $A=\\{x \\mid x \\geq 1\\}, B=\\{x \\mid-1<x<2\\}$, 则 $A \\cap B=$ ($\\quad$)\\\\\n",
    "options": ["(A)$\\{x \\mid x>-1\\}$", 
        "(B)$\\{x \\mid x \\geq 1\\}$", 
        "(C)$\\{x \\mid-1<x<1\\}$", 
        "(D)$\\{x \\mid 1 \\leq x<2\\}$"
        ],
    "label": "D",
    "answer": null
}

The passage field is available for gaokao-chinese, gaokao-english, both LogiQA tasks, all LSAT tasks, and the SAT tasks. The answer for multiple-choice tasks is saved in the label field; the answer for cloze tasks is saved in the answer field.
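For example, here is a minimal sketch of reading one of the post-processed files, assuming the JSON Lines layout shown above (one JSON object per line); the file name is illustrative:

import json

# Read one post-processed file; each line is a JSON object with the schema shown above.
with open("data/v1_1/gaokao-mathqa.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

first = examples[0]
print(first["passage"])   # None for tasks without a passage
print(first["question"])
print(first["options"])   # answer choices for MCQ tasks
print(first["label"])     # MCQ answer; cloze answers are stored in first["answer"]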

We provide the prompts for few-shot learning in the data/few_shot_prompts file.

Baseline Systems

We evaluate the performance of the baseline systems (gpt-3.5-turbo and GPT-4o) on AGIEval v1.1; the results are reported in the Leaderboard section below.

You can replicate the results by following the steps below:

  1. Update your OpenAI API credentials in the openai_api.py file.
  2. Run the run_prediction.py script to get the results.

Evaluation

You can run the post_process_and_evaluation.py file to get the evaluation results.

Leaderboard

We report the leaderboard on AGIEval v1.1. The leaderboard contains two subsets: AGIEval-en and AGIEval-zh. Both subset leaderboards contain only MCQ tasks. The leaderboards are as follows:

AGIEval-en few-shot

Model            Source   Average
GPT-4o           Link     71.4
Llama 3 400B+    Link     69.9
Llama 3 70B      Link     63
Mixtral 8x22B    Link     61.2
GPT-3.5-Turbo    Link     52.7
Llama 3 8B       Link     45.9
Gemma 7B         Link     44.9
Mistral 7B       Link     44

AGIEval-zh few-shot

Model            Source   Average
GPT-4o           Link     71.9
GPT-3.5-Turbo    Link     49.5

AGIEval-all few-shot

Model            Source   Average
GPT-4o           Link     69.0
GPT-3.5-Turbo    Link     47.2

AGIEval-en zero-shot

Model            Source   Average
GPT-4o           Link     65.2
GPT-3.5-Turbo    Link     54.1

AGIEval-zh zero-shot

Model            Source   Average
GPT-4o           Link     63.3
GPT-3.5-Turbo    Link     45.0

AGIEval-all zero-shot

(An asterisk indicates results reported on AGIEval v1.0.)

Model                    Source   Average
GPT-4o                   Link     62.3
InternLM2-20B*           Link     53.0
Qwen-14B*                Link     52.0
Phi-3-medium 14b*        Link     50.2
InternLM2-Chat-7B-SFT*   Link     49.0
GPT-3.5-Turbo            Link     46.0
Qwen-7B*                 Link     45.6
Mixtral 8x7b*            Link     45.2
Phi-3-small 7b*          Link     45.1
Gemma 7b*                Link     42.1
Llama-3-In*              Link     42.0
Phi-3-mini 3.8b*         Link     37.5
Mistral 7b*              Link     35.1
Phi-2 2.7b*              Link     29.8

Citation

If you use the AGIEval benchmark or the code in your research, please cite our paper:

@misc{zhong2023agieval,
      title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models}, 
      author={Wanjun Zhong and Ruixiang Cui and Yiduo Guo and Yaobo Liang and Shuai Lu and Yanlin Wang and Amin Saied and Weizhu Chen and Nan Duan},
      year={2023},
      eprint={2304.06364},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


agieval's Issues

parse_result error in gaokao-physics-zero-shot

The model output is:

"model_output": "选项 (B) $3.3\mathrm{MeV}$。", "parse_result": ["B", "M", "V"], "label": "B", "is_correct": false

Because gaokao-physics has multi-label answers, the parser takes every uppercase letter, which turns a correct answer into an error.

def parse_qa_multiple_answer(string, setting_name):
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)
    # captures every uppercase letter, optionally wrapped in parentheses
    pattern = r"\(*([A-Z])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []

Maybe we can restrict matches to a candidate answer list like ["A", "B", "C", "D", "E", "F"] to reduce the probability of this error?

def parse_qa_multiple_answer(string, setting_name):
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)
    # restrict matches to the plausible option letters A-F
    pattern = r"\(*([A-F])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []
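For illustration, the two patterns can be compared directly on the model output quoted above (a minimal, self-contained check):

import re

output = r"选项 (B) $3.3\mathrm{MeV}$。"  # the model output quoted in this issue

print(re.findall(r"\(*([A-Z])\)*", output))  # ['B', 'M', 'V']: also captures the unit "MeV"
print(re.findall(r"\(*([A-F])\)*", output))  # ['B']: restricted to plausible option letters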

Error in gaokao-chemistry dataset

The options are wrong in this data entry:
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L108

{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是(  )", "options": ["錀   第七周期", "镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}

It should be

{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是(  )", "options": ["(A)錀   第七周期", "(B)镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}

Unicode escape sequences in the json data

If you inspect aqua-rat.jsonl (and other datasets), there are unicode escape sequences throughout the data.

{"passage": null, "question": "A car is being driven, in a straight line and at a uniform speed, towards the base of a vertical tower. The top of the tower is observed from the car and, in the process, it takes 10 minutes for the angle of elevation to change from 45\u00b0 to 60\u00b0. After how much more time will this car reach the base of the tower?", "options": ["(A)5(\u221a3 + 1)", "(B)6(\u221a3 + \u221a2)", "(C)7(\u221a3 \u2013 1)", "(D)8(\u221a3 \u2013 2)", "(E)None of these"], 

This can be prevented by going back to the original script used to write out the data, passing ensure_ascii=False to json.dumps, and encoding the line as UTF-8 before writing, like so:

f.write((json.dumps(row, ensure_ascii=False) + '\n').encode('utf-8'))
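As a minimal sketch of the same fix (the row content and file name here are illustrative), opening the file in text mode as UTF-8 avoids the manual encoding step:

import json

rows = [{"question": "angle of elevation from 45° to 60°", "options": ["(A)5(√3 + 1)"]}]

with open("aqua-rat.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        # ensure_ascii=False keeps characters such as ° and √ readable instead of \uXXXX escapes
        f.write(json.dumps(row, ensure_ascii=False) + "\n")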

Several problems in logiqa-zh

There are several problems in logiqa-zh, e.g.

[ "A 没有党参", "B 没有首乌", "C 有白术", "D 没有白术" ]

and it should be

[ "(A)没有党参", "(B)没有首乌", "(C)有白术", "(D)没有白术" ]

About API_dic

How do I get the custum_api_name? Why am I getting these errors?

multi-thread n = 3
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1

Details about the data collection

Thanks for your awesome work! I notice that Gaokao is an important part of your dataset, but most Gaokao papers are not freely available online. Could you please explain how you collected the Gaokao data? Thanks in advance :)

SAT-Math corpus includes incomplete data

The sat-math corpus contains incomplete questions, which may make them impossible to answer. For example:

{"passage": "", "question": "Which of the following is equivalent to the expression above?" ...

Multiple choice in gaokao-mathqa dataset

There are about 7 questions with multiple correct answers in the gaokao-mathqa dataset, e.g.
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-mathqa.jsonl#L149

{"passage": null, "question": "函数 $f(x)=\\sin (2 x+\\varphi)(0<\\varphi<\\pi)$ 的图象以 $\\left(\\frac{2 \\pi}{3}, 0\\right)$ 中心对称, 则 ($\\quad$)\\\\\n", "options": ["(A)$y=f(x)$ 在 $\\left(0, \\frac{5 \\pi}{12}\\right)$ 单调递减", "(B)$y=f(x)$ 在 $\\left( -\\frac{\\pi}{12}, \\frac{11 \\pi}{12}\\right)$ 有 $2$ 个极值点", "(C)直线 $x= \\frac{7 \\pi}{6} $ 是一条对称轴", "(D)直线 $y= \\frac{\\sqrt{3}}{2} - x $ 是一条切线"], "label": "AD", "answer": null, "other": {"source": "2022年全国新高考II卷数学"}}

which doesn't match the multi-answer label format used in gaokao-physics, i.e. ["A", "D"].
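Such entries can be located with a quick scan (a hedged sketch; the path mirrors the repository layout referenced above):

import json

# Flag entries whose label contains more than one answer letter, e.g. "AD".
with open("data/v1/gaokao-mathqa.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        label = json.loads(line).get("label") or ""
        if isinstance(label, str) and len(label) > 1:
            print(line_no, label)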

gaokao-english dirty data

The gaokao-english dataset has a dirty entry.

The question is

The engineer Camillo Oliver was 40 years old when he started the company in 1908. At his factory in Ivrea, he designed and produced the first Italian typewriter. Today the company's head office s still in Ivrea, near Turin, but the company is much larger than it was in those days and there are offices all around the world.By 1930 there was a staff of 700 and the company turned out 13,000 machines a year. Some went to customers in Italy, but Olivetti exported more typewriters to other countries.Camillo's son, Adriano, started working for the company in 1924 and later he became the boss. He introduced a standard speed for the production line and he employed technology and design specialists. The company developed new and better typewriters and then calculators(计算机). In 1959 it produced the ELEA computer system. This was the first mainframe(主机)computer designed and made in Italy.After Adriano died in 1960, the company had a period of financial problems. Other companies, especially the Japanese, made faster progress in electronic technology than the Italian company. In 1978, Carlo de Benedetti became the new boss. Olivetti increased its marking and service networks and made agreements with other companies to design and produce more advanced office equipment. Soon it became one of the world's leading companies in information technology and communications. There are now five independent companies in the Olivetti group—one for personal computers, one for Systems and services, and two for telecommunications.

The options are:

['(A)It produced the best typewriter in the world.     ', '(B)It designed the world’s firs![](data:image/png;base64,...)t mainframe computer.', '(C)It exported more typewriters than other companies.', '(D)It has five independent companies with its head office in Ivrea.']

Option B contains a dirty string (an embedded base64 image).

There is a format error in the data, and an error may be raised when parsing the JSON. In addition, it is strongly recommended to clean the data to provide users with higher-quality evaluation data.

https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L75
{"passage": null, "question": "水溶液呈酸性的是( $)$", "options": ["(A)$\\mathrm{NaCl}$", "(B)$\\mathrm{NaHSO}_{4}$", "(C)HCOONa", "(D)$\mathrm{NaHCO}_{3}"], "label": "B", "answer": null, "other": {"source": "2020年浙江省高考化学【7月】"}}
Option D is missing a backslash \
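A minimal sketch for catching such lines before they break downstream parsing (path as referenced above; the helper name is illustrative):

import json

def report_unparseable_lines(path):
    # An unescaped backslash such as "\mathrm" is an invalid JSON escape and fails to parse.
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                print(f"{path}:{line_no}: {e}")

report_unparseable_lines("data/v1/gaokao-chemistry.jsonl")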

Will human evaluation results be public?

I am interested in the human evaluation results, but there are only 4 pictures. I would like to know whether the results (detailed or overall numeric results) will be made public.

Bug in Dataset Loader for Few-Shot Multiple Choice Questions

I've noticed that the current code uses the expression demo + question. However, I believe the correct expression should be demo + question_input. By using demo + question, the previously defined question_input is not being utilized and some multiple-choice questions may lack options in the prompt. Please consider updating the code to reflect this change for proper functionality. Thank you!

https://github.com/microsoft/AGIEval/blob/main/src/dataset_loader.py#L215
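A hedged sketch of the suggested change (variable names follow the issue; the surrounding loader code is assumed, not quoted from the repository):

def build_prompt(demo, question, question_input):
    # question_input is assumed to be the question text already combined with its options;
    # previously the loader used demo + question, which drops the options for some MCQ items.
    return demo + question_input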

Dirty data in the dataset.

Hi, when I parsed the dataset's options, I found abnormal behavior: the length of the options list differs from others in the same subcategory.

  1. In gaokao-chemistry.jsonl, line 190's options include invalid options (which are actually the question's analysis). The length of options is actually 7, not 4; after option "D", there is a fifth option.

  2. Missing options.

  • In sat-en-without-passage.jsonl, line 17's options miss option D which should be "They may increase in value as those same resources become rare on Earth." reference

  • In sat-en-without-passage.jsonl, line 57's options miss option D which should be "No, because the data do not indicate whether the honeybees had been infected with mites." while the label is "D". reference

  • In sat-en-without-passage.jsonl, line 98's options miss option D which should be "Published theories of scientists who developed earlier models of the Venus flytrap". You can refer to question 11 in reference.

The same goes for sat-en.jsonl in lines 17, 57, and 98.

  3. In jec-qa-kd.jsonl, line 212's label is empty, and the content is also dirty.
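A small consistency check along these lines (a sketch only; the paths and the expected option count of 4 are assumptions based on the cases above):

import json

def check_subset(path, expected_options=4):
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            item = json.loads(line)
            options = item.get("options") or []
            if len(options) != expected_options:
                print(f"{path}:{line_no} has {len(options)} options")
            if not item.get("label") and not item.get("answer"):
                print(f"{path}:{line_no} has an empty label")

check_subset("data/v1/sat-en-without-passage.jsonl")
check_subset("data/v1/jec-qa-kd.jsonl")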

The few-shot prompt format is different in the gaokao-geography dataset

The few-shot prompts in the gaokao-geography dataset look like this:

{'passage': None, 'question': '在某城市中心,一种创新型绿色建筑一垂直森林高层住宅落成面世。它是在建筑的垂直方向上,覆盖满本地乔木、灌木和草本等植物,为每层住户营造“空中花园”,形成具有森林效应的生态居住群落。与传统设计相比,“垂直森林”在居住空间设计上变化最大的地方是( )', 'options': ['A. 阳台\tB. 客厅\tC. 卧室\tD. 厨房'], 'label': 'A', 'answer': None, 'other': {'source': '2022年湖北省高考地理试题'}}

It should be

'options': ['(A)阳台', '(B)客厅', '(C)卧室', '(D)厨房']
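A sketch of one way to split such a tab-joined option string into the expected per-option format (the regex is illustrative):

import re

def split_options(joined):
    # 'A. 阳台\tB. 客厅\t...' -> ['(A)阳台', '(B)客厅', ...]
    parts = [p.strip() for p in joined.split("\t") if p.strip()]
    return [re.sub(r"^([A-F])[.\s]*", r"(\1)", p) for p in parts]

print(split_options("A. 阳台\tB. 客厅\tC. 卧室\tD. 厨房"))
# ['(A)阳台', '(B)客厅', '(C)卧室', '(D)厨房']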
