cmmlu's People

Contributors

13416157913, allinllm, eastonyi, fanfanso, haonan-li, huajingyun, intellifts, isen-zhang, leoymr, nlp4whp, tonysy, xiaohuaishu, xyznlp, zheng-jay, ztomepic

cmmlu's Issues

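Would you consider evaluating models by comparing the probabilities of the four options?

Hello, and thank you for your work.

Existing work, including MMLU and C-Eval, generally tests a model by comparing the generation probabilities of the four options. This approach is better suited to base models that are not accessed through an API. Would you consider evaluating the existing models this way?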
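As a rough illustration of the probability-comparison approach described above, the sketch below picks the option with the highest probability among the four candidates. The logits are made-up numbers; in a real run they would presumably be read from the model's next-token distribution at the answer position, and the function name is hypothetical.

```python
import math


def pick_option(option_logits):
    """Return (best_option, probs) given a mapping of option letter -> logit.

    The softmax is taken over the four options only, so the probabilities
    are relative to each other rather than to the full vocabulary.
    """
    max_logit = max(option_logits.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {k: math.exp(v - max_logit) for k, v in option_logits.items()}
    total = sum(exps.values())
    probs = {k: e / total for k, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Illustrative logits only; not taken from any real model.
example_logits = {"A": 1.2, "B": 0.3, "C": 2.5, "D": -0.7}
best, probs = pick_option(example_logits)
print(best, probs)
```

Because only the ranking of the four options matters, this needs access to logits or token probabilities, which is why it fits locally hosted base models better than text-only APIs.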

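How the category scores and the overall average are calculated

Judging from the leaderboard, CMMLU scores exist at three levels: the overall average, the score for each category (for example, Humanities), and the score for each individual subject (for example, food_science.csv). How are the overall average and the per-category scores calculated? Do they follow the logic below?

Calculation logic:
First compute the accuracy (acc) of each subject over its samples. Then (1) a category's score = the mean of the acc values of all subjects in that category; (2) the overall average = the mean of the acc values of all subjects.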
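The calculation logic proposed above can be sketched as follows. This only illustrates the logic the question asks about, not the repository's confirmed behavior; the subject-to-category mapping, the category names, and the per-question results are all made up for the example.

```python
from statistics import mean


def subject_accuracy(results):
    """Sample-level accuracy for one subject (results is a list of booleans)."""
    return sum(results) / len(results)


def aggregate(per_subject_results, subject_to_category):
    """Return (subject_acc, category_acc, overall) using unweighted subject means."""
    subject_acc = {s: subject_accuracy(r) for s, r in per_subject_results.items()}

    # Group subject accuracies by category.
    by_category = {}
    for subject, acc in subject_acc.items():
        by_category.setdefault(subject_to_category[subject], []).append(acc)

    category_acc = {c: mean(accs) for c, accs in by_category.items()}
    # Overall average = mean over subjects, not over samples or categories.
    overall = mean(list(subject_acc.values()))
    return subject_acc, category_acc, overall


# Hypothetical data: three subjects with different numbers of questions.
per_subject_results = {
    "food_science": [True, True, False, True],
    "agronomy": [True, False],
    "subject_x": [True, True, True, False, False],
}
subject_to_category = {
    "food_science": "category_1",
    "agronomy": "category_1",
    "subject_x": "category_2",
}
subject_acc, category_acc, overall = aggregate(per_subject_results, subject_to_category)
print(subject_acc, category_acc, overall)
```

Under this logic every subject counts equally regardless of its size, so the overall score generally differs from the accuracy computed over all samples pooled together.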

CMMLU testing

Hello, the test API for Mengzi-7B has been sent to you by email. Could you please test it when you have time? Thank you!

Support Qwen-7b

Thanks for your work!
I want to compare ChatGLM and Qwen. Do you plan to support it?

What is going on with the dataset?

from datasets import load_dataset
cmmlu=load_dataset(r"haonan-li/cmmlu", 'agronomy')
print(cmmlu['test'][0])

{'_data_files': [{'filename': 'data-00000-of-00001.arrow'}], '_fingerprint': '8fd80049c30cf62f', '_format_columns': None, '_format_kwargs': {}, '_format_type': None, '_output_all_columns': False, '_split': 'test'}

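[Possible BUG in the few-shot setting] When determining the model's chosen answer, for many models the code actually compares the probabilities of the four tokens ['_A', '_B', '_C', '_D'] rather than those of ['A', 'B', 'C', 'D']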

1. In src/mp_utils.py, the line choice_ids = [tokenizer.encode(choice)[-1] for choice in choices] produces choice_ids whose tokens, for many tokenizers, may not be ['A', 'B', 'C', 'D'].

  • Result with the llama2-13B tokenizer:
>>> choice_ids = [tokenizer.encode(choice)[-1] for choice in choices]
>>> print(choice_ids)
[319, 350, 315, 360]

>>> tokenizer.convert_ids_to_tokens(choice_ids)
['▁A', '▁B', '▁C', '▁D']

>>> tokenizer.convert_tokens_to_ids(['A', 'B', 'C', 'D'])
[29909, 29933, 29907, 29928]
  • Result with the Baichuan-13B tokenizer:
>>> choice_ids = [tokenizer.encode(choice)[-1] for choice in choices]
>>> print(choice_ids)
[703, 731, 702, 743]

>>> tokenizer.convert_ids_to_tokens(choice_ids)
['▁A', '▁B', '▁C', '▁D']

>>> tokenizer.convert_tokens_to_ids(['A', 'B', 'C', 'D'])
[31132, 31139, 31133, 31140]

This implementation may be a problem in the few-shot setting.

def format_example(df, idx, subject, include_answer=True, cot=False):
    ...
    # Chain-of-thought
    if cot:
        prompt += "\n逐步分析并给出答案选项。"
    else:
        prompt += "\n答案是:"

    if include_answer:
        prompt += "{}\n\n".format(df.iloc[idx, k + 1])

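The few-shot prompt generated by this code, taking question 12 of agronomy as an example (the prompt is reproduced verbatim; 答案是: means "The answer is:"):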

以下是关于农学的单项选择题,请直接给出正确答案的选项。

题目:肉牛屠宰后,胴体的哪个部位肉质较好
A. 胸
B. 腹
C. 大腿
D. 小腿
答案是:C

...

题目:羊胴体中,肉质较好的部位是
A. 胸下肉
B. 肩胛肉
C. 后腿肉
D. 小腿肉
答案是:C

题目:某周的日均温分别为9°C、9°C、11°C、12°C、13°C、15°C、16°C,则对喜温作物(生物学零度为10°C)来说,这周的活动的积温为
A. 67°C
B. 18°C
C. 85°C
D. 17°C
答案是:

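Note that in both the example questions and the final question there is no space after 答案是:, so the model output we expect should be one of the four tokens ['A', 'B', 'C', 'D'], not one of the four tokens ['_A', '_B', '_C', '_D'].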

2. Note that the implementation itself is consistent with the official MMLU code:

>>> flan_tokenizer("A").input_ids
[71, 1]

>>> flan_tokenizer.convert_ids_to_tokens([71, 1])
['▁A', '</s>']

However, when MMLU constructs its few-shot examples, every answer is preceded by a space:

def format_example(df, idx, include_answer=True):
    ...
    prompt += "\nAnswer:"
    if include_answer:
        prompt += " {}\n\n".format(df.iloc[idx, k + 1])
    return prompt

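The expected model output there is therefore one of the four tokens ['_A', '_B', '_C', '_D'], so MMLU has no problem.

In the zero-shot setting, English punctuation such as : is usually followed by a space, so the MMLU questions are fine as well. Chinese text, however, uses the full-width colon :, which is normally not followed by a space. The zero-shot setting may therefore also be slightly affected.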

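3. For robust or inherently strong LLMs, the probability rankings over ['_A', '_B', '_C', '_D'] and over ['A', 'B', 'C', 'D'] are probably largely the same, so the impact is relatively small. For weaker LLMs there may be some effect.
I only tested Baichuan-13B. The scores before and after the change are: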

Subject          Ranking over ['_A', '_B', '_C', '_D'] (current repo)   Ranking over ['A', 'B', 'C', 'D']
STEM             42.38                                                   41.96
Humanities       61.61                                                   60.29
Social Science   60.44                                                   59.32
Other            59.26                                                   58.91
China specific   56.62                                                   56.3
Avg              55.82                                                   55.01

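Possible Solution 1

The simplest fix is to do what MMLU does and add a space before the option letter when constructing the example answers. This is not very explicit, though: people who are unfamiliar with the tokenizer's internals may overlook it when building their own prompts.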
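A minimal sketch of this first option, assuming a simplified format_example that takes the question text directly instead of the repository's DataFrame-based signature. The function arguments are hypothetical; only the prompt strings and the added space before the answer follow the snippets quoted in this issue.

```python
def format_example(question, options, answer=None):
    """Build one prompt block; pass answer=None for the final, unanswered question."""
    prompt = "题目:" + question
    for letter, text in zip("ABCD", options):
        prompt += "\n{}. {}".format(letter, text)
    prompt += "\n答案是:"
    if answer is not None:
        # Leading space before the option letter, mirroring MMLU's " {}\n\n".
        prompt += " {}\n\n".format(answer)
    return prompt


shot = format_example("羊胴体中,肉质较好的部位是", ["胸下肉", "肩胛肉", "后腿肉", "小腿肉"], answer="C")
query = format_example("肉牛屠宰后,胴体的哪个部位肉质较好", ["胸", "腹", "大腿", "小腿"])
print(shot + query)
```

With the space in the few-shot answers, the token the model is expected to emit after the final 答案是: would be the space-prefixed variant, which is what tokenizer.encode(choice)[-1] returns for the tokenizers shown above.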

Possible Solution 2

Use tokenizer.convert_tokens_to_ids() instead of tokenizer.encode() or tokenizer(), and add an explicit comment noting that the expected token is one of ['A', 'B', 'C', 'D'] and not any other combined token.
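A sketch of what this second option could look like. The helper name option_token_ids is hypothetical, and ToyTokenizer is only a stand-in so the snippet runs without downloading a model; its ids are copied from the llama2-13B output shown earlier in this issue. With a real Hugging Face tokenizer, the same helper would be called on that tokenizer object instead.

```python
def option_token_ids(tokenizer, choices=("A", "B", "C", "D")):
    """Look up the ids of the bare option tokens.

    We explicitly want the tokens 'A', 'B', 'C', 'D' themselves, not the
    space-prefixed variants that encode() may return.
    """
    ids = tokenizer.convert_tokens_to_ids(list(choices))
    unk = getattr(tokenizer, "unk_token_id", None)
    # Fail loudly if an option letter is not a single token in the vocabulary.
    missing = [c for c, i in zip(choices, ids) if i is None or i == unk]
    if missing:
        raise ValueError("No single-token id for options: {}".format(missing))
    return ids


class ToyTokenizer:
    """Hypothetical stand-in exposing the two attributes the helper relies on."""

    unk_token_id = 0
    vocab = {
        "▁A": 319, "▁B": 350, "▁C": 315, "▁D": 360,
        "A": 29909, "B": 29933, "C": 29907, "D": 29928,
    }

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.unk_token_id) for t in tokens]


ids = option_token_ids(ToyTokenizer())
print(ids)
```

The explicit check for unknown tokens is an addition for safety: it guards against vocabularies in which a bare option letter is not available as a single token.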

Anyway, thank you for your work and dedication!

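Input/output format of the external API, and an email address

Hello, I have the following questions about the external API for private models:

  1. Is the API a model-native interface, that is, arbitrary text in and text out?
  2. Do I need to wrap the few-shot part inside the API?
  3. If the output contains anything beyond A, B, C, or D, is removing the extra part the job of my API, or do you handle it during evaluation? For example, some scripts in src only take the confidence of A/B/C/D as the first token and ignore any extra tokens that follow. My API, however, cannot return confidences.
  4. Could you provide an email address so that I can send you the script for calling the external API?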
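For question 3, one way the extra text could be stripped on either side is sketched below. This is only an illustration of the post-processing being asked about, not the maintainers' evaluation procedure; the function name and the matching rule are assumptions.

```python
import re

# Match a standalone A/B/C/D that is not part of a longer Latin-letter word.
_OPTION_RE = re.compile(r"(?<![A-Za-z])[ABCD](?![A-Za-z])")


def extract_choice(text):
    """Return the first standalone option letter in text, or None if there is none."""
    match = _OPTION_RE.search(text)
    return match.group(0) if match else None


print(extract_choice("答案是:C。后腿肉的肉质较好。"))
print(extract_choice("The Answer is D"))
print(extract_choice("无法确定"))
```

Taking the first standalone letter is a simple heuristic; responses that mention several options before settling on one would need a stricter rule.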

Thank you!
