open-compass / vlmevalkit

Open-source evaluation toolkit for large vision-language models (LVLMs); supports GPT-4V, Gemini, QwenVLPlus, 50+ HF models, and 20+ benchmarks

Home Page: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard

License: Apache License 2.0

Python 95.56% Shell 0.12% Jupyter Notebook 4.32%
gpt-4v large-language-models llava multi-modal openai vqa llm openai-api qwen gpt

vlmevalkit's People

Contributors

bingwork, binwang777, cuiunbo, czczup, echo840, ezra-yu, fangxinyu-0913, feipengma6, fitzpchao, iyuge2, jize-w, junming-yang, kennymckormick, lightdxy, llllilllll, lzhgrla, naoto0804, pciresearch, quakumei, shuozhang2003, starcycle, tousenkaname, victorsanh, xiaoachen98, yjy123, youngfly11, yuanliuuuuuu, yuzhiyin


vlmevalkit's Issues

[Question] Support for additional benchmarks

Hi,
Thank you for your team's work.
I would like to ask:

  1. Will inference and evaluation on benchmarks such as GQA, OKVQA, and CMMMU be supported later?
  2. Will API-based multimodal evaluation be supported later, as in opencompass?
  3. The ChartQA and TextVQA scores do not match the numbers in the official papers; will this be optimized later?

Enhancing Multi-Choice Question Handling with Case-Sensitive Matching

It might be beneficial to implement exact, case-sensitive matching of uppercase option letters in multi-choice questions. This could help avoid lowercase words such as the article "a" in a response being mistakenly interpreted as an option letter or as an indication of multiple choices. Additionally, when multiple letters or multiple instances of "yes"/"no" appear, the system could prioritize the first word of the sentence to determine the intended response.
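
For illustration, a minimal sketch of the matching strategy described above (the helper name and exact heuristics are assumptions, not VLMEvalKit's actual code):

import re

# Hypothetical helper illustrating case-sensitive option matching; not VLMEvalKit's code.
def extract_option(prediction, choices=("A", "B", "C", "D")):
    """Return the option letter implied by a free-form prediction, or None."""
    # 1. Exact, case-sensitive match on standalone uppercase letters, so the
    #    article "a" in "a soccer ball" is never read as option A.
    letters = [m for m in re.findall(r"\b([A-D])\b", prediction) if m in choices]
    if len(letters) == 1:
        return letters[0]
    # 2. If several candidates (or none) appear, fall back to the first word.
    words = prediction.strip().split()
    if not words:
        return None
    first = words[0].rstrip(".,:;")
    if first in choices:
        return first
    if first.lower() in ("yes", "no"):
        return first.capitalize()
    return None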

I am also curious about whether the scores currently displayed on the OpenCompass leaderboard have been updated to reflect these latest modifications. Could you provide any information on this?

Evaluation of custom models and datasets

VLMEvalKit is a very convenient evaluation tool for MLLMs. I hope the authors can extend it with a framework that supports evaluating custom models and custom datasets, by defining a unified MLLM input-output interface and a conversion format for datasets.
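
As a rough sketch, a unified interface could mirror the generate() signature that VLMEvalKit already calls internally (prompt, image_path, dataset); the class names below are hypothetical:

from abc import ABC, abstractmethod

class CustomVLM(ABC):
    """Adapter a user implements once for their own model (hypothetical base class)."""

    @abstractmethod
    def generate(self, prompt, image_path, dataset=None):
        """Return the model's free-form answer for one (prompt, image) pair."""
        ...

class MyModel(CustomVLM):
    def generate(self, prompt, image_path, dataset=None):
        # Call your own inference stack here and return a plain string.
        return "A"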

Where is vlmeval/utils/data_util.py?

[screenshot]
Excellent work :)
But I can't find the data_util file in the latest version.
Also, I would like to ask when the InternLM2-7B model will be supported. Thanks!

(feature request) can we add load_dotenv() as a small quality of life improvement?

Hi OpenCompass VLMEvalKit team,

Thank you for your hard work on this project! I have a very minor feature request: could we add load_dotenv() so that users can run the toolkit without having to export their OPENAI_API_KEY environment variable in the terminal before each run?

This way, a user can add their key to a .env file once and it will be loaded automatically.
Happy to open a pull request if helpful.
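
A minimal sketch of the change, assuming the python-dotenv package (the exact call site inside VLMEvalKit is left open):

import os
from dotenv import load_dotenv

load_dotenv()  # reads a local .env file, e.g. a line OPENAI_API_KEY=sk-...
api_key = os.environ.get("OPENAI_API_KEY")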

Cannot reproduce llava v1.5 7b SEEDBench_IMG results

When I use the default settings to run llava v1.5 7b evaluation on the SEEDBench_IMG dataset, I get results like this:
[screenshot]

I checked the intermediate results, and the model seems to generate the options correctly:
[screenshot]

The default generation config should be

  • do_sample=True
  • temperature=0.2
  • max_new_tokens=512
  • top_p=None
  • num_beams=1
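
For reference, a minimal sketch of a generic Hugging Face generate() call with exactly these parameters (placeholder variables for an already-loaded model/tokenizer/inputs, not VLMEvalKit's internal call); note that with do_sample=True repeated runs are not deterministic, although that alone would not explain a huge gap:

output_ids = model.generate(
    **inputs,
    do_sample=True,       # sampling is on, so repeated runs can differ slightly
    temperature=0.2,
    max_new_tokens=512,
    top_p=None,
    num_beams=1,
)
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)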

And the officially reported results should be:
[screenshot]

It's really weird; I don't know why there is such a huge gap here. Hope to get some help. Thank you in advance!

There is a large gap between the validation accuracy measured by vlmevalkit and the numbers in the model papers

On the TextVQA dataset, the InstructBLIP-13B paper reports an accuracy of 50.7, and the Qwen-VL-Chat paper reports 63.75.
According to the accuracies measured by the official vlmevalkit, InstructBLIP-13B gets about 30 and Qwen-VL-Chat gets 10.5; what do you think is the problem?
Also, I tested InstructBLIP-13B on TextVQA myself and got an accuracy of 16.7; what went wrong? These are all prefetch results, without using GPT.

If I want to use LLaVA models in VLMEval, which versions should I install?

The error below appears when I install LLaVA v1.1.3 after having installed the latest VLMEval:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.                                                                            
jupyterlab 4.1.2 requires httpx>=0.25.0, but you have httpx 0.24.0 which is incompatible.                                    
xtuner 0.1.13 requires transformers!=4.34.1,!=4.35.0,!=4.35.1,!=4.35.2,>=4.32.1, but you have transformers 4.31.0 which is incompatible.                                                                                                                  
vlmeval 0.1.0 requires gradio==4.15.0, but you have gradio 3.35.2 which is incompatible.                                     
vlmeval 0.1.0 requires transformers==4.33.0, but you have transformers 4.31.0 which is incompatible. 

ModuleNotFoundError: No module named 'xtuner.parallel'

I met this problem when testing with the command:
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr 10.255.244.33 --master_port 8109 run.py --data LLaVABench --model llava-internlm2-20b --verbose

Traceback (most recent call last):
File "/code/src/VLMEvalKit/run.py", line 153, in
main()
File "/code/src/VLMEvalKit/run.py", line 83, in main
model = infer_data_job(
File "/code/src/VLMEvalKit/vlmeval/inference.py", line 210, in infer_data_job
model = infer_data(
File "/code/src/VLMEvalKit/vlmeval/inference.py", line 142, in infer_data
response = model.generate(prompt=struct['text'], image_path=struct['image'], dataset=dataset_name)
File "/code/src/VLMEvalKit/vlmeval/vlm/llava_xtuner.py", line 177, in generate
from xtuner.model.utils import prepare_inputs_labels_for_multimodal
File "/usr/local/lib/python3.10/dist-packages/xtuner/model/init.py", line 3, in
from .sft import SupervisedFinetune
File "/usr/local/lib/python3.10/dist-packages/xtuner/model/sft.py", line 16, in
from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
ModuleNotFoundError: No module named 'xtuner.parallel'
[2024-04-03 03:58:31,981] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2207729) of binary: /bin/python

ChartQA augmented & CMMMU

It would be nice if VLMEvalKit could support evaluation on the ChartQA augmented set and CMMMU, since it already supports the ChartQA human set and MMMU.

Unknown error when loading LLaVA model

when I run the command
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model llava_v1.5_13b --verbose
it shows
warnings.warn('Unknown error when loading LLaVA model.')
[screenshot]

How should I deal with this?

[Question] Scores do not match: llava_v1.5_7b results on the MMMU_DEV_VAL set differ from the official numbers

Environment:

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda-11.7',
 'GCC': 'gcc (GCC) 8.4.1 20200928 (Anolis 8.4.1-1.0.1)',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-PCIE-40GB',
 'MMEngine': '0.10.3',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 11.7, V11.7.99',
 'OpenCV': '4.9.0',
 'PyTorch': '1.13.1+cu117',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201402\n'
                              '  - Intel(R) Math Kernel Library Version '
                              '2020.0.0 Product Build 20191122 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.6.0 (Git Hash '
                              '52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX2\n'
                              '  - CUDA Runtime 11.7\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n'
                              '  - CuDNN 8.5\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.7, '
                              'CUDNN_VERSION=8.5.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -fabi-version=11 -Wno-deprecated '
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-DEDGE_PROFILER_USE_KINETO -O2 -fPIC '
                              '-Wno-narrowing -Wall -Wextra '
                              '-Werror=return-type -Werror=non-virtual-dtor '
                              '-Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '
                              '-Wno-unused-parameter -Wno-unused-function '
                              '-Wno-unused-result -Wno-strict-overflow '
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=redundant-decls '
                              '-Wno-error=old-style-cast '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, '
                              'USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]',
 'TorchVision': '0.14.1+cu117',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.2+4bd2256',
 'sys.platform': 'linux'}

Command: python run.py --data MMMU_DEV_VAL --model llava_v1.5_7b --verbose

Output:
[screenshot]

The results saved in llava_v1.5_7b_MMMU_DEV_VAL_acc.csv are as follows.

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.006666666666666667","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.2","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.02857142857142857"
"validation","0.014444444444444444","0.0","0.0","0.0","0.0","0.0","0.03333333333333333","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.03333333333333333","0.06666666666666667","0.0","0.13333333333333333","0.1","0.06666666666666667","0.0","0.0","0.0","0.0","0.03333333333333333","0.016666666666666666","0.02666666666666667","0.009523809523809525"

Why are the inference and evaluation results from this command so far from the official ones?

[Feature Request] To evaluate the MMMU test set, you need to convert the xlsx output to a json file

Hello,

When using VLMEvalKit with MMMU_TEST, an xlsx output file is generated, e.g.,

[screenshot]

This format cannot be accepted by the online MMMU EvalAI server, which requires a json format.

The following code converts the xlsx file to the required json format:

import pandas as pd
import json

# Read the xlsx file
def read_xlsx(file_path):
    # Use pandas to read the xlsx file
    df = pd.read_excel(file_path, engine='openpyxl')
    return df

# Convert to a single-dict json format
def convert_to_single_json(df):
    # Select the first and the 23rd columns
    selected_columns = df.iloc[:, [0, 22]]

    # Create an empty dict to store the result
    result_dict = {}

    # Iterate over every row
    for index, row in selected_columns.iterrows():
        # Use the first column as the key and the 23rd column as the value
        result_dict[row[0]] = row[1]

    # Dump the dict to a json-formatted string
    json_data = json.dumps(result_dict, indent=4)

    return json_data

# Main function
def main():
    # Path to the xlsx file
    file_path = 'hpt-air-mmmu_MMMU_TEST.xlsx'  # replace with your own xlsx file path

    # Read the xlsx file
    df = read_xlsx(file_path)

    # Convert to the single-dict json format
    json_data = convert_to_single_json(df)

    # Print the json data
    print(json_data)

    # Save the json data to a file
    with open('hpt-air-mmmu_MMMU_TEST.json', 'w') as f:
        f.write(json_data)

if __name__ == '__main__':
    main()

Would you like to add this to VLMEval?

Best,
StarCycle

Error Encountered in Multi-Node Evaluation Using Distributed Arguments

I encountered an issue while attempting to perform a multi-node evaluation using PyTorch's torchrun with specific distributed arguments. Below is the command I used, including the distributed arguments setup and the execution command:

DISTRIBUTED_ARGS=" \
    --nproc_per_node 3 \
    --nnodes 4 \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS run.py \
    --data MME MMBench_DEV_EN MMBench_DEV_CN CCBench SEEDBench_IMG MMMU_DEV_VAL MathVista_MINI HallusionBench LLaVABench \
    MMBench_TEST_EN MMBench_TEST_CN \
    --model llava

Upon execution, I received the following error message:

RUN - ERROR - No such file or directory: './llava/312_MME.pkl'
It seems like only the .pkl files on node 0 were saved correctly; only 012, 112, and 212 were saved.
Thank you in advance for your assistance!

CUDA out of memory when evaluating llava 34B

Hi, when evaluating llava 34B I have 8 GPUs available, but only one GPU is used for inference, which leads to CUDA out of memory. Does the llava 34B evaluation support multi-GPU inference? Thanks!
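
For what it's worth, a minimal sketch of sharding a large HF checkpoint across all visible GPUs with device_map="auto" (generic transformers usage with a placeholder path and loading class; not VLMEvalKit's loading code):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llava-34b-checkpoint",  # placeholder path; the right Auto class depends on the model
    torch_dtype=torch.float16,
    device_map="auto",               # splits layers across GPUs; requires `accelerate`
)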

Questions regarding the metrics for SEED bench

Hi,

Thanks for putting up the benchmark and releasing the eval tool. I'm running some experiments on both MMBench and SEED-Bench, and I have some confusion regarding the metrics in the SEED leaderboard; I would appreciate any input.

[screenshot]

Specifically, I have three questions.

  1. What does "heuristic matching" mean in ExactMatchRate?
  2. I'm not fully understanding the definition of MatchedAcc and ExactMatchAcc (and the difference between them). Would you mind explaining it with a concrete example?
  3. It is mentioned, for the official SEED leaderboard, that "For models with limited instruction following capabilities (including qwen_base, MiniGPT-4, InstructBLIP, flamingov2), the performance gap between generation-based evaluation and PPL-based evaluation is significant." I understand what PPL-based evaluation means (ranking options by perplexity), but what does generation-based evaluation mean here? (A rough sketch of the contrast is given below.)
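
For context, an illustrative sketch of the two modes with a generic HF causal LM (assumed helper names; this is not VLMEvalKit's or SEED-Bench's implementation):

import torch

def ppl_based_choice(model, tokenizer, question, options):
    """Score each option by the LM loss of 'question + option' and pick the lowest (lower loss ~ lower perplexity)."""
    losses = []
    for opt in options:
        enc = tokenizer(question + " " + opt, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    return options[losses.index(min(losses))]

def generation_based_choice(model, tokenizer, question):
    """Let the model free-generate an answer, then parse the option letter out of the text."""
    enc = tokenizer(question, return_tensors="pt")
    out_ids = model.generate(**enc, max_new_tokens=32)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)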

Thank you in advance for your help.

How is the final HallusionBench result computed?

The final HallusionBench results include aAcc, fAcc, and qAcc; how is the final accuracy computed? Is it the average of the three? Also, does the MME evaluation only cover the perception part? The leaderboard seems to include both perception and cognition.

NumPy compilation issue during installation

Hello. I created a new conda environment with Python 3.10 to install this project. According to the error message below, it seems like numpy was unable to find the right BLAS library.

c/umath -Inumpy/core/src/npysort -I/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10 -Ibuild/src.linux-x86_64-3.10/numpy/core/src/common -Ibuild/src.linux-x86_64-3.10/numpy/core/src/npymath -c'
gcc: numpy/core/src/multiarray/alloc.c
gcc: numpy/core/src/multiarray/buffer.c
gcc: numpy/core/src/multiarray/common.c
gcc: numpy/core/src/multiarray/array_assign_scalar.c
gcc: numpy/core/src/multiarray/descriptor.c
gcc: numpy/core/src/multiarray/conversion_utils.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/einsum.c
gcc: numpy/core/src/multiarray/datetime_strings.c
gcc: numpy/core/src/multiarray/arrayobject.c
gcc: numpy/core/src/multiarray/array_assign_array.c
gcc: numpy/core/src/multiarray/ctors.c
gcc: numpy/core/src/multiarray/convert.c
gcc: numpy/core/src/multiarray/calculation.c
gcc: numpy/core/src/multiarray/datetime_busday.c
gcc: numpy/core/src/multiarray/arrayfunction_override.c
gcc: numpy/core/src/multiarray/convert_datatype.c
gcc: numpy/core/src/multiarray/hashdescr.c
gcc: numpy/core/src/multiarray/datetime_busdaycal.c
gcc: numpy/core/src/multiarray/item_selection.c
gcc: numpy/core/src/multiarray/compiled_base.c
gcc: numpy/core/src/multiarray/dragon4.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/arraytypes.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/lowlevel_strided_loops.c
gcc: numpy/core/src/multiarray/multiarraymodule.c
gcc: numpy/core/src/multiarray/datetime.c
gcc: numpy/core/src/multiarray/dtype_transfer.c
gcc: numpy/core/src/multiarray/nditer_constr.c
gcc: numpy/core/src/multiarray/iterators.c
gcc: numpy/core/src/multiarray/refcount.c
gcc: numpy/core/src/multiarray/scalarapi.c
gcc: numpy/core/src/multiarray/nditer_pywrap.c
gcc: numpy/core/src/multiarray/sequence.c
gcc: numpy/core/src/multiarray/shape.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.c
numpy/core/src/multiarray/scalartypes.c.src: In function ‘float_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2967:27: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2967:12: error: too few arguments to function ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘cfloat_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2975:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2976 | PyArrayScalar_VAL(obj, C@name@).real);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2975:16: error: too few arguments to function ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2982 | PyArrayScalar_VAL(obj, C@name@).imag);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:16: error: too few arguments to function ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘longdouble_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2967:27: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2967:12: error: too few arguments to function ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘clongdouble_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2975:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2976 | PyArrayScalar_VAL(obj, C@name@).real);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2975:16: error: too few arguments to function ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2982 | PyArrayScalar_VAL(obj, C@name@).imag);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:16: error: too few arguments to function ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘half_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2997:27: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2997 | return _Py_HashDouble(npy_half_to_double(PyArrayScalar_VAL(obj, Half)));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| double
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2997:12: error: too few arguments to function ‘_Py_HashDouble’
2997 | return _Py_HashDouble(npy_half_to_double(PyArrayScalar_VAL(obj, Half)));
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘longdouble_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2968:1: warning: control reaches end of non-void function [-Wreturn-type]
2968 | }
| ^
numpy/core/src/multiarray/scalartypes.c.src: In function ‘float_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2968:1: warning: control reaches end of non-void function [-Wreturn-type]
2968 | }
| ^
numpy/core/src/multiarray/scalartypes.c.src: In function ‘half_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2998:1: warning: control reaches end of non-void function [-Wreturn-type]
2998 | }
| ^
gcc: numpy/core/src/multiarray/temp_elide.c
gcc: numpy/core/src/multiarray/vdot.c
gcc: numpy/core/src/umath/umathmodule.c
gcc: numpy/core/src/multiarray/typeinfo.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/loops.c
gcc: numpy/core/src/multiarray/usertypes.c
gcc: numpy/core/src/multiarray/number.c
gcc: numpy/core/src/umath/reduction.c
gcc: numpy/core/src/umath/ufunc_object.c
gcc: numpy/core/src/umath/ufunc_type_resolution.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/nditer_templ.c
gcc: numpy/core/src/multiarray/flagsobject.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/npymath/ieee754.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/npymath/npy_math_complex.c
gcc: numpy/core/src/multiarray/getset.c
gcc: numpy/core/src/umath/override.c
gcc: numpy/core/src/npymath/halffloat.c
gcc: numpy/core/src/multiarray/nditer_api.c
gcc: numpy/core/src/common/array_assign.c
gcc: numpy/core/src/common/ucsnarrow.c
gcc: numpy/core/src/npymath/npy_math.c
gcc: numpy/core/src/common/mem_overlap.c
gcc: numpy/core/src/common/ufunc_override.c
gcc: numpy/core/src/common/numpyos.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/common/npy_cpu_features.c
gcc: numpy/core/src/common/npy_longdouble.c
gcc: numpy/core/src/umath/extobj.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/scalarmath.c
gcc: numpy/core/src/multiarray/mapping.c
gcc: numpy/core/src/multiarray/methods.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/matmul.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/clip.c
error: Command "gcc -pthread -B /home/ubuntu/mambaforge-pypy3/envs/vlme/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/mambaforge-pypy3/envs/vlme/include -fPIC -O2 -isystem /home/ubuntu/mambaforge-pypy3/envs/vlme/include -fPIC -DNPY_INTERNAL_BUILD=1 -DHAVE_NPY_CONFIG_H=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -Ibuild/src.linux-x86_64-3.10/numpy/core/src/umath -Ibuild/src.linux-x86_64-3.10/numpy/core/src/npymath -Ibuild/src.linux-x86_64-3.10/numpy/core/src/common -Inumpy/core/include -Ibuild/src.linux-x86_64-3.10/numpy/core/include/numpy -Inumpy/core/src/common -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/src/npysort -I/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10 -Ibuild/src.linux-x86_64-3.10/numpy/core/src/common -Ibuild/src.linux-x86_64-3.10/numpy/core/src/npymath -c build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.c -o build/temp.linux-x86_64-3.10/build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o -MMD -MF build/temp.linux-x86_64-3.10/build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o.d" failed with exit status 1
[end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for numpy
  Failed to build numpy
  ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
  [end of output]

Detailed results of ScienceQA-IMG

Thanks for the great effort on this repo! I see you provide zero-shot results of several MLLMs on the ScienceQA-IMG dataset. Could you please add the detailed results (i.e., NAT, SOC, LAN) for the TEST and VAL partitions?

llava_v1.5_7b wrong results on Seedbench_IMG

Hi,

I checked the saved results of llava7b on the SEEDBench_IMG benchmark and found that, for some questions, llava7b gives the right prediction, but the evaluation framework marks the answer as wrong.

For example, for index = 1198 and 4307:
[screenshot]
Which object is likely found in the boy's hand? A: A book B: A soccer ball C: A calculator D: A pencil

[screenshot]
Where is the priest located in the image? A: In front of the stained glass window B: To the right of the bride and groom C: To the left of the bride and groom D: Behind the bride and groom

Can you help explain this? Thanks for your help!

A major problem with the multiple-choice evaluation

There is a major problem with the multiple-choice evaluation.
I am testing MMBench-dev-en here. I use the result file generated by the LLaVA framework, llava_MMBench_DEV_EN.xlsx, and the result of your test here is 0.68.

Because its predictions are all single words (just the option letters), I tried to match them myself, simply checking if item['prediction'] == item['answer'], and I found that the final result is 0.77. So either your test standard is seriously wrong, or I missed something; please let me know.
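
For reference, a minimal sketch of the naive exact-match scoring described above (it assumes 'prediction' and 'answer' columns in the xlsx and is not VLMEvalKit's matching logic, which also tries to salvage free-form answers):

import pandas as pd

df = pd.read_excel("llava_MMBench_DEV_EN.xlsx")
# Strict string equality between the prediction and the ground-truth option letter.
acc = (df["prediction"].astype(str).str.strip() == df["answer"].astype(str).str.strip()).mean()
print(f"Exact-match accuracy: {acc:.4f}")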

If you want the result file to test, I can send you the result file, or you can just have a check.

[screenshot]

How to calculate the average rank?

[screenshot]
In the leaderboard, LLaVA-InternLM2-20B (QLoRA) gets a higher average score than Monkey-Chat, but Monkey-Chat ranks higher. So how is the Avg. Rank shown on the leaderboard calculated?
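
For illustration, a small pandas sketch of one common way an average rank can be computed across benchmarks (made-up scores; the leaderboard's exact benchmark list and tie handling may differ):

import pandas as pd

scores = pd.DataFrame(
    {"MMBench": [75.1, 73.0], "SEEDBench_IMG": [70.2, 71.5]},   # made-up numbers
    index=["LLaVA-InternLM2-20B (QLoRA)", "Monkey-Chat"],
)
# Rank the models within each benchmark (1 = best), then average the ranks per model.
avg_rank = scores.rank(ascending=False).mean(axis=1)
print(avg_rank)

Averaging ranks this way shows how a model can have a higher mean score yet a worse average rank: what matters is how often it is beaten on individual benchmarks, not by how much.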

OSError: Incorrect path_or_model_id: 'xtuner/llava-internlm2-20b/projector'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr 10.255.xxx.xxx --master_port 8109 run.py --data LLaVABench --model llava-internlm2-20b --verbose

But I met the following problem below:
Traceback (most recent call last):
File "/train-xxx/code/xxx/src/test_scripts/test.py", line 6, in
projector = AutoModel.from_pretrained(projector_path,
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
resolved_config_file = cached_file(
File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 462, in cached_file
raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'xtuner/llava-internlm2-20b/projector'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

I also used the following script to test and got the same error on my side.

import os.path as osp
import torch
from transformers import AutoModel

projector_path = "xtuner/llava-internlm2-20b/projector"
projector = AutoModel.from_pretrained(projector_path,
                                      trust_remote_code=True,
                                      torch_dtype=torch.float16,
                                      device_map='cpu')
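
A possible workaround, assuming 'projector' is a subfolder of the xtuner/llava-internlm2-20b repo on the Hub: from_pretrained cannot resolve 'repo_id/subfolder' strings directly, but it does accept a separate subfolder argument (untested here for this particular checkpoint):

import torch
from transformers import AutoModel

projector = AutoModel.from_pretrained(
    "xtuner/llava-internlm2-20b",
    subfolder="projector",          # load only the projector weights/config from the repo subfolder
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="cpu",
)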

IndexError: index 1 is out of bounds for dimension 0 with size 1

cur_image_features = image_features[cur_image_idx]
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1

Is there any reason why the previous version could evaluate normally, but after updating from git I get this error?

And this error happens only during evaluation, at about 6%.

Are there samples with zero images in the MMBench eval set?

MMMU test set

It would be nice if VLMEvalKit could generate results for the MMMU test set.

MathVista evaluation question

What is the difference between the Prefetch rate and Acc for MathVista-mini? In my test, the Prefetch rate is 52.8 while Acc is only 44.1.

TypeError in parallel API calling

[screenshot]

I met this error when calling the OpenAI API in parallel (nproc=4). I am not sure whether this is an error on the VLMEvalKit developers' side... Could someone give me some tips for fixing it?
