open-compass / vlmevalkit

Open-source evaluation toolkit for large vision-language models (LVLMs); supports GPT-4V, Gemini, QwenVLPlus, 50+ HF models, and 20+ benchmarks

Home Page: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard

License: Apache License 2.0

Python 95.56% Shell 0.12% Jupyter Notebook 4.32%
gpt-4v large-language-models llava multi-modal openai vqa llm openai-api qwen gpt

vlmevalkit's People

Contributors

bingwork, binwang777, cuiunbo, czczup, echo840, ezra-yu, fangxinyu-0913, feipengma6, fitzpchao, iyuge2, jize-w, junming-yang, kennymckormick, lightdxy, llllilllll, lzhgrla, naoto0804, pciresearch, quakumei, shuozhang2003, starcycle, tousenkaname, victorsanh, xiaoachen98, yjy123, youngfly11, yuanliuuuuuu, yuzhiyin


vlmevalkit's Issues

[Question] Support for additional benchmarks

Hi,
Thank you for your team's work.
I would like to ask:

  1. Will inference and evaluation on benchmarks such as GQA, OKVQA, and CMMMU be supported later?
  2. Will API-based multimodal evaluation be supported later, as in opencompass?
  3. The ChartQA and TextVQA scores do not match the numbers in the official papers; will this be optimized later?

Enhancing Multi-Choice Question Handling with Case-Sensitive Matching

It might be beneficial to implement exact, case-sensitive matching of uppercase option letters in multi-choice questions. This could help avoid lowercase words such as the article "a" in a response being mistakenly interpreted as an option letter or as an indication of multiple choices. Additionally, when multiple letters or multiple instances of "yes"/"no" appear, the system could prioritize the first word of the sentence to determine the intended response.
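
For illustration, a minimal sketch of the matching strategy described above (the helper name and exact heuristics are assumptions, not VLMEvalKit's actual code):

import re

# Hypothetical helper illustrating case-sensitive option matching; not VLMEvalKit's code.
def extract_option(prediction, choices=("A", "B", "C", "D")):
    """Return the option letter implied by a free-form prediction, or None."""
    # 1. Exact, case-sensitive match on standalone uppercase letters, so the
    #    article "a" in "a soccer ball" is never read as option A.
    letters = [m for m in re.findall(r"\b([A-D])\b", prediction) if m in choices]
    if len(letters) == 1:
        return letters[0]
    # 2. If several candidates (or none) appear, fall back to the first word.
    words = prediction.strip().split()
    if not words:
        return None
    first = words[0].rstrip(".,:;")
    if first in choices:
        return first
    if first.lower() in ("yes", "no"):
        return first.capitalize()
    return None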

I am also curious about whether the scores currently displayed on the OpenCompass leaderboard have been updated to reflect these latest modifications. Could you provide any information on this?

Evaluation of custom models and datasets

VLMEvalKit is a very convenient evaluation tool for MLLMs. I hope the authors can extend it with a framework that supports evaluating custom models and custom datasets, by defining a unified MLLM input-output interface and a conversion format for datasets.
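
As a rough sketch, a unified interface could mirror the generate() signature that VLMEvalKit already calls internally (prompt, image_path, dataset); the class names below are hypothetical:

from abc import ABC, abstractmethod

class CustomVLM(ABC):
    """Adapter a user implements once for their own model (hypothetical base class)."""

    @abstractmethod
    def generate(self, prompt, image_path, dataset=None):
        """Return the model's free-form answer for one (prompt, image) pair."""
        ...

class MyModel(CustomVLM):
    def generate(self, prompt, image_path, dataset=None):
        # Call your own inference stack here and return a plain string.
        return "A"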

Where is vlmeval/utils/data_util.py?

[screenshot]
Excellent work :)
But I can't find the data_util file in the latest version.
Also, I would like to ask when the InternLM2-7B model will be supported. Thanks!

(feature request) can we add load_dotenv() as a small quality of life improvement?

Hi OpenCompass VLMEvalKit team,

Thank you for your hard work on this project! I have a very minor feature request: could we add load_dotenv() so that users can run the toolkit without having to export their OPENAI_API_KEY environment variable in the terminal before each run?

This way, a user can add their key to a .env file once and it will be loaded automatically.
Happy to open a pull request if helpful.
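
A minimal sketch of the change, assuming the python-dotenv package (the exact call site inside VLMEvalKit is left open):

import os
from dotenv import load_dotenv

load_dotenv()  # reads a local .env file, e.g. a line OPENAI_API_KEY=sk-...
api_key = os.environ.get("OPENAI_API_KEY")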

Cannot reproduce llava v1.5 7b SEEDBench_IMG results

When I use the default settings to run llava v1.5 7b evaluation on the SEEDBench_IMG dataset, I get results like this:
[screenshot]

I checked the intermediate results, and the model seems to generate the options correctly:
[screenshot]

The default generation config should be

  • do_sample=True
  • temperature=0.2
  • max_new_tokens=512
  • top_p=None
  • num_beams=1
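
For reference, a minimal sketch of a generic Hugging Face generate() call with exactly these parameters (placeholder variables for an already-loaded model/tokenizer/inputs, not VLMEvalKit's internal call); note that with do_sample=True repeated runs are not deterministic, although that alone would not explain a huge gap:

output_ids = model.generate(
    **inputs,
    do_sample=True,       # sampling is on, so repeated runs can differ slightly
    temperature=0.2,
    max_new_tokens=512,
    top_p=None,
    num_beams=1,
)
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)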

And the officially reported results should be:
[screenshot]

It's really weird; I don't know why there is such a huge gap here. Hope to get some help. Thank you in advance!

There is a large gap between the validation accuracy measured by vlmevalkit and the numbers in the model papers

On the TextVQA dataset, the InstructBLIP-13B paper reports an accuracy of 50.7, and the Qwen-VL-Chat paper reports 63.75.
According to the accuracies measured by the official vlmevalkit, InstructBLIP-13B gets about 30 and Qwen-VL-Chat gets 10.5; what do you think is the problem?
Also, I tested InstructBLIP-13B on TextVQA myself and got an accuracy of 16.7; what went wrong? These are all prefetch results, without using GPT.

If I want to use LLaVA models in VLMEval, which versions should I install?

The error below appears when I install LLaVA v1.1.3 after having installed the latest VLMEval:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.                                                                            
jupyterlab 4.1.2 requires httpx>=0.25.0, but you have httpx 0.24.0 which is incompatible.                                    
xtuner 0.1.13 requires transformers!=4.34.1,!=4.35.0,!=4.35.1,!=4.35.2,>=4.32.1, but you have transformers 4.31.0 which is incompatible.                                                                                                                  
vlmeval 0.1.0 requires gradio==4.15.0, but you have gradio 3.35.2 which is incompatible.                                     
vlmeval 0.1.0 requires transformers==4.33.0, but you have transformers 4.31.0 which is incompatible. 

ModuleNotFoundError: No module named 'xtuner.parallel'

I met this problem when testing with the command:
torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr 10.255.244.33 --master_port 8109 run.py --data LLaVABench --model llava-internlm2-20b --verbose

Traceback (most recent call last):
File "/code/src/VLMEvalKit/run.py", line 153, in
main()
File "/code/src/VLMEvalKit/run.py", line 83, in main
model = infer_data_job(
File "/code/src/VLMEvalKit/vlmeval/inference.py", line 210, in infer_data_job
model = infer_data(
File "/code/src/VLMEvalKit/vlmeval/inference.py", line 142, in infer_data
response = model.generate(prompt=struct['text'], image_path=struct['image'], dataset=dataset_name)
File "/code/src/VLMEvalKit/vlmeval/vlm/llava_xtuner.py", line 177, in generate
from xtuner.model.utils import prepare_inputs_labels_for_multimodal
File "/usr/local/lib/python3.10/dist-packages/xtuner/model/init.py", line 3, in
from .sft import SupervisedFinetune
File "/usr/local/lib/python3.10/dist-packages/xtuner/model/sft.py", line 16, in
from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
ModuleNotFoundError: No module named 'xtuner.parallel'
[2024-04-03 03:58:31,981] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2207729) of binary: /bin/python

ChartQA augmented & CMMMU

It would be nice if VLMEvalKit could support evaluation on the ChartQA augmented set and CMMMU, since it already supports the ChartQA human set and MMMU.

Unknown error when loading LLaVA model

when I run the command
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model llava_v1.5_13b --verbose
it shows
warnings.warn('Unknown error when loading LLaVA model.')
[screenshot]

How should I deal with this?

[Question] Scores do not match: llava_v1.5_7b results on the MMMU_DEV_VAL set differ from the official numbers

Environment:

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda-11.7',
 'GCC': 'gcc (GCC) 8.4.1 20200928 (Anolis 8.4.1-1.0.1)',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A100-PCIE-40GB',
 'MMEngine': '0.10.3',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 11.7, V11.7.99',
 'OpenCV': '4.9.0',
 'PyTorch': '1.13.1+cu117',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201402\n'
                              '  - Intel(R) Math Kernel Library Version '
                              '2020.0.0 Product Build 20191122 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.6.0 (Git Hash '
                              '52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX2\n'
                              '  - CUDA Runtime 11.7\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n'
                              '  - CuDNN 8.5\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=11.7, '
                              'CUDNN_VERSION=8.5.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -fabi-version=11 -Wno-deprecated '
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-DEDGE_PROFILER_USE_KINETO -O2 -fPIC '
                              '-Wno-narrowing -Wall -Wextra '
                              '-Werror=return-type -Werror=non-virtual-dtor '
                              '-Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '
                              '-Wno-unused-parameter -Wno-unused-function '
                              '-Wno-unused-result -Wno-strict-overflow '
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=redundant-decls '
                              '-Wno-error=old-style-cast '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, '
                              'USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]',
 'TorchVision': '0.14.1+cu117',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.2+4bd2256',
 'sys.platform': 'linux'}

Command: python run.py --data MMMU_DEV_VAL --model llava_v1.5_7b --verbose

Output:
[screenshot]

The results saved in llava_v1.5_7b_MMMU_DEV_VAL_acc.csv are as follows.

"split","Overall","Accounting","Agriculture","Architecture_and_Engineering","Art","Art_Theory","Basic_Medical_Science","Biology","Chemistry","Clinical_Medicine","Computer_Science","Design","Diagnostics_and_Laboratory_Medicine","Economics","Electronics","Energy_and_Power","Finance","Geography","History","Literature","Manage","Marketing","Materials","Math","Mechanical_Engineering","Music","Pharmacy","Physics","Psychology","Public_Health","Sociology","Art & Design","Business","Health & Medicine","Humanities & Social Science","Science","Tech & Engineering"
"dev","0.006666666666666667","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.2","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.02857142857142857"
"validation","0.014444444444444444","0.0","0.0","0.0","0.0","0.0","0.03333333333333333","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.0","0.03333333333333333","0.06666666666666667","0.0","0.13333333333333333","0.1","0.06666666666666667","0.0","0.0","0.0","0.0","0.03333333333333333","0.016666666666666666","0.02666666666666667","0.009523809523809525"

Why are the inference and evaluation results from this command so far from the official ones?

[Feature Request] To evaluate the MMMU test set, you need to convert the xlsx output to a json file

Hello,

When using VLMEvalKit with MMMU_TEST, an xlsx output file is generated, e.g.,

[screenshot]

This format cannot be accepted by the online MMMU EvalAI server, which requires a json format.

The following code converts the xlsx file to the required json format:

import pandas as pd
import json

# Read the xlsx file
def read_xlsx(file_path):
    # Use pandas to read the xlsx file
    df = pd.read_excel(file_path, engine='openpyxl')
    return df

# Convert to a single-dict json format
def convert_to_single_json(df):
    # Select the first and the 23rd columns
    selected_columns = df.iloc[:, [0, 22]]

    # Create an empty dict to store the result
    result_dict = {}

    # Iterate over every row
    for index, row in selected_columns.iterrows():
        # Use the first column as the key and the 23rd column as the value
        result_dict[row[0]] = row[1]

    # Dump the dict to a json-formatted string
    json_data = json.dumps(result_dict, indent=4)

    return json_data

# Main function
def main():
    # Path to the xlsx file
    file_path = 'hpt-air-mmmu_MMMU_TEST.xlsx'  # replace with your own xlsx file path

    # Read the xlsx file
    df = read_xlsx(file_path)

    # Convert to the single-dict json format
    json_data = convert_to_single_json(df)

    # Print the json data
    print(json_data)

    # Save the json data to a file
    with open('hpt-air-mmmu_MMMU_TEST.json', 'w') as f:
        f.write(json_data)

if __name__ == '__main__':
    main()

Would you like to add this to VLMEval?

Best,
StarCycle

Error Encountered in Multi-Node Evaluation Using Distributed Arguments

I encountered an issue while attempting to perform a multi-node evaluation using PyTorch's torchrun with specific distributed arguments. Below is the command I used, including the distributed arguments setup and the execution command:

DISTRIBUTED_ARGS=" \
    --nproc_per_node 3 \
    --nnodes 4 \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS run.py \
    --data MME MMBench_DEV_EN MMBench_DEV_CN CCBench SEEDBench_IMG MMMU_DEV_VAL MathVista_MINI HallusionBench LLaVABench \
    MMBench_TEST_EN MMBench_TEST_CN \
    --model llava

Upon execution, I received the following error message:

RUN - ERROR - No such file or directory: './llava/312_MME.pkl'
It seems like only the .pkl files on node 0 were saved correctly; only 012, 112, and 212 were saved.
Thank you in advance for your assistance!

CUDA out of memory when evaluating llava 34B

Hi, when evaluating llava 34B I have 8 GPUs available, but only one GPU is used for inference, which leads to CUDA out of memory. Does the llava 34B evaluation support multi-GPU inference? Thanks!
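
For what it's worth, a minimal sketch of sharding a large HF checkpoint across all visible GPUs with device_map="auto" (generic transformers usage with a placeholder path and loading class; not VLMEvalKit's loading code):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llava-34b-checkpoint",  # placeholder path; the right Auto class depends on the model
    torch_dtype=torch.float16,
    device_map="auto",               # splits layers across GPUs; requires `accelerate`
)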

Questions regarding the metrics for SEED bench

Hi,

Thanks for putting up the benchmark and releasing the eval tool. I'm running some experiments on both MMBench and SEED-Bench, and I have some confusion regarding the metrics in the SEED leaderboard; I would appreciate any input.

[screenshot]

Specifically, I have three questions.

  1. What does "heuristic matching" mean in ExactMatchRate?
  2. I'm not fully understanding the definition of MatchedAcc and ExactMatchAcc (and the difference between them). Would you mind explaining it with a concrete example?
  3. It is mentioned, for the official SEED leaderboard, that "For models with limited instruction following capabilities (including qwen_base, MiniGPT-4, InstructBLIP, flamingov2), the performance gap between generation-based evaluation and PPL-based evaluation is significant." I understand what PPL-based evaluation means (ranking options by perplexity), but what does generation-based evaluation mean here? (A rough sketch of the contrast is given below.)
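
For context, an illustrative sketch of the two modes with a generic HF causal LM (assumed helper names; this is not VLMEvalKit's or SEED-Bench's implementation):

import torch

def ppl_based_choice(model, tokenizer, question, options):
    """Score each option by the LM loss of 'question + option' and pick the lowest (lower loss ~ lower perplexity)."""
    losses = []
    for opt in options:
        enc = tokenizer(question + " " + opt, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    return options[losses.index(min(losses))]

def generation_based_choice(model, tokenizer, question):
    """Let the model free-generate an answer, then parse the option letter out of the text."""
    enc = tokenizer(question, return_tensors="pt")
    out_ids = model.generate(**enc, max_new_tokens=32)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)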

Thank you in advance for your help.

How is the final HallusionBench result computed?

The final HallusionBench results include aAcc, fAcc, and qAcc; how is the final accuracy computed? Is it the average of the three? Also, does the MME evaluation only cover the perception part? The leaderboard seems to include both perception and cognition.

NumPy compilation issue during installation

Hello. I created a new conda environment with Python 3.10 to install this project. According to the error message below, it seems like numpy was unable to find the right BLAS library.

c/umath -Inumpy/core/src/npysort -I/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10 -Ibuild/src.linux-x86_64-3.10/numpy/core/src/common -Ibuild/src.linux-x86_64-3.10/numpy/core/src/npymath -c'
gcc: numpy/core/src/multiarray/alloc.c
gcc: numpy/core/src/multiarray/buffer.c
gcc: numpy/core/src/multiarray/common.c
gcc: numpy/core/src/multiarray/array_assign_scalar.c
gcc: numpy/core/src/multiarray/descriptor.c
gcc: numpy/core/src/multiarray/conversion_utils.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/einsum.c
gcc: numpy/core/src/multiarray/datetime_strings.c
gcc: numpy/core/src/multiarray/arrayobject.c
gcc: numpy/core/src/multiarray/array_assign_array.c
gcc: numpy/core/src/multiarray/ctors.c
gcc: numpy/core/src/multiarray/convert.c
gcc: numpy/core/src/multiarray/calculation.c
gcc: numpy/core/src/multiarray/datetime_busday.c
gcc: numpy/core/src/multiarray/arrayfunction_override.c
gcc: numpy/core/src/multiarray/convert_datatype.c
gcc: numpy/core/src/multiarray/hashdescr.c
gcc: numpy/core/src/multiarray/datetime_busdaycal.c
gcc: numpy/core/src/multiarray/item_selection.c
gcc: numpy/core/src/multiarray/compiled_base.c
gcc: numpy/core/src/multiarray/dragon4.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/arraytypes.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/lowlevel_strided_loops.c
gcc: numpy/core/src/multiarray/multiarraymodule.c
gcc: numpy/core/src/multiarray/datetime.c
gcc: numpy/core/src/multiarray/dtype_transfer.c
gcc: numpy/core/src/multiarray/nditer_constr.c
gcc: numpy/core/src/multiarray/iterators.c
gcc: numpy/core/src/multiarray/refcount.c
gcc: numpy/core/src/multiarray/scalarapi.c
gcc: numpy/core/src/multiarray/nditer_pywrap.c
gcc: numpy/core/src/multiarray/sequence.c
gcc: numpy/core/src/multiarray/shape.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.c
numpy/core/src/multiarray/scalartypes.c.src: In function ‘float_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2967:27: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2967:12: error: too few arguments to function ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘cfloat_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2975:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2976 | PyArrayScalar_VAL(obj, C@name@).real);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2975:16: error: too few arguments to function ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2982 | PyArrayScalar_VAL(obj, C@name@).imag);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:16: error: too few arguments to function ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘longdouble_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2967:27: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2967:12: error: too few arguments to function ‘_Py_HashDouble’
2967 | return _Py_HashDouble((double) PyArrayScalar_VAL(obj, @name@));
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘clongdouble_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2975:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2976 | PyArrayScalar_VAL(obj, C@name@).real);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2975:16: error: too few arguments to function ‘_Py_HashDouble’
2975 | hashreal = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:31: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~
| |
| double
2982 | PyArrayScalar_VAL(obj, C@name@).imag);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2981:16: error: too few arguments to function ‘_Py_HashDouble’
2981 | hashimag = _Py_HashDouble((double)
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘half_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2997:27: error: incompatible type for argument 1 of ‘_Py_HashDouble’
2997 | return _Py_HashDouble(npy_half_to_double(PyArrayScalar_VAL(obj, Half)));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| double
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:38: note: expected ‘PyObject *’ {aka ‘struct _object *’} but argument is of type ‘double’
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src:2997:12: error: too few arguments to function ‘_Py_HashDouble’
2997 | return _Py_HashDouble(npy_half_to_double(PyArrayScalar_VAL(obj, Half)));
| ^~~~~~~~~~~~~~
In file included from /home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/Python.h:77,
from numpy/core/src/multiarray/scalartypes.c.src:3:
/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10/pyhash.h:10:23: note: declared here
10 | PyAPI_FUNC(Py_hash_t) _Py_HashDouble(PyObject *, double);
| ^~~~~~~~~~~~~~
numpy/core/src/multiarray/scalartypes.c.src: In function ‘longdouble_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2968:1: warning: control reaches end of non-void function [-Wreturn-type]
2968 | }
| ^
numpy/core/src/multiarray/scalartypes.c.src: In function ‘float_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2968:1: warning: control reaches end of non-void function [-Wreturn-type]
2968 | }
| ^
numpy/core/src/multiarray/scalartypes.c.src: In function ‘half_arrtype_hash’:
numpy/core/src/multiarray/scalartypes.c.src:2998:1: warning: control reaches end of non-void function [-Wreturn-type]
2998 | }
| ^
gcc: numpy/core/src/multiarray/temp_elide.c
gcc: numpy/core/src/multiarray/vdot.c
gcc: numpy/core/src/umath/umathmodule.c
gcc: numpy/core/src/multiarray/typeinfo.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/loops.c
gcc: numpy/core/src/multiarray/usertypes.c
gcc: numpy/core/src/multiarray/number.c
gcc: numpy/core/src/umath/reduction.c
gcc: numpy/core/src/umath/ufunc_object.c
gcc: numpy/core/src/umath/ufunc_type_resolution.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/multiarray/nditer_templ.c
gcc: numpy/core/src/multiarray/flagsobject.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/npymath/ieee754.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/npymath/npy_math_complex.c
gcc: numpy/core/src/multiarray/getset.c
gcc: numpy/core/src/umath/override.c
gcc: numpy/core/src/npymath/halffloat.c
gcc: numpy/core/src/multiarray/nditer_api.c
gcc: numpy/core/src/common/array_assign.c
gcc: numpy/core/src/common/ucsnarrow.c
gcc: numpy/core/src/npymath/npy_math.c
gcc: numpy/core/src/common/mem_overlap.c
gcc: numpy/core/src/common/ufunc_override.c
gcc: numpy/core/src/common/numpyos.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/common/npy_cpu_features.c
gcc: numpy/core/src/common/npy_longdouble.c
gcc: numpy/core/src/umath/extobj.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/scalarmath.c
gcc: numpy/core/src/multiarray/mapping.c
gcc: numpy/core/src/multiarray/methods.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/matmul.c
gcc: build/src.linux-x86_64-3.10/numpy/core/src/umath/clip.c
error: Command "gcc -pthread -B /home/ubuntu/mambaforge-pypy3/envs/vlme/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ubuntu/mambaforge-pypy3/envs/vlme/include -fPIC -O2 -isystem /home/ubuntu/mambaforge-pypy3/envs/vlme/include -fPIC -DNPY_INTERNAL_BUILD=1 -DHAVE_NPY_CONFIG_H=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -Ibuild/src.linux-x86_64-3.10/numpy/core/src/umath -Ibuild/src.linux-x86_64-3.10/numpy/core/src/npymath -Ibuild/src.linux-x86_64-3.10/numpy/core/src/common -Inumpy/core/include -Ibuild/src.linux-x86_64-3.10/numpy/core/include/numpy -Inumpy/core/src/common -Inumpy/core/src -Inumpy/core -Inumpy/core/src/npymath -Inumpy/core/src/multiarray -Inumpy/core/src/umath -Inumpy/core/src/npysort -I/home/ubuntu/mambaforge-pypy3/envs/vlme/include/python3.10 -Ibuild/src.linux-x86_64-3.10/numpy/core/src/common -Ibuild/src.linux-x86_64-3.10/numpy/core/src/npymath -c build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.c -o build/temp.linux-x86_64-3.10/build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o -MMD -MF build/temp.linux-x86_64-3.10/build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o.d" failed with exit status 1
[end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    ERROR: Failed building wheel for numpy
  Failed to build numpy
  ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
  [end of output]

Detailed results of ScienceQA-IMG

Thanks for the great effort on this repo! I see you provide zero-shot results of several MLLMs on the ScienceQA-IMG dataset. Could you please add the detailed results (i.e., NAT, SOC, LAN) for the TEST and VAL partitions?

llava_v1.5_7b wrong results on Seedbench_IMG

Hi,

I checked the saved results of llava7b on the SEEDBench_IMG benchmark and found that, for some questions, llava7b gives the right prediction, but the evaluation framework marks the answer as wrong.

For example, for index = 1198 and 4307:
[screenshot]
Which object is likely found in the boy's hand? A: A book B: A soccer ball C: A calculator D: A pencil

[screenshot]
Where is the priest located in the image? A: In front of the stained glass window B: To the right of the bride and groom C: To the left of the bride and groom D: Behind the bride and groom

Can you help explain this? Thanks for your help!

A major problem with the multiple-choice evaluation

There is a major problem with the multiple-choice evaluation.
I am testing MMBench-dev-en here. I use the result file generated by the LLaVA framework, llava_MMBench_DEV_EN.xlsx, and the result of your test here is 0.68.

Because its predictions are all single words (just the option letters), I tried to match them myself, simply checking if item['prediction'] == item['answer'], and I found that the final result is 0.77. So either your test standard is seriously wrong, or I missed something; please let me know.
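
For reference, a minimal sketch of the naive exact-match scoring described above (it assumes 'prediction' and 'answer' columns in the xlsx and is not VLMEvalKit's matching logic, which also tries to salvage free-form answers):

import pandas as pd

df = pd.read_excel("llava_MMBench_DEV_EN.xlsx")
# Strict string equality between the prediction and the ground-truth option letter.
acc = (df["prediction"].astype(str).str.strip() == df["answer"].astype(str).str.strip()).mean()
print(f"Exact-match accuracy: {acc:.4f}")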

If you want the result file to test, I can send you the result file, or you can just have a check.

[screenshot]

How to calculate the average rank?

[screenshot]
In the leaderboard, LLaVA-InternLM2-20B (QLoRA) gets a higher average score than Monkey-Chat, but Monkey-Chat ranks higher. So how is the Avg. Rank shown on the leaderboard calculated?
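
For illustration, a small pandas sketch of one common way an average rank can be computed across benchmarks (made-up scores; the leaderboard's exact benchmark list and tie handling may differ):

import pandas as pd

scores = pd.DataFrame(
    {"MMBench": [75.1, 73.0], "SEEDBench_IMG": [70.2, 71.5]},   # made-up numbers
    index=["LLaVA-InternLM2-20B (QLoRA)", "Monkey-Chat"],
)
# Rank the models within each benchmark (1 = best), then average the ranks per model.
avg_rank = scores.rank(ascending=False).mean(axis=1)
print(avg_rank)

Averaging ranks this way shows how a model can have a higher mean score yet a worse average rank: what matters is how often it is beaten on individual benchmarks, not by how much.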

OSError: Incorrect path_or_model_id: 'xtuner/llava-internlm2-20b/projector'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

torchrun --nproc-per-node=8 --nnodes=1 --node_rank=0 --master_addr 10.255.xxx.xxx --master_port 8109 run.py --data LLaVABench --model llava-internlm2-20b --verbose

But I met the following problem below:
Traceback (most recent call last):
File "/train-xxx/code/xxx/src/test_scripts/test.py", line 6, in
projector = AutoModel.from_pretrained(projector_path,
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
resolved_config_file = cached_file(
File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 462, in cached_file
raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'xtuner/llava-internlm2-20b/projector'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

I also used the following script to test and got the same error on my side.

import os.path as osp
import torch
from transformers import AutoModel

projector_path = "xtuner/llava-internlm2-20b/projector"
projector = AutoModel.from_pretrained(projector_path,
                                      trust_remote_code=True,
                                      torch_dtype=torch.float16,
                                      device_map='cpu')
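
A possible workaround, assuming 'projector' is a subfolder of the xtuner/llava-internlm2-20b repo on the Hub: from_pretrained cannot resolve 'repo_id/subfolder' strings directly, but it does accept a separate subfolder argument (untested here for this particular checkpoint):

import torch
from transformers import AutoModel

projector = AutoModel.from_pretrained(
    "xtuner/llava-internlm2-20b",
    subfolder="projector",          # load only the projector weights/config from the repo subfolder
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="cpu",
)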

IndexError: index 1 is out of bounds for dimension 0 with size 1

cur_image_features = image_features[cur_image_idx]
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1

Is there any reason why the previous version could evaluate normally, but after updating from git I get this error?

And this error happens only during evaluation, at about 6%.

Are there samples with zero images in the MMBench eval set?

MMMU test set

It would be nice if VLMEvalKit could generate results for the MMMU test set.

MathVista evaluation question

What is the difference between the Prefetch rate and Acc for MathVista-mini? In my test, the Prefetch rate is 52.8 while Acc is only 44.1.

TypeError in parallel API calling

[screenshot]

I met this error when calling the OpenAI API in parallel (nproc=4). I am not sure whether this is an error on the VLMEvalKit developers' side... Could someone give me some tips for fixing it?
