logikon-ai / cot-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

Home Page: https://huggingface.co/spaces/logikon/open_cot_leaderboard

License: MIT License

Languages: Python 7.74%, Shell 1.16%, Jupyter Notebook 90.97%, Dockerfile 0.12%

Topics: chain-of-thought, gen-ai, leaderboard, llm, llms-benchmarking, llms-reasoning

cot-eval's Issues

Evaluate: allenai/tulu-2-70b

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-70b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=64
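
The parameter block above is a list of plain KEY=VALUE lines. As a minimal sketch, assuming the block is available as text, a request like this could be parsed and sanity-checked before a run is launched. The helper name `parse_eval_params` and the set of required keys are hypothetical and not taken from the cot-eval code base.

```python
# Keys that appear in every evaluation request in this issue list.
REQUIRED_KEYS = {
    "NEXT_MODEL_PATH",
    "NEXT_MODEL_REVISION",
    "NEXT_MODEL_PRECISION",
    "MAX_LENGTH",
    "GPU_MEMORY_UTILIZATION",
    "VLLM_SWAP_SPACE",
}


def parse_eval_params(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    params: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()  # also drops trailing spaces such as "2048 "
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        params[key.strip()] = value.strip()
    missing = REQUIRED_KEYS - set(params)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    # Reject requests where the template placeholder was never filled in.
    if "<" in params["NEXT_MODEL_PATH"]:
        raise ValueError("NEXT_MODEL_PATH still contains a template placeholder")
    return params
```

A check of this kind would also catch requests whose NEXT_MODEL_PATH was left as `<org>/<model>`.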

Evaluate: allenai/tulu-2-dpo-13b

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-dpo-13b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

add pipeline-parallel-size

Add --pipeline-parallel-size for vLLM to use multi-GPU resources efficiently.
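
A rough sketch of how the two parallelism settings could be derived together, assuming the total number of GPUs is split as tensor-parallel size times pipeline-parallel size. The helper `parallel_kwargs` is hypothetical. The keys `pipeline_parallel_size` and `tensor_parallel_size` mirror vLLM's engine argument names, but whether the installed vLLM version supports pipeline parallelism for a given model should be checked against its documentation.

```python
def parallel_kwargs(num_gpus: int, pipeline_parallel_size: int = 1) -> dict[str, int]:
    """Split num_gpus into pipeline stages and tensor-parallel shards per stage."""
    if pipeline_parallel_size < 1 or num_gpus % pipeline_parallel_size != 0:
        raise ValueError(
            f"{num_gpus} GPUs cannot be split into "
            f"{pipeline_parallel_size} equal pipeline stages"
        )
    return {
        "pipeline_parallel_size": pipeline_parallel_size,
        # The remaining GPUs within each stage are used for tensor parallelism.
        "tensor_parallel_size": num_gpus // pipeline_parallel_size,
    }
```

With 8 GPUs and 2 pipeline stages, this yields a tensor-parallel size of 4 per stage.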

Together.ai support

@ggbetz Adding this so we don't forget.

This includes:

integrating together.ai with langchain (you said this was already done)
making it compatible with eval-harness (my thought is to use together.ai's OpenAI-compatible API, which should work with the harness)

Evaluate: 01-ai/Yi-34B

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=01-ai/Yi-34B
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=8

Evaluate: NousResearch/Nous-Hermes-Llama2-70b

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=NousResearch/Nous-Hermes-Llama2-70b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

check validity of token early on, not after traces have been generated
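
One possible shape for such an early check, sketched under the assumption that the token in question is a Hugging Face Hub token. The helper name `check_token_early` is hypothetical. The default path uses `HfApi().whoami`, which confirms that the token is accepted by the Hub but does not by itself prove write access to a particular repository.

```python
from typing import Callable, Optional


def check_token_early(
    token: Optional[str],
    whoami: Optional[Callable[[str], dict]] = None,
) -> dict:
    """Fail fast, before any traces are generated, if the token is unusable."""
    if not token:
        raise RuntimeError("no upload token configured")
    if whoami is None:
        # Imported lazily so the helper can be exercised without network access.
        from huggingface_hub import HfApi

        def whoami(t: str) -> dict:
            return HfApi().whoami(token=t)

    try:
        return whoami(token)
    except Exception as err:
        raise RuntimeError("token was rejected; aborting before generation") from err
```

Calling this at the start of the pipeline means a bad token costs seconds rather than a full generation run.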

Evaluate: upstage/SOLAR-10.7B-v1.0

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=upstage/SOLAR-10.7B-v1.0
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: databricks/dbrx-base

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
Wait for dbrx support in vllm, update container
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=<org>/<model>
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

propagate vllm_kwargs to cot-configs

cot-configs have a vllm_kwargs sub-dict, but it is static and does not reflect the vLLM arguments given in config.env.
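
A minimal sketch of what the propagation could look like, assuming the values from config.env are available as a string-to-string mapping. The helper `merge_vllm_kwargs` and the mapping table are hypothetical. The target names `gpu_memory_utilization`, `swap_space` and `dtype` are vLLM engine arguments, but the pairing with these environment variables is an assumption and not confirmed by the cot-eval code.

```python
# Assumed mapping from config.env variables to vLLM keyword arguments.
ENV_TO_VLLM = {
    "GPU_MEMORY_UTILIZATION": ("gpu_memory_utilization", float),
    "VLLM_SWAP_SPACE": ("swap_space", int),
    "NEXT_MODEL_PRECISION": ("dtype", str),
}


def merge_vllm_kwargs(static_kwargs: dict, env: dict[str, str]) -> dict:
    """Return a copy of static_kwargs with values from env taking precedence."""
    merged = dict(static_kwargs)
    for env_key, (kwarg, cast) in ENV_TO_VLLM.items():
        if env_key in env:
            merged[kwarg] = cast(env[env_key].strip())
    return merged
```

Letting the environment values override the static defaults keeps the stored config consistent with what was actually passed to vLLM for that run.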

wandb integration

to optionally log GPUSTATS (and more...)

running evals in parallel

make sure that running cot-evals in parallel doesn't create conflicts when uploading final results (or earlier on)
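
One way to reduce the risk of such conflicts, sketched here as an assumption and not as the approach taken in cot-eval, is to give every run its own output path so that parallel runs never write to the same file. The helper `unique_result_path` and the `results/` layout are hypothetical.

```python
import uuid
from typing import Optional


def unique_result_path(model_path: str, run_id: Optional[str] = None) -> str:
    """Build a per-run result path so parallel runs do not overwrite each other."""
    run_id = run_id or uuid.uuid4().hex  # random identifier unless one is supplied
    safe_model = model_path.replace("/", "__")  # keep org and model in one path segment
    return f"results/{safe_model}/{run_id}.json"
```

Separate files per run avoid overwrites, but concurrent pushes to the same repository may still need retries or pull requests to be merged one at a time.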

Evaluate: meta-llama/Llama-2-70b-hf

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-70b-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Evaluate: ai21labs/Jamba-v0.1

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
Wait for Jamba support in vllm or implement HF transformers evaluation, and update cot-eval container
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=<org>/<model>
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

not enough swap space issue

when evaluating microsoft/orca-7b

Evaluate: 01-ai/Yi-34B-Chat

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=01-ai/Yi-34B-Chat
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=8

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: openchat/openchat-3.5-0106-gemma

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=openchat/openchat-3.5-0106-gemma
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: upstage/SOLAR-10.7B-Instruct-v1.0

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=upstage/SOLAR-10.7B-Instruct-v1.0
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: google/gemma-7b-it

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=google/gemma-7b-it
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: allenai/tulu-2-dpo-7b

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-dpo-7b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: openbmb/Eurus-70b-sft

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=openbmb/Eurus-70b-sft
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: meta-llama/Llama-2-13b-hf

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-13b-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: NousResearch/Nous-Hermes-Llama2-13b

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=NousResearch/Nous-Hermes-Llama2-13b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: mistralai/Mixtral-8x7B-v0.1

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=mistralai/Mixtral-8x7B-v0.1
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

Evaluate: mistralai/Mixtral-8x7B-Instruct-v0.1

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=mistralai/Mixtral-8x7B-Instruct-v0.1
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

Evaluate: openbmb/Eurus-7b-kto

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=openbmb/Eurus-7b-kto
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: allenai/tulu-2-13b

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-13b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: Qwen/Qwen-72B-Chat

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen-72B-Chat
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: allenai/tulu-2-dpo-70b

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-dpo-70b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Evaluate: Qwen/Qwen1.5-14B

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen1.5-14B
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: Qwen/Qwen1.5-72B

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen1.5-72B
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: meta-llama/Llama-2-13b-chat-hf

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-13b-chat-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: databricks/dbrx-instruct

Check upon issue creation:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
Wait for dbrx support in vllm, update container
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=<org>/<model>
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

Run cot-eval pipeline
Merge pull requests for cot-eval results datasets (> @ggbetz)
Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: google/gemma-7b

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=google/gemma-7b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: meta-llama/Llama-2-70b-chat-hf

Check:

The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
There is no evaluation request issue for the model in the repo.
The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-70b-chat-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16
