GithubHelp home page GithubHelp logo

logikon-ai / cot-eval Goto Github PK

View Code? Open in Web Editor NEW
5.0 2.0 1.0 1.43 MB

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

Home Page: https://huggingface.co/spaces/logikon/open_cot_leaderboard

License: MIT License

Python 7.74% Shell 1.16% Jupyter Notebook 90.97% Dockerfile 0.12%
chain-of-thought gen-ai leaderboard llm llms-benchmarking llms-reasoning

cot-eval's People

Contributors

ggbetz avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar

cot-eval's Issues

Evaluate: allenai/tulu-2-70b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-70b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=64

Evaluate: allenai/tulu-2-dpo-13b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-dpo-13b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Together.ai support

@ggbetz Adding this so we don't forget.

This includes

  • integrating together.ai with langchain (you said this was already done)
  • Making compatible with eval-harness (my thought is to use together.ai OpenAI feature, which should work with harness).

Evaluate: 01-ai/Yi-34B

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=01-ai/Yi-34B
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=8

Evaluate: NousResearch/Nous-Hermes-Llama2-70b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=NousResearch/Nous-Hermes-Llama2-70b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Evaluate: upstage/SOLAR-10.7B-v1.0

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=upstage/SOLAR-10.7B-v1.0
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: databricks/dbrx-base

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • Wait for dbrx support in vllm, update container
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=<org>/<model>
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

running evals in parallel

make sure that running cot-evals in parallel doesn't create conflicts when uploading final results (or earlier on)

Evaluate: meta-llama/Llama-2-70b-hf

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-70b-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Evaluate: ai21labs/Jamba-v0.1

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • Wait for Jamba support in vllm or implement HF transformers evaluation, and update cot-eval container
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=<org>/<model>
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: 01-ai/Yi-34B-Chat

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=01-ai/Yi-34B-Chat
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=8

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: openchat/openchat-3.5-0106-gemma

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=openchat/openchat-3.5-0106-gemma
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: upstage/SOLAR-10.7B-Instruct-v1.0

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=upstage/SOLAR-10.7B-Instruct-v1.0
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: google/gemma-7b-it

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=google/gemma-7b-it
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: allenai/tulu-2-dpo-7b

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-dpo-7b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: openbmb/Eurus-70b-sft

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=openbmb/Eurus-70b-sft
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: meta-llama/Llama-2-13b-hf

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-13b-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: NousResearch/Nous-Hermes-Llama2-13b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-13b-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: mistralai/Mixtral-8x7B-v0.1

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=mistralai/Mixtral-8x7B-v0.1
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

Evaluate: mistralai/Mixtral-8x7B-Instruct-v0.1

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=mistralai/Mixtral-8x7B-Instruct-v0.1
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

Evaluate: openbmb/Eurus-7b-kto

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=openbmb/Eurus-7b-kto
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: allenai/tulu-2-13b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-13b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: Qwen/Qwen-72B-Chat

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen-72B-Chat
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: allenai/tulu-2-dpo-70b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=allenai/tulu-2-70b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Evaluate: Qwen/Qwen1.5-14B

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen1.5-14B
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: Qwen/Qwen1.5-72B

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=Qwen/Qwen1.5-72B
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: meta-llama/Llama-2-13b-chat-hf

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-13b-chat-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: databricks/dbrx-instruct

Check upon issue creation:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • Wait for dbrx support in vllm, update container
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=<org>/<model>
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=4

ToDos:

  • Run cot-eval pipeline
  • Merge pull requests for cot-eval results datats (> @ggbetz)
  • Create eval request record to update metadata on leaderboard (> @ggbetz)

Evaluate: google/gemma-7b

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=google/gemma-7b
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=6

Evaluate: meta-llama/Llama-2-70b-chat-hf

Check:

  • The model has not been evaluated yet and doesn't show up on the CoT Leaderboard.
  • There is no evaluation request issue for the model in the repo.
  • The parameters below have been adapted and shall be used.

Parameters:

NEXT_MODEL_PATH=meta-llama/Llama-2-70b-chat-hf
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=float16
MAX_LENGTH=2048 
GPU_MEMORY_UTILIZATION=0.8
VLLM_SWAP_SPACE=16

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.