
This project is a fork of wellecks/lm-evaluation-harness.


A framework for few-shot evaluation of autoregressive language models.

License: MIT License


Llemma evaluation harness

Fork of the Eleuther LM Evaluation Harness used in Azerbayev et al. (2023).

Running the evaluation

See eval_scripts/generic_run.sh for an entrypoint to running evaluation on a model from the HuggingFace Hub.

The script shows the set of non-theorem-proving tasks:

SYMBOLIC=minerva_math*,gsm8k,ocw_courses
MUL_CHOICE=minerva-hendrycksTest*,math_sat_cot
TOOLS=sympy_math*,python_gsm8k

Refer to the lm_eval/tasks directory for the corresponding task implementations.

Theorem proving task

The informal-to-formal theorem proving task is kept in the minif2f-isabelle branch. Please see the README in this branch for further instructions.

Additions

At the time of development, this Llemma evaluation harness implemented several extensions of the Eleuther LM Evaluation Harness; note that some of these may since have been added to the upstream Harness. An incomplete list includes:

  • Support for vLLM
  • Saving generated sequences and metadata
  • Majority voting (see configs/majk.json for example usage; a sketch of the voting step follows this list)
  • Temperature and top-p sampling
  • Domain-specific evaluation (e.g. Sympy equivalence)
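
As an illustration of the majority-voting addition, maj@K scoring samples K solutions per problem (with temperature/top-p sampling), extracts a final answer from each, and keeps the most common one. Below is a minimal sketch of the voting step only, not the harness's actual implementation; the answer strings are assumed to be already extracted and normalized.

from collections import Counter

def majority_vote(extracted_answers):
    # `extracted_answers` holds the K answer strings pulled out of the
    # K sampled solutions; None marks samples with no parseable answer.
    votes = Counter(a for a in extracted_answers if a is not None)
    if not votes:
        return None
    answer, _count = votes.most_common(1)[0]
    return answer

# Example: 5 samples, three of which agree.
print(majority_vote(["42", "41", "42", None, "42"]))  # -> "42"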

Tasks Supported

Below, we detail all evaluations implemented and reported in our paper.

  • math_sat_cot: A small test set of SAT questions from the May 2023 College Board SAT examination, which occurred after the knowledge cutoff for Llemma's training set. Evaluated via chain-of-thought in natural language.
  • hendrycks_math_ppl: Perplexity evaluation on reference answers of sub-tasks of the MATH dataset.
  • minif2f_isabelle: Proof autoformalization in Isabelle on the miniF2F benchmark based on Draft-Sketch-Prove, with a Portal-to-Isabelle proof checker.
  • minerva_math*: The MATH benchmark with the prompt and Sympy evaluation from Minerva.
  • minerva-hendrycksTest*: MMLU-STEM tasks with prompting and chain-of-thought, following Minerva.
  • ocw_courses: The OCW Courses task from Minerva.
  • python_gsm8k: GSM8k solved by writing Python programs that return the numeric answer, based on PAL (see the sketch after this list).
  • sympy_math: MATH evaluation, with Sympy or Python math modules used to write a programmatic solution.
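
For example, in the python_gsm8k and sympy_math tasks above, the model is prompted to write a small program whose return value is the final numeric answer, which is then executed and compared against the reference. A hypothetical PAL-style completion for a made-up GSM8k-like word problem might look like this (illustrative only, not taken from the benchmark):

# Problem (made up): "A baker bakes 12 trays of 8 muffins and sells 70.
# How many muffins are left?"
def solution():
    muffins_baked = 12 * 8
    muffins_sold = 70
    return muffins_baked - muffins_sold

print(solution())  # -> 26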

We additionally implement the following tasks in this fork, though we do not report them in our paper due to time and space limitations:

  • lila_*: Evaluation on the Lila dataset. Note that this requires executing model-generated code.
  • proofnet*: Evaluation on the ProofNet dataset for both autoformalization and informalization. Informalization requires GPT-3.5 evaluation and an OpenAI API key.

Quick Replication Instructions

Maj@1

To run the model on the desired tasks with a single attempt per problem, run the following sample command with your model.

MODEL=EleutherAI/llemma_7b # your HF Hub model path here
TASK=minerva_math* # select tasks as desired. This codebase supports wildcard task names.
OUT=</path/to/save/outputs>

python main.py --no_cache --model vllm --model_args pretrained=${MODEL} --tasks $TASK --output_path ${OUT} --tp_degree ${TP_DEGREE}

Maj@K

To replicate Maj@K task results, additionally pass --description_dict_path configs/majk.json to run majority voting with K attempts.

MODEL=EleutherAI/llemma_7b # your HF Hub model path here
TASK=minerva_math* # select tasks as desired. This codebase supports wildcard task names.
OUT=</path/to/save/outputs>

python main.py --no_cache --model vllm --model_args pretrained=${MODEL} --tasks $TASK --output_path ${OUT} --tp_degree ${TP_DEGREE} --description_dict_path ${HARNESS_DIR}/configs/majk.json

TP_DEGREE sets the tensor-parallel degree, i.e. how many GPUs vLLM will use; set it as needed for your hardware.

Be sure to set $OUT to the desired save location for scores and model output text.

Answer Checking + Scoring

Due to the heavy CPU burden, we do not calculate metrics during the main evaluation run for tasks like minerva_math that rely on checking correctness via SymPy equivalence, or for tasks like sympy_math or python_gsm8k that require executing model-generated Python code.

After running the model on one of these tasks, we provide utilities to perform answer checking.

Note that SymPy answer checking can be quite heavy on CPU resources and time-consuming for Maj@K at high K.
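
For intuition, the SymPy equivalence check used by these tasks boils down to parsing the model's final answer and the reference answer and testing whether their difference simplifies to zero. The following is a minimal sketch of that idea, not the exact Minerva-style checker implemented in this repository:

import sympy
from sympy.parsing.sympy_parser import parse_expr

def is_equiv(candidate: str, reference: str) -> bool:
    # Two answer strings are treated as equivalent if their difference
    # simplifies to zero; unparseable answers fall back to string equality.
    try:
        diff = parse_expr(candidate) - parse_expr(reference)
        return sympy.simplify(diff) == 0
    except Exception:
        return candidate.strip() == reference.strip()

print(is_equiv("2*(x + 1)", "2*x + 2"))  # -> True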

🚨 WARNING: unsafe_score_sympy_math.py and unsafe_score_python_gsm.py will execute model-written Python code! Please use in a sandbox and at your own risk.

🚨 WARNING: scoring scripts modify eval-harness output JSONs in-place. Back up your results files and use with caution!

To score sympy_math outputs, run:

python unsafe_score_sympy_math.py --output <path-to-results-with-sympy_math>

To score python_gsm8k outputs, run:

python unsafe_score_python_gsm.py --output <path-to-results-with-python_gsm8k>

These scripts take in a results file, read in the LM's generated programs, and execute them to check for correctness. Each script then incorporates the per-sample and full-task accuracies into the results file and rewrites the entire file with these values added.
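
The general pattern for this kind of check is to run each generated program in a separate process with a timeout and compare its printed output to the reference answer. The sketch below shows that pattern in broad strokes; it is not the actual code of the unsafe_score_*.py scripts, and it is still unsafe to run outside a sandbox:

import subprocess
import sys
from typing import Optional

def run_generated_program(code: str, timeout_s: int = 10) -> Optional[str]:
    # Execute model-written Python in a child process and capture stdout.
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.strip()

def is_correct(code: str, reference: str) -> bool:
    output = run_generated_program(code)
    return output is not None and output == reference.strip()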

All scripts accept a --limit X flag to score only the first X documents.

Due to the high resource cost of scoring MATH with SymPy, unsafe_score_minerva_math.py has additional requirements.

To run SymPy scoring using multiprocessing, run:

python unsafe_score_minerva_math.py --output <path-to-math-result.json>

To run in a single process, run:

python unsafe_score_minerva_math.py --output <path-to-math-result.json> --no_multiprocessing

Additionally, MATH scoring with SymPy is resumable: results and pass rates/accuracies for each document are saved to the results file in place, and by default the script will not rescore already-scored documents.
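
Conceptually, the resumable multiprocessing flow looks like the sketch below: load the results JSON, skip documents that already carry a score, check the remainder in a worker pool, and write the file back in place. The field names here ("documents", "model_answer", "reference_answer", "acc") and the sympy_equiv module are hypothetical, and is_equiv stands for a SymPy equivalence check like the one sketched earlier; the real script's schema and checker differ in detail.

import json
from multiprocessing import Pool

from sympy_equiv import is_equiv  # hypothetical module holding the earlier is_equiv sketch

def score_doc(doc):
    doc["acc"] = int(is_equiv(doc["model_answer"], doc["reference_answer"]))
    return doc

def score_results_file(path, processes=8):
    with open(path) as f:
        results = json.load(f)
    todo = [d for d in results["documents"] if "acc" not in d]  # resumability: skip scored docs
    with Pool(processes) as pool:
        scored = pool.map(score_doc, todo)
    for original, updated in zip(todo, scored):
        original.update(updated)  # copy scores back onto the in-memory docs
    with open(path, "w") as f:  # rewrite the results file in place
        json.dump(results, f)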

Aggregation

Finally, we provide utilities for aggregating MMLU and MATH scores across subtasks, aggregating at the sample level rather than the subset level.
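
Sample-level aggregation here means pooling every question across sub-tasks and computing one accuracy over the pooled set (a micro-average), rather than averaging the per-subset accuracies (a macro-average), so larger sub-tasks carry proportionally more weight. A minimal sketch of the difference, using made-up per-subset counts:

# (n_correct, n_total) for two hypothetical sub-tasks.
subsets = {"algebra": (60, 100), "geometry": (5, 20)}

# Subset-level (macro) average: mean of per-subset accuracies.
macro = sum(c / n for c, n in subsets.values()) / len(subsets)

# Sample-level (micro) average: pool all samples first.
micro = sum(c for c, _ in subsets.values()) / sum(n for _, n in subsets.values())

print(f"macro = {macro:.3f}, micro = {micro:.3f}")  # macro = 0.425, micro = 0.542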

To aggregate MMLU-STEM scores, run:

python score_mmlu.py <path-to-mmlu-results.json> 

To aggregate MATH scores, run:

python score_math.py <path-to-math-subtask-1.json>,<path-to-math-subtask-2-and-3.json>,...

Citation

Please cite the Llemma paper if you find code from this fork useful to your work:

@article{
}

Please cite the Eleuther LM Evaluation Harness using:

@software{eval-harness,
  author       = {Gao, Leo and
                  Tow, Jonathan and
                  Biderman, Stella and
                  Black, Sid and
                  DiPofi, Anthony and
                  Foster, Charles and
                  Golding, Laurence and
                  Hsu, Jeffrey and
                  McDonell, Kyle and
                  Muennighoff, Niklas and
                  Phang, Jason and
                  Reynolds, Laria and
                  Tang, Eric and
                  Thite, Anish and
                  Wang, Ben and
                  Wang, Kevin and
                  Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = sep,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.5371628},
  url          = {https://doi.org/10.5281/zenodo.5371628}
}
