
License: Apache License 2.0



Hierarchical Prompting Taxonomy

A Universal Evaluation Framework for Large Language Models
Table of Contents
  1. News
  2. Introduction
  3. Demo
  4. Installation
  5. Usage
  6. Datasets and Models
  7. Benchmark Results
  8. References
  9. Contributing
  10. Cite Us

News

[06-18-24] HPT is published! Check out the paper here.

↑ Back to Top ↑

Introduction

Hierarchical Prompting Taxonomy (HPT) is a universal evaluation framework for large language models (LLMs). It evaluates LLM performance across a variety of tasks and datasets, assigning an HP-Score for each dataset relative to different models. HPT employs the Hierarchical Prompt Framework (HPF), which supports a wide range of tasks, including question answering, reasoning, translation, and summarization, and provides a set of pre-defined prompting strategies tailored to each task based on its complexity. Refer to the paper at: https://arxiv.org/abs/2406.12644


Features of HPT

  • Universal Evaluation Framework: HPT can support a wide range of datasets and LLMs.
  • Hierarchical Prompt Framework: HPF is the set of prompting strategies, tailored to each task based on its complexity, that HPT employs. HPF is available in two modes, manual and adaptive; adaptive HPF uses an LLM (the prompt-selector) to choose the best prompting strategy for a given task.
  • HP-Score: HPT assigns an HP-Score for each dataset relative to different agents (including LLMs and humans). The HP-Score measures an agent's capability to perform the tasks in a dataset; a lower HP-Score indicates better performance on the dataset.
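As an illustration of the "lower is better" property, suppose each dataset example is tagged with the hierarchy level (1 = simplest prompt, 5 = most complex) at which an agent first answers correctly; a mean-level score then behaves like the HP-Score described above. This is a toy sketch, not the paper's exact formula, and the level assignments are assumed inputs:

```python
def hp_score(levels):
    """Toy HP-Score: mean hierarchy level at which each example was solved.

    levels -- list of ints, the level (1..5) at which the agent first
    produced a correct answer; a lower mean means a stronger agent.
    (Illustrative only; see the paper for the exact scoring rule.)
    """
    if not levels:
        raise ValueError("need at least one example")
    return sum(levels) / len(levels)
```

Under this sketch, an agent that solves most examples with a simple role prompt scores near 1, while one that needs Least-to-Most prompting throughout scores near 4.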

↑ Back to Top ↑

Demo

Refer to the examples directory for using the framework on different datasets and models.

↑ Back to Top ↑

Installation

Cloning the Repository

To clone the repository, run the following command:

git clone https://github.com/devichand579/HPT.git

↑ Back to Top ↑

Usage

Linux

To get started on a Linux setup, follow these setup commands:

  1. Activate your conda environment:

    conda activate hpt
  2. Navigate to the main codebase

    cd HPT/hierarchical_prompt
  3. Install the dependencies

    pip install -r requirements.txt
  4. Add your Hugging Face token

    • Create a .env file in the conda environment
    HF_TOKEN = "your HF Token"
  5. To run both frameworks, use the following command structure

    bash run.sh method model dataset [--thres num]
    • method

      • man
      • auto
    • model

      • llama3
      • phi3
      • gemma
      • mistral
    • dataset

      • boolq
      • csqa
      • iwslt
      • samsum
    • If the dataset is iwslt or samsum, add '--thres num'

    • num

      • 0.15
      • 0.20
      • or higher thresholds beyond those used in our experiments.
    • Example commands:

      bash run.sh man llama3 iwslt --thres 0.15
      bash run.sh auto phi3 boolq 
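Step 4's .env file holds a simple KEY = "value" entry. A minimal stdlib-only loader for that format might look like the following (the project itself may rely on a library such as python-dotenv; this is only a sketch of what the step accomplishes):

```python
import os

def load_env(path=".env"):
    # Parse simple KEY = "value" lines from a .env file and export
    # them into the process environment, skipping blanks and # comments.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

# After load_env(), os.environ["HF_TOKEN"] is visible to the
# Hugging Face libraries that download the models in step 5.
```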

↑ Back to Top ↑

Datasets and Models

HPT currently supports the following datasets, models, and prompt engineering methods employed by HPF. You are welcome to add more.

Datasets

  • Question-answering datasets:
    • BoolQ
  • Reasoning datasets:
    • CommonsenseQA
  • Translation datasets:
    • IWSLT-2017 en-fr
  • Summarization datasets:
    • SamSum

Models

  • Language models:
    • Llama 3 8B
    • Mistral 7B
    • Phi 3 3.8B
    • Gemma 7B

Prompt Engineering

  • Role Prompting [1]
  • Zero-shot Chain-of-Thought Prompting [2]
  • Three-shot Chain-of-Thought Prompting [3]
  • Least-to-Most Prompting [4]
  • Generated Knowledge Prompting [5]
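The manual HPF walks these five strategies from simplest to most complex until one yields a correct answer, and the level reached feeds the HP-Score. A schematic of that control flow, with `run_strategy` and `is_correct` as hypothetical stand-ins for the framework's model-call and evaluation code:

```python
# Strategies ordered by increasing complexity, per the list above.
STRATEGIES = [
    "role_prompting",        # [1]
    "zero_shot_cot",         # [2]
    "three_shot_cot",        # [3]
    "least_to_most",         # [4]
    "generated_knowledge",   # [5]
]

def hpf_level(example, run_strategy, is_correct):
    # Try each strategy in order; return the 1-based level that first
    # succeeds, or one past the last level if every strategy fails
    # (acting as a penalty level).
    for level, name in enumerate(STRATEGIES, start=1):
        answer = run_strategy(name, example)
        if is_correct(answer, example):
            return level
    return len(STRATEGIES) + 1
```

The adaptive mode replaces this linear walk with a prompt-selector LLM that picks a strategy directly.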

↑ Back to Top ↑

Benchmark Results

The benchmark results for different datasets and models are available in the leaderboard.

↑ Back to Top ↑

References

  1. Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R., & Zhou, X. (2023). Better Zero-Shot Reasoning with Role-Play Prompting. ArXiv, abs/2308.07702.
  2. Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. ArXiv, abs/2205.11916.
  3. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E.H., Xia, F., Le, Q., & Zhou, D. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models. ArXiv, abs/2201.11903.
  4. Zhou, D., Scharli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Bousquet, O., Le, Q., & Chi, E.H. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. ArXiv, abs/2205.10625.
  5. Liu, J., Liu, A., Lu, X., Welleck, S., West, P., Le Bras, R., Choi, Y., & Hajishirzi, H. (2021). Generated Knowledge Prompting for Commonsense Reasoning. Annual Meeting of the Association for Computational Linguistics.

↑ Back to Top ↑

Contributing

This project aims to build open-source evaluation frameworks for assessing LLMs and other agents. Contributions and suggestions are welcome; please see the details on how to contribute.

If you are new to GitHub, here is a detailed guide on getting involved with development on GitHub.

↑ Back to Top ↑

Cite Us

If you find our work useful, please cite us !

@misc{budagam2024hierarchical,
      title={Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models}, 
      author={Devichand Budagam and Sankalp KJ and Ashutosh Kumar and Vinija Jain and Aman Chadha},
      year={2024},
      eprint={2406.12644},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

↑ Back to Top ↑

Contributors

ashu1069, devichand579

