binliang2021 / seq2seq_llm_evaluation

This project forked from protagolabs/seq2seq_llm_evaluation


This repository contains the code used to produce the results for the paper "Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks", published at EMNLP 2023.

License: MIT License



seq2seq_llm_evaluation

This project contains the code used to produce the results for the paper "Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks", accepted at EMNLP 2023 (main conference). It also contains the full instructions provided to human reviewers and GPT-4 for model evaluation, in the main/human_and_gpt4_evaluation/instructions_to_human_reviewers_and_gpt4 folder. For any questions about the code, please contact Andrea Sottana.

Please use the following BibTeX entry when referencing this paper. It is currently based on the arXiv preprint and will be updated once the peer-reviewed publication link becomes available.

@article{sottana2023evaluation,
      title={Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks}, 
      author={Andrea Sottana and Bin Liang and Kai Zou and Zheng Yuan},
      journal={arXiv preprint arXiv:2310.13800},
      year={2023},
      eprint={2310.13800},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

The project is structured as follows; each Python module has a docstring at the top explaining its use case.

  • data: This folder hosts all the data. As mentioned in the notes below, we have not included raw data for the full study, only some processed data for the human and GPT-4 evaluation sub-study.
  • main: The folder hosting the main code in the subfolders below.
    • text_generation: This folder contains the modules that prompt the LLMs to generate the main outputs to be evaluated for the three tasks: text summarisation, text simplification and grammatical error correction.
    • data_processing: This folder contains all utils and miscellaneous modules used for data preprocessing. In particular, newsela_preprocessing.py should be run before any files in the text_generation folder, and merge_outputs.py should be run after the files in the text_generation folder but before the files in the automatic_evaluation folder. Every other file in this folder is used to prepare the data for the human evaluation study using the Potato annotation tool.
    • automatic_evaluation: This folder contains the code used to reproduce the automatic metrics results, including calculating the T-test between the various distributions.
    • human_and_gpt4_evaluation: This folder contains the code used to prompt GPT-4 to evaluate the LLMs' outputs, and to generate the human evaluation statistics displayed in the paper, as well as the inter-annotator agreement.
      • instructions_to_human_reviewers_and_gpt4: This subfolder contains the instructions given to human reviewers, and the prompts used for GPT-4 model-to-model evaluation. The instructions to human reviewers are reported in html files, as they were included in the UI of the Potato annotation tool. The code for the human evaluation UI, which is the Potato code with some minor modifications, is not included in this repository.
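The run-order constraints described in the data_processing bullet above can be sketched as a small driver script. The folder and script names below are taken from this README, but the exact entry points and command-line invocations are assumptions, not part of the original codebase:

```python
# Hypothetical pipeline driver reflecting the run order described above.
# Folder and script names come from this README; the CLI invocations
# themselves are assumptions.
import subprocess

STAGES = [
    "data_processing/newsela_preprocessing.py",  # must run before text_generation
    "text_generation",                           # prompt the LLMs for the three tasks
    "data_processing/merge_outputs.py",          # merge before automatic_evaluation
    "automatic_evaluation",                      # automatic metrics and t-tests
    "human_and_gpt4_evaluation",                 # GPT-4 prompting and human-eval stats
]

def pipeline_commands(dry_run=True):
    """Return the ordered commands; execute them only when dry_run=False."""
    cmds = []
    for stage in STAGES:
        if stage.endswith(".py"):
            cmd = ["python", f"main/{stage}"]
        else:
            # Directory stages contain several modules; run the ones you need.
            cmd = ["echo", f"run the modules in main/{stage}/"]
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds
```

With dry_run=True (the default) the function only returns the ordered command list, which makes the required ordering easy to inspect without touching any data.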

Notes

  • This codebase as it stands is not sufficient to reproduce our results in full without modifications. Sometimes manually changing minor parameters (such as a model's temperature) will be required to reproduce the full spectrum of results.

  • We have not included the raw data. You will need to source the data files yourself and place them in the data folder before running the code. All datasets used are open-source, but some must be requested directly from the original owners. We have, however, included the human and GPT-4 evaluation outputs in the data/outputs folder; these can be used to reproduce the human evaluation statistics discussed in our paper.

    • The CNN/Daily Mail dataset for text summarisation can be downloaded from Kaggle (link here) or HuggingFace (link here).
    • The Newsela dataset for text simplification must be requested via this link.
    • The BEA-2019 Shared Task dataset for grammatical error correction can be downloaded via this link. Note that this dataset requires some processing before it can be used, and you should follow the instructions in the link above to generate the M2 file. We have not included data processing code where it is provided by the dataset's authors, unless we made specific modifications required to reproduce our results. We do, however, expect anyone downloading the BEA-2019 dataset to carry out the preprocessing steps independently, as described in the link above and at this link, before attempting to reproduce our results.
  • In order to generate a UI for the human evaluation study, we used the Potato annotation tool. We had to make some minor front-end modifications to suit our use case; however, as these were minor changes largely embedded within a fork of their project, we have not included that code here. This does not affect the reproducibility of our results, as the Potato code only generates the UI to display the samples for the human evaluation study, and researchers can use their preferred tool for this purpose.

  • This project is released under the MIT Licence.
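The human_and_gpt4_evaluation folder mentioned above computes inter-annotator agreement, but this README does not say which coefficient the paper uses. As a purely illustrative sketch, Cohen's kappa for two annotators can be computed with the standard library alone:

```python
# Illustrative inter-annotator agreement: Cohen's kappa for two annotators.
# Generic sketch only; the coefficient actually used in the paper may differ.
def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items where the two annotators agree.
    p_o = sum(x == y for x, y in zip(ann_a, ann_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    labels = set(ann_a) | set(ann_b)
    p_e = sum((ann_a.count(l) / n) * (ann_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
```

A kappa of 1 indicates perfect agreement, while 0 indicates agreement no better than chance given each annotator's label distribution.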

seq2seq_llm_evaluation's People

Contributors

andreasottana, binliang2021
