GithubHelp home page GithubHelp logo

shan23chen / healthllm_eval Goto Github PK

View Code? Open in Web Editor NEW
14.0 2.0 0.0 12.28 MB

Github repo for the Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification, code and data | Paper: https://arxiv.org/pdf/2304.02496.pdf

Python 100.00%

healthllm_eval's Introduction

Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification

Main Contributions:

  1. Explore accessible LLMs abilities on more dense linguistic tasks among clinical texts other than QA
  2. Use state-of-the-art LLMs, such as GPT-3.5/GPT-4, and adapt them to the task using in-context learning
  3. Compare the results of these models across tasks to provide insights for future research and offer guidance on task appropriation model training
  4. Provide an open source biomedical dataset with prompts and associated outputs generated by GPT series models

Objective/Purpose:

Artificial intelligence applications in biomedicine require reliable performance with minimal errors. The recent advances in large language models (LLMs) have shown impressive ability in clinical question-answering (QA), but have not been adequately investigated for more specific biomedical use cases. This study aims to investigate the performance of LLMs, such as the ChatGPT family of models (GPT-3.5, GPT-4) in frequent biomedical tasks beyond QA: reasoning and classification. We investigate the practical use of these models for healthcare institutions and researchers with standard computational resources, and compare their performance with classic, less computationally intensive methods.

Materials/Methods:

Because the openAI API is a public interface, no patient data can be passed to it. Thus, we evaluated model performance on two tasks with a total of >10,000 samples as proxies for two of the most frequent tasks in the clinical domain โ€“ classification and reasoning (e.g. determining causative relationships). The first task was classifying whether statements of clinical and policy recommendations in scientific literature constitute health advice across 3 datasets. The second task was causal relation extraction from the biomedical literature in 1 dataset. We used 20% of the entire dataset as development (dev) data for the prompts under the settings of zero- and few- shots with and without chain of thought (CoT). The most effective prompt from each setting was evaluated with the remaining 80% split. We compared models using simple features (bag-of-words (BoW)) with logistic regression, and fine-tuned BioBERT models.

Results:

Overall, fine-tuning BioBERT yielded the best results for the classification (0.80-0.90) and reasoning (F1 0.85) tasks. Of the LLM approaches, few-shot with CoT prompting yielded the best results for the classification (F1 0.67-0.77) and reasoning (F1 0.68) tasks, with results comparable to the BoW model (F1 0.60-0.75 and 0.68 for classification and reasoning, respectively) as table 1 and figure 1(all are in Macro F1). chart 0 Figure 1 The total time needed to achieve the best-performing LLM results was 78 hours, compared to 0.08 and 0.01 hours to develop the best-performing BioBERT and BoW models, respectively (as figure 2). Figure 2

The total cost of the experiments through OpenAI API call was $1,299.18 USD based on March 2023 pricing.

Conclusions:

Despite the hype around ChatGPT, we found that fine-tuning smaller models for our fundamental NLP tasks remained the best strategy in terms of performance and time efficiency. The simple BoW feature extractor performed on par with the most complex LLM prompting. Prompt engineering required significant time investment due to its complexity which should be taken into consideration in potential use of LLMs.

Code structure:

The structure of this repo is quite self-explanatory.

  • All the prompts are under prompts folder
  • All the original data and folds are under data folder
  • All the development set outputs are under dev_outputs folder
  • All the test set outputs are under test_outputs folder
  • src holds all the .py files for all the experiment runs

Funding declaration:

R01GM114355: Funding for personnel. No involvement in manuscript writing or decision to submit for publication.

R01LM013486: Funding for personnel. No involvement in manuscript writing or decision to submit for publication.

Woods Foundation: Funding for personnel. No involvement in manuscript writing or decision to submit for publication.

How to cite us:

@misc{chen2023evaluation,
      title={Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification},
      author={Shan Chen and Yingya Li and Sheng Lu and Hoang Van and Hugo JWL Aerts and Guergana K. Savova and Danielle S. Bitterman},
      year={2023},
      eprint={2304.02496},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

healthllm_eval's People

Contributors

yingyali avatar shan23chen avatar vanh17 avatar daniellebitt avatar

Stargazers

Ivan William Harsono avatar Zeng haolong avatar  avatar  avatar Brian Cajes avatar Peter Clarke avatar Jeff Hammerbacher avatar  avatar Madhumita Sushil avatar  avatar Elmira Ghorbani avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.