GithubHelp home page GithubHelp logo

nickriccardi / two-word-test Goto Github PK

View Code? Open in Web Editor NEW
5.0 1.0 0.0 8.41 MB

Two Word Test: Combinatorial Semantic Benchmark for LLMs

License: GNU General Public License v3.0

Jupyter Notebook 100.00%
benchmark gpt-3-5-turbo gpt-4 claude-ai claude-api gemini gemini-api gpt-4-turbo large-language-models

two-word-test's Introduction

Two Word Test

Combinatorial Semantics Benchmark for Large Language Models (LLMs)

Nicholas Riccardi, Xuan Yang, and Rutvik Desai - University of South Carolina Department of Psychology

LLMs sometimes struggle with word-order effects and compositional (or combinatorial) language processes, especially when surrounding context is absent. Here, we provide the Two Word Test, a series of functions that compares LLM meaningfulness judgments of simple two word phrases to meaningfulness judgments made by humans (Graves et al., 2013; https://doi.org/10.3758/s13428-012-0256-3).

We provide a variety of statistical methods to quantify LLM performance and compare it to human performance. We test OpenAI's GPT-4 and GPT-3.5-turbo, and Google's Bard. Briefly, we find that GPT-3.5 and Bard fail dramatically at judging the meaningfulness of simple two word phrases without context. GPT-4 performs substantially better, but still fails in certain circumstances, especially when asked to make continuous instead of binary judgments.

Using the Two Word Test

To gather LLM meaningfulness ratings, we used the prompts detailed in our manuscript (closely mirroring the prompt used by Graves et al., 2013 to gather human ratings, but providing more examples for the LLMs). We collected binary and continuous ratings for all models. two-word-test.ipynb can be run as-is, taking LLM_ratings.csv and graves_2013.csv as input. LLM_ratings.csv can be updated by adding the ratings from other LLMs, which must then be specified in the models list within two-word-test.ipynb. A description of each function's purpose and brief explanations of the statistical tests can be found within.

Comments and questions can be posted in the discussion, or emailed to [email protected]

Scripts

graves-gpt-api.ipynb

Sample query to GPT. Prompts and word lists can be edited to suit an experimenter's needs.

LLM_ratings.csv

Once ratings are collected from an LLM, they should be formatted identically to this document. This .csv is fed into two-word-test.ipynb

graves_2013.csv

Human ratings collected by Graves. Also includes similarity metrics for each phrase from some popular word embedding models. This .csv is also fed into two-word-test.ipynb

two-word-test.ipynb

The statistical tests used to compare LLM responses to humans. If changes are made to LLM_ratings.csv or graves_2013.csv, this script will have to be edited accordingly.

two-word-test's People

Contributors

nickriccardi avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.