GithubHelp home page GithubHelp logo

dagelf / llmtest_needleinahaystack Goto Github PK

View Code? Open in Web Editor NEW

This project forked from gkamradt/llmtest_needleinahaystack

0.0 0.0 0.0 1.65 MB

Doing simple retrieval from LLM models at various context lengths to measure accuracy

License: Other

Python 12.90% Jupyter Notebook 87.10%

llmtest_needleinahaystack's Introduction

Needle In A Haystack - Pressure Testing LLMs

Supported model providers: OpenAI, Anthropic

A simple 'needle in a haystack' analysis to test in-context retrieval ability of long context LLMs.

Get the behind the scenes on the overview video.

GPT-4-128 Context Testing

git clone https://github.com/gkamradt/LLMTest_NeedleInAHaystack.git

The Test

  1. Place a random fact or statement (the 'needle') in the middle of a long context window (the 'haystack')
  2. Ask the model to retrieve this statement
  3. Iterate over various document depths (where the needle is placed) and context lengths to measure performance

This is the code that backed this OpenAI and Anthropic analysis.

If ran and save_results = True, then this script will populate a result/ directory with evaluation information. Due to potential concurrent requests each new test will be saved as a few file.

I've put the results from the original tests in /original_results. I've upgraded the script since those test were ran so the data formats may not match your script results.

The key parameters:

  • needle - The statement or fact which will be placed in your context ('haystack')
  • haystack_dir - The directory which contains the text files to load as background context. Only text files are supported
  • retrieval_question - The question with which to retrieve your needle in the background context
  • results_version - You may want to run your test multiple times for the same combination of length/depth, change the version number if so
  • context_lengths_min - The starting point of your context lengths list to iterate
  • context_lengths_max - The ending point of your context lengths list to iterate
  • context_lengths_num_intervals - The number of intervals between your min/max to iterate through
  • document_depth_percent_min - The starting point of your document depths. Should be int > 0
  • document_depth_percent_max - The ending point of your document depths. Should be int < 100
  • document_depth_percent_intervals - The number of iterations to do between your min/max points
  • document_depth_percent_interval_type - Determines the distribution of depths to iterate over. 'linear' or 'sigmoid
  • model_provider - 'OpenAI' or 'Anthropic'
  • model_name - The name of the model you'd like to test. Should match the exact value which needs to be passed to the api. Ex: gpt-4-1106-preview
  • save_results - Whether or not you'd like to save your results to file. They will be temporarily saved in the object regardless. True/False
  • save_contexts - Whether or not you'd like to save your contexts to file. Warning these will get very long. True/False

Other Parameters:

  • context_lengths - A custom set of context lengths. This will override the values set for context_lengths_min, max, and intervals if set
  • document_depth_percents - A custom set of document depths lengths. This will override the values set for document_depth_percent_min, max, and intervals if set
  • openai_api_key - Must be supplied. GPT-4 is used for evaluation. Can either be passed when creating the object or an environment variable
  • anthropic_api_key - Only needed if testing Anthropic models. Can either be passed when creating the object or an environment variable
  • num_concurrent_requests - Default: 1. Set higher if you'd like to run more requests in parallel. Keep in mind rate limits.
  • final_context_length_buffer - The amount of context to take off each input to account for system messages and output tokens. This can be more intelligent but using a static value for now. Default 200 tokens.
  • seconds_to_sleep_between_completions - Default: None, set # of seconds if you'd like to slow down your requests
  • print_ongoing_status - Default: True, whether or not to print the status of test as they complete

Results Visualization

LLMNeedleInHaystackVisualization.ipynb holds the code to make the pivot table visualization. The pivot table was then transferred to Google Slides for custom annotations and formatting. See the google slides version. See an overview of how this viz was created here.

OpenAI's GPT-4-128K (Run 11/8/2023)

GPT-4-128 Context Testing

Anthropic's Claude 2.1 (Run 11/21/2023)

GPT-4-128 Context Testing

License

This project is licensed under the MIT License - see the LICENSE file for details. Use of this software requires attribution to the original author and project, as detailed in the license.

llmtest_needleinahaystack's People

Contributors

gkamradt avatar eltociear avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.