
🔧 ToolTalk 💬

📄 Paper | 📫 Contact

Introducing ToolTalk, a benchmark for evaluating tool-augmented LLMs in a conversational setting.

Details

ToolTalk is designed to evaluate tool-augmented LLMs used as chatbots, an increasingly popular way for everyday users to harness the power of LLMs. ToolTalk contains a handcrafted dataset of 28 easy conversations and 50 hard conversations, annotated with ground-truth usage of 28 unique tools belonging to 7 themed "plugins".

Evaluation consists of prompting an LLM to predict the correct sequence of tools after every user utterance in a conversation. Thus, evaluating on a single conversation requires an LLM to correctly predict multiple sub-tasks. Predictions are compared against the ground truth to determine success for a single conversation.
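To make this all-turns-must-pass structure concrete, here is a toy sketch (not the repository's actual evaluator, which lives in tooltalk.evaluation; the real matching of tool calls and their arguments is more involved):

```python
from typing import List, Set, Tuple

# A tool call is modeled as (tool name, frozenset of argument items) purely
# for easy set comparison; this is NOT ToolTalk's internal representation.
ToolCall = Tuple[str, frozenset]

def turn_succeeds(predicted: Set[ToolCall], ground_truth: Set[ToolCall]) -> bool:
    """A turn counts as correct only if every ground-truth call was predicted."""
    return ground_truth <= predicted

def conversation_succeeds(turns: List[Tuple[Set[ToolCall], Set[ToolCall]]]) -> bool:
    """A conversation succeeds only if every one of its turns succeeds,
    which is why a single conversation bundles multiple sub-tasks."""
    return all(turn_succeeds(pred, gt) for pred, gt in turns)
```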

We evaluate two chatbots on ToolTalk, powered by gpt-3.5-turbo-0613 and gpt-4-0613 respectively, both implemented using OpenAI's chat completions API.

| Model   | Subset | Success rate | Precision | Recall | Incorrect action rate |
|---------|--------|--------------|-----------|--------|-----------------------|
| GPT-3.5 | Easy   | 85.7%        | 42.4%     | 89.3%  | 5.0%                  |
| GPT-4   | Easy   | 92.8%        | 69.2%     | 96.4%  | 3.8%                  |
| GPT-3.5 | Hard   | 26.0%        | 54.6%     | 69.7%  | 23.9%                 |
| GPT-4   | Hard   | 50.0%        | 74.9%     | 79.0%  | 25.1%                 |
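As a rough guide to these columns (our paraphrase; see the paper for the exact rules used to match predicted calls against ground truth): success rate is measured per conversation and requires every turn to be predicted correctly, while precision and recall are measured over individual tool calls, roughly as

```latex
\text{precision} = \frac{\#\,\text{correct predicted calls}}{\#\,\text{predicted calls}},
\qquad
\text{recall} = \frac{\#\,\text{correct predicted calls}}{\#\,\text{ground-truth calls}}
```

Incorrect action rate covers, roughly, mistaken calls to side-effecting "action" tools, which are costlier than failed lookups.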

Setup

ToolTalk can be set up using the following commands. Install the local package with dev dependencies to enable the unit tests.

```bash
pip install -r requirements.txt
pip install -e ".[dev]"
```

To verify that the installation was successful, run the unit tests.

```bash
pytest tests
```

Reproducing the results

The GPT-3.5-turbo and GPT-4 results can be reproduced with the following commands, which require access to OpenAI's API. Results are saved in the results folder. The scripts cache intermediate results, so they can be re-run if interrupted for any reason.

```bash
export OPENAI_API_KEY=<your key>
bash evaluate_gpt35turbo.sh
bash evaluate_gpt4.sh
```

Your results should look something like the numbers above; expect some variance, since both models are non-deterministic.

Generating scenarios

To generate new scenarios, you can use the following command.

```bash
python -m tooltalk.generation.scenario_generator --prompt src/prompts/scenario_template.md --output_dir output/scenarios
```

Evaluating on new models

The easiest way to evaluate new models is to create a new Predictor class that inherits from tooltalk.evaluation.tool_executor.BaseAPIPredictor. For examples, see tooltalk.evaluation.tool_executor.GPT3Predictor and tooltalk.evaluation.evaluate_openai.OpenAIPredictor.
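A minimal hypothetical sketch of such a subclass is below. The predict signature and the returned message format are assumptions modeled on the class names above, not verified against the repository; copy the actual interface from OpenAIPredictor.

```python
# Hypothetical sketch -- the predict() signature and return format are
# assumptions; mirror tooltalk.evaluation.evaluate_openai.OpenAIPredictor
# for the real interface.
from tooltalk.evaluation.tool_executor import BaseAPIPredictor


class EchoPredictor(BaseAPIPredictor):
    """Toy predictor that never calls a tool and always replies with text."""

    def __init__(self, functions):
        # Tool/function definitions, in whatever format your model expects.
        self.functions = functions

    def predict(self, metadata: dict, conversation_history: dict) -> dict:
        # A real predictor would render conversation_history and
        # self.functions into a prompt, call the model, and translate its
        # output into either a tool call or a plain assistant message.
        return {"role": "assistant", "text": "I cannot help with that."}
```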

Citing

```bibtex
@article{farn2023tooltalk,
  title={ToolTalk: Evaluating Tool-Usage in a Conversation Setting},
  author={Nicholas Farn and Richard Shin},
  year={2023},
  journal={arXiv preprint arXiv:2311.10775},
}
```

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
