
Comments (3)

doberst avatar doberst commented on May 14, 2024

@carlosandrea: thanks for your question. Here is a quick explanation of each of the fact checks that are being run and what they are testing:

  1. Numbers "Fact Check" - this reviews the LLM response output, looks for tokens that appear to be numeric values, and then checks the evidence provided for confirmation that each numeric value can be found in the source materials. If found, the status is "Confirmed" and the corresponding text snippet is provided. The function attempts to match the value even if it is formatted slightly differently, e.g., "$100.00" should match "100" or "100.00".

  2. Comparison Stats - this function tokenizes the LLM response output and looks up each of those tokens in the corresponding context evidence. "Confirmed words" are words in the LLM response that are found in the context evidence; "Unconfirmed words" are words in the LLM response that are not matched by the context passage. The percent displayed and the verified match are based on the words in the LLM response: if the response has three words and all three are found in the evidence, the score is 1.0, or 100%.

  3. Source Review - using a token comparison between the LLM response output and the context evidence, the function attempts to extract statistically high-matching text snippets that were the basis of the LLM's response.
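The three checks above can be sketched in a few lines of Python. This is a minimal illustration of the general token-matching approach described, not llmware's actual implementation; all function names, regexes, and thresholds here are assumptions for the sketch.

```python
import re


def numbers_fact_check(llm_response, evidence):
    """Sketch of a numbers fact check: pull number-like tokens from the
    response and confirm each appears in the evidence, normalizing
    formatting so that e.g. '$100.00' matches '100'."""
    results = []
    evidence_values = [float(m.replace(",", ""))
                       for m in re.findall(r"\d[\d,]*(?:\.\d+)?", evidence)]
    for token in re.findall(r"[\$€£]?\d[\d,]*(?:\.\d+)?%?", llm_response):
        # normalize: strip currency/percent symbols and thousands separators
        value = float(token.strip("$€£%").replace(",", ""))
        status = "Confirmed" if value in evidence_values else "Not Confirmed"
        results.append({"token": token, "status": status})
    return results


def comparison_stats(llm_response, evidence):
    """Sketch of comparison stats: the share of response words that are
    found in the context evidence."""
    response_words = re.findall(r"\w+", llm_response.lower())
    evidence_words = set(re.findall(r"\w+", evidence.lower()))
    confirmed = [w for w in response_words if w in evidence_words]
    unconfirmed = [w for w in response_words if w not in evidence_words]
    pct = len(confirmed) / len(response_words) if response_words else 0.0
    return {"confirmed": confirmed, "unconfirmed": unconfirmed, "percent": pct}


def source_review(llm_response, evidence, window=10, threshold=0.5):
    """Sketch of source review: slide a window over the evidence and keep
    snippets whose word overlap with the response exceeds a threshold."""
    response_words = set(re.findall(r"\w+", llm_response.lower()))
    tokens = evidence.split()
    snippets = []
    for i in range(max(1, len(tokens) - window + 1)):
        chunk = tokens[i:i + window]
        if not chunk:
            continue
        overlap = sum(1 for t in chunk
                      if re.sub(r"\W", "", t.lower()) in response_words)
        if overlap / len(chunk) >= threshold:
            snippets.append(" ".join(chunk))
    return snippets
```

For example, `comparison_stats("total revenue was 100", "The total revenue was $100.00 in 2020")` would report all four response words as confirmed, i.e. a percent of 1.0.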

Hope this clarifies how the functions are intended to work.

Appreciate you sharing the individual screenshot responses -> we will review each of them in detail and get back to you...

from llmware.

carlosandrea avatar carlosandrea commented on May 14, 2024

@doberst, thanks for clarifying how to interpret those stats. It's very clear now; I really appreciate it!

Just one last question: in the Medium article, how do you evaluate whether an answer is correct when extracting the stats below? Do you do it manually, or with another language model? I'm under the impression that Numbers Fact Check, Comparison Stats, and Source Review alone may not be sufficient for assessing the accuracy of an answer.

Evaluated against the benchmark test: RAG-Instruct-Benchmark-Tester
Average of 2 test runs, with 1 point for a correct answer, 0.5 points for a partially correct answer or a blank / "not found" (NF) response, 0.0 points for an incorrect answer, and -1 point for a hallucination.

  • Accuracy Score: 84.50 correct out of 100
  • Not Found Classification: 20.0%
  • Boolean: 66.25%
  • Math/Logic: 9.4%
  • Complex Questions (1–5): 1 (Low)
  • Summarization Quality (1–5): 3 (Coherent, extractive)
  • Hallucinations: No hallucinations observed in test runs.
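The point scheme above can be expressed as a small scoring function. This is a minimal sketch of the arithmetic only; the label names and function names are illustrative, and the actual labels come from a manual review of each answer.

```python
def score_run(labels):
    """Sum points for one test run: 1 for correct, 0.5 for partially
    correct or blank / not found, 0 for incorrect, -1 for hallucination."""
    points = {"correct": 1.0, "partial": 0.5, "not_found": 0.5,
              "incorrect": 0.0, "hallucination": -1.0}
    return sum(points[label] for label in labels)


def accuracy_score(run1_labels, run2_labels):
    """Average of two test runs over the 100-question benchmark,
    reported as 'X correct out of 100'."""
    return (score_run(run1_labels) + score_run(run2_labels)) / 2
```

Under this scheme, a run of 100 correct answers averaged with a run of 90 correct and 10 incorrect answers would score 95.0 out of 100.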


doberst avatar doberst commented on May 14, 2024

@carlosandrea - yes, to answer (belatedly): we did a manual review to prepare the test scores for each model. The tools help accelerate the process, but in preparing the scores for the Dragon and BLING models we supplemented them with a manual review. If you look in the files for each model (in the corresponding HuggingFace repository), you will see the actual scoring sheet and results for each model.

