
Comments (3)

doberst avatar doberst commented on May 14, 2024

@carlosandrea: thanks for your question. Here is a quick explanation of each of the fact checks that are being run and what they are testing:

  1. Numbers "Fact Check" - this reviews the LLM response output, looks for tokens that appear to be numeric values, and then checks the evidence provided for confirmation that each numeric value can be found in the source materials. If found, the status is "Confirmed" and the corresponding text snippet is provided. The function attempts to match the value even if it is formatted slightly differently, e.g., "$100.00" should match "100" or "100.00".

  2. Comparison Stats - this function tokenizes the LLM response output and looks up each of those tokens in the corresponding context evidence. "Confirmed words" are words in the LLM response that are found in the context evidence; "Unconfirmed words" are words in the LLM response that are not matched by the context passage. The percent displayed and the verified match are based on the words in the LLM response: if the response has three words and all three are found in the evidence, the score is 1.0, or 100%.

  3. Source Review - using a token comparison between the LLM response output and the context evidence, the function attempts to extract statistically high-matching text snippets that were the basis of the LLM's response.
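The three checks above can be sketched in a few lines of Python. This is a minimal illustration of the general token-matching approach described, not llmware's actual implementation; all function names, regexes, and thresholds here are assumptions for the sketch.

```python
import re


def numbers_fact_check(llm_response, evidence):
    """Sketch of a numbers fact check: pull number-like tokens from the
    response and confirm each appears in the evidence, normalizing
    formatting so that e.g. '$100.00' matches '100'."""
    results = []
    evidence_values = [float(m.replace(",", ""))
                       for m in re.findall(r"\d[\d,]*(?:\.\d+)?", evidence)]
    for token in re.findall(r"[\$€£]?\d[\d,]*(?:\.\d+)?%?", llm_response):
        # normalize: strip currency/percent symbols and thousands separators
        value = float(token.strip("$€£%").replace(",", ""))
        status = "Confirmed" if value in evidence_values else "Not Confirmed"
        results.append({"token": token, "status": status})
    return results


def comparison_stats(llm_response, evidence):
    """Sketch of comparison stats: the share of response words that are
    found in the context evidence."""
    response_words = re.findall(r"\w+", llm_response.lower())
    evidence_words = set(re.findall(r"\w+", evidence.lower()))
    confirmed = [w for w in response_words if w in evidence_words]
    unconfirmed = [w for w in response_words if w not in evidence_words]
    pct = len(confirmed) / len(response_words) if response_words else 0.0
    return {"confirmed": confirmed, "unconfirmed": unconfirmed, "percent": pct}


def source_review(llm_response, evidence, window=10, threshold=0.5):
    """Sketch of source review: slide a window over the evidence and keep
    snippets whose word overlap with the response exceeds a threshold."""
    response_words = set(re.findall(r"\w+", llm_response.lower()))
    tokens = evidence.split()
    snippets = []
    for i in range(max(1, len(tokens) - window + 1)):
        chunk = tokens[i:i + window]
        if not chunk:
            continue
        overlap = sum(1 for t in chunk
                      if re.sub(r"\W", "", t.lower()) in response_words)
        if overlap / len(chunk) >= threshold:
            snippets.append(" ".join(chunk))
    return snippets
```

For example, `comparison_stats("total revenue was 100", "The total revenue was $100.00 in 2020")` would report all four response words as confirmed, i.e. a percent of 1.0.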

Hope this clarifies how the functions are intended to work.

Appreciate you sharing the individual screenshot responses -> we will review each of them in detail and get back to you...

from llmware.

carlosandrea avatar carlosandrea commented on May 14, 2024

@doberst, thanks for clarifying how to interpret those stats. It's very clear now; I really appreciate it!

Just one last question: in the Medium article, how do you evaluate whether an answer is correct when extracting the stats below? Do you do it manually, or with another language model? I'm under the impression that Numbers Fact Check, Comparison Stats, and Source Review alone may not be sufficient for assessing the accuracy of an answer.

Evaluated against the benchmark test: RAG-Instruct-Benchmark-Tester
Average of 2 test runs, with 1 point for a correct answer, 0.5 points for a partially correct answer or a blank / "not found" (NF) response, 0.0 points for an incorrect answer, and -1 point for a hallucination.

  • Accuracy Score: 84.50 correct out of 100
  • Not Found Classification: 20.0%
  • Boolean: 66.25%
  • Math/Logic: 9.4%
  • Complex Questions (1–5): 1 (Low)
  • Summarization Quality (1–5): 3 (Coherent, extractive)
  • Hallucinations: No hallucinations observed in test runs.
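The point scheme above can be expressed as a small scoring function. This is a minimal sketch of the arithmetic only; the label names and function names are illustrative, and the actual labels come from a manual review of each answer.

```python
def score_run(labels):
    """Sum points for one test run: 1 for correct, 0.5 for partially
    correct or blank / not found, 0 for incorrect, -1 for hallucination."""
    points = {"correct": 1.0, "partial": 0.5, "not_found": 0.5,
              "incorrect": 0.0, "hallucination": -1.0}
    return sum(points[label] for label in labels)


def accuracy_score(run1_labels, run2_labels):
    """Average of two test runs over the 100-question benchmark,
    reported as 'X correct out of 100'."""
    return (score_run(run1_labels) + score_run(run2_labels)) / 2
```

Under this scheme, a run of 100 correct answers averaged with a run of 90 correct and 10 incorrect answers would score 95.0 out of 100.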


doberst avatar doberst commented on May 14, 2024

@carlosandrea - yes, to answer (belatedly): we did a manual review to prepare the test scores for each model. The tools help accelerate the process, but in preparing the scores for the Dragon and BLING models we supplemented them with a manual review. If you look in the files for each model (in the corresponding HuggingFace repository), you will see the actual scoring sheet and results for each model.

