Comments (3)
@carlosandrea: thanks for your question. Here is a quick explanation of each of the fact checks that are being run and what they are testing:
- Numbers Fact Check: reviews the LLM response output, looks for tokens that appear to be numeric values, and then checks the evidence provided for confirmation that each value can be found in the source materials. If found, the status is "Confirmed" and the corresponding text snippet is provided. The function attempts to match values even if they are formatted slightly differently, e.g., "$100.00" should match "100" or "100.00".
- Comparison Stats: tokenizes the LLM response output and looks for each of those tokens in the corresponding context evidence. "Confirmed words" are words in the LLM response that are found in the context evidence; "unconfirmed words" are words in the LLM response that are not matched by the context passage. The percent display and verified match are based on the words in the LLM response, so if the LLM response has three words and all three are found in the evidence, the result is 1.0, i.e., 100%.
- Source Review: using a token comparison between the LLM response output and the context evidence, attempts to extract the statistically high-matching text snippets that were the basis of the LLM's response. A rough sketch of all three checks follows this list.
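To make the mechanics concrete, here is a minimal, self-contained sketch of how checks in this style can be implemented. The function names, regex, window size, and threshold are illustrative assumptions for this comment, not llmware's actual implementation:

```python
import re

# Matches number-like tokens such as "$100.00", "1,250", "84.5%" (illustrative regex)
NUMBER_PATTERN = re.compile(r"[-+]?\$?\d[\d,]*(?:\.\d+)?%?")

def normalize_number(token: str) -> float:
    # "$100.00" -> 100.0, "1,250" -> 1250.0, "84.5%" -> 84.5
    return float(token.replace("$", "").replace(",", "").rstrip("%"))

def numbers_fact_check(llm_response: str, evidence: str) -> list[dict]:
    # Confirm each numeric value in the response against values found in the evidence
    evidence_values = {normalize_number(t) for t in NUMBER_PATTERN.findall(evidence)}
    results = []
    for token in NUMBER_PATTERN.findall(llm_response):
        confirmed = normalize_number(token) in evidence_values
        results.append({"token": token, "status": "Confirmed" if confirmed else "Not Confirmed"})
    return results

def comparison_stats(llm_response: str, evidence: str) -> dict:
    # Word-level overlap between the response and the evidence
    words = re.findall(r"[a-z0-9]+", llm_response.lower())
    evidence_vocab = set(re.findall(r"[a-z0-9]+", evidence.lower()))
    confirmed = [w for w in words if w in evidence_vocab]
    unconfirmed = [w for w in words if w not in evidence_vocab]
    # Percent is computed over the words in the LLM response, as described above
    percent = len(confirmed) / len(words) if words else 0.0
    return {"confirmed_words": confirmed, "unconfirmed_words": unconfirmed, "percent": percent}

def source_review(llm_response: str, evidence: str,
                  window: int = 25, min_overlap: float = 0.5) -> list[str]:
    # Slide a window across the evidence and keep spans whose tokens heavily
    # overlap the response - a crude high-matching snippet extractor
    response_vocab = set(re.findall(r"[a-z0-9]+", llm_response.lower()))
    tokens = evidence.split()
    snippets = []
    for i in range(0, max(1, len(tokens) - window + 1), max(1, window // 2)):
        chunk = tokens[i:i + window]
        hits = sum(1 for t in chunk if re.sub(r"[^a-z0-9]", "", t.lower()) in response_vocab)
        if chunk and hits / len(chunk) >= min_overlap:
            snippets.append(" ".join(chunk))
    return snippets

if __name__ == "__main__":
    evidence = "Total revenue in Q4 was $100.00 million, up 12% year over year."
    response = "Revenue was 100 million, up 12%."
    print(numbers_fact_check(response, evidence))  # both numbers "Confirmed"
    print(comparison_stats(response, evidence))    # percent = 1.0
```

The real functions handle tokenization, punctuation, and number formats more carefully, but the flow is the same: normalize, compare against the evidence, and report what is (and is not) confirmed.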
Hope this clarifies how the functions are intended to work.
Appreciate you sharing the individual screenshot responses -> we will review each of them in detail and follow up...
@doberst, thanks for clarifying how to interpret those stats. It's very clear now; I really appreciate it!
Just one last question: in the Medium article, how did you evaluate whether the answers were correct in order to extract the stats below? Did you do it manually, or with another language model? I'm under the impression that Numbers Fact Check, Comparison Stats, and Source Review alone may not be sufficient for assessing the accuracy of the answer.
Evaluated against the benchmark test: RAG-Instruct-Benchmark-Tester
Average of 2 test runs, with 1 point for a correct answer, 0.5 points for a partially correct or blank / NF (not found) answer, 0.0 points for an incorrect answer, and -1 point for a hallucination (this rubric is sketched as code after the stats below).
- Accuracy Score: 84.50 correct out of 100
- Not Found Classification: 20.0%
- Boolean: 66.25%
- Math/Logic: 9.4%
- Complex Questions (1–5): 1 (Low)
- Summarization Quality (1–5): 3 (Coherent, extractive)
- Hallucinations: No hallucinations observed in test runs.
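For reference, the scoring rule quoted above can be stated as a small function; this is an illustrative sketch of the rubric as described, not the actual scoring script:

```python
# Illustrative sketch of the rubric described above (not llmware's scoring code)
POINTS = {
    "correct": 1.0,
    "partial": 0.5,        # partially correct answer
    "not_found": 0.5,      # blank / "not found" response
    "incorrect": 0.0,
    "hallucination": -1.0,
}

def run_score(labels: list[str]) -> float:
    # One label per benchmark question; total points for a single test run
    return sum(POINTS[label] for label in labels)

def benchmark_score(runs: list[list[str]]) -> float:
    # Average over test runs, e.g., 2 runs on the 100-question benchmark
    return sum(run_score(r) for r in runs) / len(runs)
```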
@carlosandrea - yes, to answer (belatedly): we did a manual review to prepare the test scores for each model. The tools are useful for accelerating the review, but in preparing the scores for the Dragon and BLING models, we supplemented them with a manual pass. If you look in the files for each model (in the corresponding HuggingFace repository), you will see the actual scoring sheet and results for each model.