
LangChain, Llama2-Chat, and zero- and few-shot prompting are used to generate synthetic datasets for IR and RAG system evaluation

License: MIT License

Jupyter Notebook 96.03% Python 3.97%
information-retrieval llama2 llm nlp prompts-template rag retrieval-augmented-generation few-shot langchain prompt prompt-engineering question-answering


Synthetic Data Generation using LangChain for IR and RAG Evaluation

This repository demonstrates LangChain, Llama2-Chat, and zero- and few-shot prompt engineering to enable synthetic data generation for Information Retrieval (IR) and Retrieval Augmented Generation (RAG) evaluation.

Introduction  •  Highlights  •  Example Notebooks  •  Background  •  Metrics  •  Benefits  •  Prompt Templates  •  Issues  •  TODOs

Introduction

Large language models (LLMs) have transformed Information Retrieval (IR) and search by comprehending complex queries. This repository showcases concepts and packages that can be used to generate sophisticated synthetic datasets for IR and Retrieval Augmented Generation (RAG) evaluation.

Each synthetic example consists of a query and an answer generated for a given context. A synthetically generated context-query-answer triplet is shown below:

Provided Context (usually split from documents / text sources): 
Pure TalkUSA is an American mobile virtual network operator headquartered in Covington, Georgia, United States. 
It is most notable for an industry-first offering of rollover data in their data add-on packages, which has since been discontinued. 
Pure TalkUSA is a subsidiary of Telrite Corporation. Bring Your Own Phone! 

Synthetically Generated Query: 
What was the outstanding service offered by Pure TalkUSA?

Synthetically Generated Answer:
The outstanding service from Pure TalkUSA was its industry-first offering of rollover data.

When building an IR or RAG system, a dataset of contexts, queries, and answers is vital for evaluating the system's performance. Human-annotated datasets offer excellent ground truths but can be expensive and challenging to obtain; therefore, synthetic datasets generated using LLMs are an attractive solution and supplement.

By employing LLM prompt engineering, a diverse range of synthetic queries and answers can be generated to form a robust validation dataset. This repository showcases a process to generate synthetic data while emphasizing zero- and few-shot prompting for creating highly customizable synthetic datasets. Figure 1 outlines the synthetic dataset generation process demonstrated in this repository.


Figure 1: Synthetic Data Generation for IR and RAG Evaluation

NOTE: Refer to the Background and Metrics sections for a deeper dive on IR, RAG, and how to evaluate these systems.

Highlights

A few of the key highlights of this repository are:

  • Local LLMs on consumer-grade hardware are used exclusively throughout, and no external API calls are performed. This is paramount for data privacy. Many online examples instead rely on external API calls to State-of-the-Art (SOTA) LLMs, which generally produce higher-quality results than smaller local models. Working with local models therefore introduces extra coding and error-handling challenges, and solutions to these are shown here.
  • Zero- and Few-Shot Prompting for highly customizable query and answer generation are presented.
  • LangChain examples using:
    • Custom prompt engineering,
    • Output parsers and auto-fixing parsers to obtain structured data,
    • Batch GPU inference with chains,
    • LangChain Expression Language (LCEL).
  • Quantization for reducing model size to fit onto consumer-grade hardware (a minimal loading sketch follows this list).
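The snippet below is a minimal sketch of the quantization highlight above; it assumes the Llama-2-7b-chat-hf weights are available locally or via the Hugging Face Hub, and mirrors the bitsandbytes 4-bit settings that appear in the notebooks.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Path or Hub id for the chat model (assumption: adjust to your local setup)
model_id = "meta-llama/Llama-2-7b-chat-hf"

# 4-bit NF4 quantization so a 7B model fits on a consumer-grade GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model.eval()  # inference only; no gradients are needed for data generation
```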

Example Notebooks

Context-Query-Answer Generation with LangChain

1.) LangChain with Custom Prompts and Output Parsers for Structured Data Output: see gen-question-answer-query.ipynb for an example of synthetic context-query-answer data generation using custom prompts and output parsers; a minimal sketch of the output-parser approach is shown below.
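The snippet below is a minimal sketch (not the notebook's exact code) of pairing a prompt template with a PydanticOutputParser so the LLM returns structured query-answer records; the class name, field names, and prompt wording are illustrative assumptions, and import paths may differ across LangChain versions.

```python
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Illustrative schema for one synthetic record (names are assumptions)
class QueryAnswer(BaseModel):
    question: str = Field(description="question generated from the context")
    answer: str = Field(description="answer to the question, grounded in the context")

parser = PydanticOutputParser(pydantic_object=QueryAnswer)

prompt = PromptTemplate(
    template=(
        "Generate one question and its answer from the context below.\n"
        "{format_instructions}\n"
        "Context: {context}\n"
    ),
    input_variables=["context"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# With a local Llama2 LLM object, an LCEL chain could look like:
# chain = prompt | llm | parser
# OutputFixingParser.from_llm(parser=parser, llm=llm) can auto-repair malformed output.
```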

Context-Query Generation with LangChain

1.) LangChain Custom Llama2-Chat Prompting: See qa-gen-query-langchain.ipynb for an example of how to build LangChain custom prompt templates for context-query generation.

Context-Query Generation without LangChain

1.) Zero- and Few-Shot Prompt Engineering: See qa-gen-query.ipynb for an example of synthetic context-query data generation for custom datasets. Key features presented here are:

  • Prompting LLMs using zero- and few-shot annotations on the SquadV2 question-answering dataset.
  • Two prompting techniques are demonstrated:
    • Basic zero-shot query generation, referred to as vanilla
    • Few-shot prompting Guided by Bad Questions (GBQ); a minimal prompt sketch is shown after this list

2.) Context-Argument: See argument-gen-query.ipynb for examples of synthetic context-query data generation for argument retrieval tasks. In information retrieval, these tasks are designed to retrieve relevant arguments from sources such as documents. The goal of argument retrieval is to provide users with persuasive and credible information to support their arguments or make informed decisions.
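The prompt below is a hedged illustration of the few-shot Guided by Bad Questions (GBQ) technique listed above; the wording and examples are assumptions, not the notebook's exact prompt.

```python
# Illustrative few-shot GBQ prompt (wording is an assumption)
gbq_prompt = """Write one good search query for the given context.

Context: Pure TalkUSA is an American mobile virtual network operator headquartered in Covington, Georgia.
Bad question: What is Pure TalkUSA?
Good question: What industry-first offering made Pure TalkUSA notable?

Context: {context}
Bad question: {bad_question}
Good question:"""

print(gbq_prompt.format(
    context="Telrite Corporation is the parent company of Pure TalkUSA.",
    bad_question="What is Telrite?",
))
```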

Non-Llama Query Generation

Other examples of query-specific generation models (e.g., BeIR/query-gen-msmarco-t5-base-v1) can readily be found online (see BEIR Question Generation); a minimal usage sketch is shown below.
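For comparison, here is a minimal sketch of generating queries with the BeIR/query-gen-msmarco-t5-base-v1 model via Hugging Face transformers; the sampling settings are assumptions.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "BeIR/query-gen-msmarco-t5-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

context = "Pure TalkUSA is an American mobile virtual network operator headquartered in Covington, Georgia."
inputs = tokenizer(context, return_tensors="pt", truncation=True)

# Sample a few candidate queries for the passage
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=3,
)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```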

Background

The primary function of an IR system is retrieval, which aims to determine the relevance between a user's query and the content to be retrieved. Implementing an IR or RAG system demands user-specific documents; however, custom document collections typically lack annotated datasets, which hampers system evaluation. Figure 2 provides an overview of a typical RAG process for a question-answering system.


Figure 2: RAG process overview [Source].

These synthetic context-query-answer datasets are crucial for evaluating: 1) the IR system's ability to select the enhanced context, as illustrated in Figure 2 - Step #3, and 2) the RAG system's generated response, as shown in Figure 2 - Step #5. By enabling offline evaluation, they allow a thorough analysis of the system's balance between speed and accuracy, informing necessary revisions and the selection of champion system designs.

The design of IR and RAG systems is becoming more complicated, as referenced in Figure 3.

Figure 3: LLMs can be used as the query rewriter, retriever, reranker, and reader [Source]

As shown, there are several considerations in IR / RAG design, and solutions range in complexity from traditional methods (e.g., term-based sparse retrieval) to neural methods (e.g., embeddings and LLMs). Evaluation of these systems is critical to making well-informed design decisions. From search to recommendations, evaluation measures are paramount to understanding what does and does not work in retrieval.

Metrics

Question-Answering (QA) systems (e.g., a RAG system) have two components:

  1. Retriever - which retrieves the most relevant information needed to answer the query
  2. Generator - which generates the answer with the retrieved information.

When evaluating a QA system, both components need to be evaluated separately and together to get an overall system score.

Whenever a question is asked to a RAG application, the following objects can be considered [Source]:

  • The question
  • The correct answer to the question
  • The answer that the RAG application returned
  • The context that the RAG application retrieved and used to answer the question

The selection of metrics is not a primary focus of this repository since metrics are application-dependent; however, reference articles and information are provided for convenience.

Retriever Metrics

Figure 4 shows common evaluation metrics for IR; the dataset from Figure 1 can be used to compute the offline metrics shown in Figure 4.

Figure 4: Ranking evaluation metrics [Source]

Offline metrics are measured in an isolated environment before deploying a new IR system. They look at whether a particular set of relevant results is returned when retrieving items with the system [Source]; a small example of computing such metrics is shown below.
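As a small illustration (not from the repository) of the offline metrics in Figure 4, the helper below computes Recall@k and MRR@k for a single query from a ranked list of document ids and the set of relevant ids:

```python
def recall_and_mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Recall@k and MRR@k for one query given ranked and relevant document ids."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]
    recall = len(hits) / max(len(relevant_ids), 1)
    mrr = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            mrr = 1.0 / rank  # reciprocal rank of the first relevant hit
            break
    return recall, mrr

# Example: the gold context (doc 7) was ranked third by the retriever
print(recall_and_mrr_at_k([2, 5, 7, 9], {7}, k=10))  # -> (1.0, 0.333...)
```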

Generator Metrics

A brief review of generator metrics will showcase a few tiers of metric complexity. When evaluating the generator, look at whether, or to what extent, the selected answer passages match the correct answer or answers.

Provided below are generator metrics listed in order of least to most complex.

  • Traditional: metrics such as F1, Accuracy, Exact Match, ROUGE, BLEU, etc. can be computed, but they correlate poorly with human judgement; however, they offer simple and quick quantitative comparisons.
  • Semantic Answer Similarity: trained encoder models (e.g., SAS, BERT, and other models available on Sentence-Transformers) that return similarity scores; a minimal sketch follows this list.
  • Using LLMs to evaluate themselves: this is the inner working of popular RAG evaluation packages like Ragas and TonicAI/tvalmetrics.
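A minimal sketch of the Semantic Answer Similarity idea from the list above, using a Sentence-Transformers bi-encoder (the model choice is an assumption):

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model can be used; this lightweight choice is an assumption
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

gold = "The outstanding service from Pure TalkUSA was its industry-first offering of rollover data."
generated = "Pure TalkUSA was notable for being first in the industry to offer rollover data."

embeddings = model.encode([gold, generated], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine similarity
print(f"semantic similarity: {score:.3f}")
```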

Please refer to the articles Deepset: Metrics to Evaluate a Question Answering System and Evaluating RAG pipelines with Ragas + LangSmith, which elaborate on these metrics.

Benefits

A few key benefits of synthetic data generation with LLM prompt engineering are:

  • Customized IR Task Query Generation: Prompting LLMs offers great flexibility in the types of queries that can be generated. This is helpful because IR tasks vary in their application. For example, Benchmarking-IR (BEIR) is a heterogeneous benchmark containing diverse IR tasks such as question-answering, argument or counter-argument retrieval, fact checking, etc. Because of this diversity, LLM prompting excels: the prompt can be tailored to generate synthetic data matched to the IR task. Figure 5 shows an overview of the diverse IR tasks and datasets in BEIR. Refer to the BEIR leaderboard to see the performance of NLP-based retrieval models.

Figure 5: BEIR benchmark datasets and IR tasks. Image taken from [Source]

  • Zero or Few-Shot Annotations: In a technique referred to as zero or few-shot prompting, developers can provide domain-specific example queries to LLMs, greatly enhancing query generation. This approach often requires only a handful of annotated samples.
  • Longer Context Length: GPT-based LLMs, like Llama2, provide extended context lengths (up to 4,096 tokens compared to BERT's 512 tokens). The longer context improves document parsing and query generation control.

Prompt Templates

Llama2 is used in this repository for generating synthetic queries because it can be run locally on consumer-grade GPUs. Shown below is the prompt template for Llama2-Chat, which was fine-tuned for dialogue and instruction applications.

<s>[INST] <<SYS>>
{your_system_message}
<</SYS>>

{user_message_1} [/INST]
  • System Prompt: One of the unsung advantages of open-access models is that you have full control over the system prompt (<<SYS>>) in chat applications. This is essential for specifying the behavior of your chat assistant (and even imbuing it with some personality), but it is unreachable in models served behind APIs [Source].
  • User Message: The query or message provided by the user. The [INST] and [/INST] markers identify what was typed by the user so Llama knows how to respond properly. Without these markers around the user text, Llama may get confused about whose turn it is to reply.

Note that base Llama2 models have no prompt structure because they are raw non-instruct tuned models [Source].
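As a hedged sketch, the Llama2-Chat format above can be wrapped in a LangChain PromptTemplate so the system and user messages are filled in per context; the variable names below are assumptions.

```python
from langchain.prompts import PromptTemplate

# Llama2-Chat template with placeholders for the system and user messages
llama2_template = """<s>[INST] <<SYS>>
{system_message}
<</SYS>>

{user_message} [/INST]"""

prompt = PromptTemplate(
    template=llama2_template,
    input_variables=["system_message", "user_message"],
)

print(prompt.format(
    system_message="You write one concise search query for the provided context.",
    user_message="Context: Pure TalkUSA is an American mobile virtual network operator.",
))
```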

Additional resources and references to help with prompting techniques and basics:

Issues

This repository will be maintained on a best-effort basis. If you face any issues or want to make improvements, please raise an Issue or submit a Pull Request. 😃

TODOs

  • DeepSpeed ZeRO-Inference: offload massive LLM weights to non-GPU resources to run 70B+ models on consumer-grade hardware.
  • Feel free to raise an Issue for a feature you would like to see added.

Liked the work? Please give a star!


langchain-syndata-rag-eval's Issues

Squad_v2 code in python

Hi, I tried to run the "Load Squad_v2 Data" cell in the "Generate Synthetic Context-Query-Answer Results using Batch GPU Inference" section of gen-question-answer-langchain.ipynb and got a FileNotFoundError.
`{
"name": "FileNotFoundError",
"message": "Couldn't find a dataset script at /code/Lit-llama/LangChain-SynData-RAG-Eval/data/squad_v2/squad_v2.py or any data file in the same directory.",
"stack": "---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[12], line 15
12 paths = SimpleNamespace(**paths)
14 # Load squad_v2 data locally from disk
---> 15 df = load_dataset(str(paths.base_dir / paths.squad_data),
16 split='train').to_pandas()
18 # Remove redundant context
19 df = df.drop_duplicates(subset=['context', 'title']).reset_index(drop=True)

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/datasets/load.py:2556, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
2551 verification_mode = VerificationMode(
2552 (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
2553 )
2555 # Create a dataset builder
-> 2556 builder_instance = load_dataset_builder(
2557 path=path,
2558 name=name,
2559 data_dir=data_dir,
2560 data_files=data_files,
2561 cache_dir=cache_dir,
2562 features=features,
2563 download_config=download_config,
2564 download_mode=download_mode,
2565 revision=revision,
2566 token=token,
2567 storage_options=storage_options,
2568 trust_remote_code=trust_remote_code,
2569 _require_default_config_name=name is None,
2570 **config_kwargs,
2571 )
2573 # Return iterable dataset in case of streaming
2574 if streaming:

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/datasets/load.py:2228, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)
2226 download_config = download_config.copy() if download_config else DownloadConfig()
2227 download_config.storage_options.update(storage_options)
-> 2228 dataset_module = dataset_module_factory(
2229 path,
2230 revision=revision,
2231 download_config=download_config,
2232 download_mode=download_mode,
2233 data_dir=data_dir,
2234 data_files=data_files,
2235 cache_dir=cache_dir,
2236 trust_remote_code=trust_remote_code,
2237 _require_default_config_name=_require_default_config_name,
2238 _require_custom_configs=bool(config_kwargs),
2239 )
2240 # Get dataset builder class from the processing script
2241 builder_kwargs = dataset_module.builder_kwargs

File ~/miniconda3/envs/llama/lib/python3.10/site-packages/datasets/load.py:1881, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
1879 raise e1 from None
1880 else:
-> 1881 raise FileNotFoundError(
1882 f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory."
1883 )

FileNotFoundError: Couldn't find a dataset script at /code/Lit-llama/LangChain-SynData-RAG-Eval/data/squad_v2/squad_v2.py or any data file in the same directory."
}`
Is there a Python script called "squad_v2.py" that downloads the squad_v2 dataset into the local directory?

Hugging Face model does not exist

Hi. I previously tried to run the notebook you provided, but I encountered the following error when downloading the model:

`{
"name": "OSError",
"message": "Can't load the configuration of '/nvme4tb/Projects/llama2_models/Llama-2-7b-chat-hf'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/nvme4tb/Projects/llama2_models/Llama-2-7b-chat-hf' is the correct path to a directory containing a config.json file",
"stack": "---------------------------------------------------------------------------
HFValidationError Traceback (most recent call last)
File ~/miniconda3/envs/llama/lib/python3.11/site-packages/transformers/configuration_utils.py:675, in PretrainedConfig._get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
673 try:
674 # Load from local folder or from cache or download from model Hub and cache
--> 675 resolved_config_file = cached_file(
676 pretrained_model_name_or_path,
677 configuration_file,
678 cache_dir=cache_dir,
679 force_download=force_download,
680 proxies=proxies,
681 resume_download=resume_download,
682 local_files_only=local_files_only,
683 token=token,
684 user_agent=user_agent,
685 revision=revision,
686 subfolder=subfolder,
687 _commit_hash=commit_hash,
688 )
689 commit_hash = extract_commit_hash(resolved_config_file, commit_hash)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/transformers/utils/hub.py:429, in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash, **deprecated_kwargs)
427 try:
428 # Load from URL or cache if already cached
--> 429 resolved_file = hf_hub_download(
430 path_or_repo_id,
431 filename,
432 subfolder=None if len(subfolder) == 0 else subfolder,
433 repo_type=repo_type,
434 revision=revision,
435 cache_dir=cache_dir,
436 user_agent=user_agent,
437 force_download=force_download,
438 proxies=proxies,
439 resume_download=resume_download,
440 token=token,
441 local_files_only=local_files_only,
442 )
443 except GatedRepoError as e:

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py:110, in validate_hf_hub_args.._inner_fn(*args, **kwargs)
109 if arg_name in ["repo_id", "from_id", "to_id"]:
--> 110 validate_repo_id(arg_value)
112 elif arg_name == "token" and arg_value is not None:

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py:158, in validate_repo_id(repo_id)
157 if repo_id.count("/") > 1:
--> 158 raise HFValidationError(
159 "Repo id must be in the form 'repo_name' or 'namespace/repo_name':"
160 f" '{repo_id}'. Use repo_type argument if needed."
161 )
163 if not REPO_ID_REGEX.match(repo_id):

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/nvme4tb/Projects/llama2_models/Llama-2-7b-chat-hf'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)
Cell In[25], line 14
6 bnb_config = BitsAndBytesConfig(
7 load_in_4bit=True,
8 bnb_4bit_quant_type='nf4',
9 bnb_4bit_use_double_quant=True,
10 bnb_4bit_compute_dtype=bfloat16
11 )
13 # Model
---> 14 model_config = AutoConfig.from_pretrained(model_id)
15 model = AutoModelForCausalLM.from_pretrained(
16 model_id,
17 trust_remote_code=True,
(...)
20 device_map='auto',
21 )
22 model.eval() # set to evaluation for inference only

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py:1034, in AutoConfig.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
1031 trust_remote_code = kwargs.pop("trust_remote_code", None)
1032 code_revision = kwargs.pop("code_revision", None)
-> 1034 config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
1035 has_remote_code = "auto_map" in config_dict and "AutoConfig" in config_dict["auto_map"]
1036 has_local_code = "model_type" in config_dict and config_dict["model_type"] in CONFIG_MAPPING

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/transformers/configuration_utils.py:620, in PretrainedConfig.get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
618 original_kwargs = copy.deepcopy(kwargs)
619 # Get config dict associated with the base config file
--> 620 config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
621 if "_commit_hash" in config_dict:
622 original_kwargs["_commit_hash"] = config_dict["_commit_hash"]

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/transformers/configuration_utils.py:696, in PretrainedConfig._get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
693 raise
694 except Exception:
695 # For any other exception, we throw a generic error.
--> 696 raise EnvironmentError(
697 f"Can't load the configuration of '{pretrained_model_name_or_path}'. If you were trying to load it"
698 " from 'https://huggingface.co/models', make sure you don't have a local directory with the same"
699 f" name. Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory"
700 f" containing a {configuration_file} file"
701 )
703 try:
704 # Load config dict
705 config_dict = cls._dict_from_json_file(resolved_config_file)

OSError: Can't load the configuration of '/nvme4tb/Projects/llama2_models/Llama-2-7b-chat-hf'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/nvme4tb/Projects/llama2_models/Llama-2-7b-chat-hf' is the correct path to a directory containing a config.json file"
}`
I attempted to locate the model using the path "/nvme4tb/Projects/llama2_models/Llama-2-7b-chat-hf" on Hugging Face, but couldn't find any matching entries. Could there possibly be an alternative path available for download? Thanks!
