rag-from-scratch's Introduction

RAG From Scratch

LLMs are trained on a large but fixed corpus of data, limiting their ability to reason about private or recent information. Fine-tuning is one way to mitigate this, but it is often not well-suited for factual recall and can be costly. Retrieval augmented generation (RAG) has emerged as a popular and powerful mechanism to expand an LLM's knowledge base, using documents retrieved from an external data source to ground the LLM generation via in-context learning. These notebooks accompany a video playlist that builds up an understanding of RAG from scratch, starting with the basics of indexing, retrieval, and generation.

[image: rag_detail_v2]

rag-from-scratch's People

Contributors

rlancemartin

rag-from-scratch's Issues

Part 6: Questions about implementation of RAG-Fusion

Hi @rlancemartin,

I noticed that although the author describes the RAG-Fusion implementation as also using the original query at the retrieval stage (here and here), none of the implementations (the author's, this webinar, and the langchain template) include the original query.

Could you please tell me whether you have experimented with adding the original query with a higher weight (as the author describes in his article)?

Do I understand correctly that in that case we would assume score=1 for all chunks retrieved for the original query before applying RRF to the ranked documents?

Or maybe the following approach would be more robust and consistent:

  • After the retrieval stage (including the original query), rerank each query's results using a reranker
  • Apply RRF to the reranked documents

With this approach, the reranker decides whether the results are relevant to each specific query, including the original one.
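
For concreteness, here is a minimal sketch of the weighted fusion I have in mind (the function and the per-query weights are my own assumptions, not taken from the repo or the RAG-Fusion article):

from collections import defaultdict

def weighted_rrf(ranked_lists, weights=None, k=60):
    # ranked_lists: one ranked list of doc ids per query (original query first, for example)
    # weights: optional per-list weights, e.g. a larger weight for the original query
    weights = weights or [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for ranked, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] += weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical usage: give the original query's results twice the weight of the generated queries
# fused = weighted_rrf([original_results, generated_results_1, generated_results_2], weights=[2.0, 1.0, 1.0])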

It would be great to hear your opinion on this.

Thank you.

Part 9: Questions about implementation of HyDE

Hi @rlancemartin,

I have read the original HyDE paper and noticed (in Sections 3.2 and 4.1) that the authors use multiple document generations with temperature 0.7, together with the question itself, to calculate the final query embedding used for retrieving real documents (by taking the mean of these embeddings).

I also found that the implementation in the linked documentation is probably outdated: it uses the OpenAI model class and a deprecated chain rather than LCEL. It also doesn't use the query embedding in the final query-embedding calculation.

Since the steps in Part 9 are also not combined into a single LCEL chain, I tried to implement it myself, taking the comments above into account, and wrote the following code (assuming we already have a vectorstore with documents and an embeddings object):

from functools import partial
from operator import itemgetter

from langchain_openai.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel
from langchain.prompts import ChatPromptTemplate
import numpy as np


def generate_docs(arguments):
    question = arguments['question']
    generation_template = arguments['template']
    n = arguments['n']
    prompt_hyde = ChatPromptTemplate.from_template(generation_template)
    generate_docs_for_retrieval = (
        prompt_hyde 
        | ChatOpenAI(model='gpt-3.5-turbo-0125', temperature=0.7) 
        | StrOutputParser()
    )
    generated_docs = generate_docs_for_retrieval.batch([{'question': question}] * n)
    return generated_docs

def calculate_query_embeddings(query_components):
    # HyDE: average the embedding of the question itself with the embeddings
    # of the generated (hypothetical) documents to get the final query embedding
    question = query_components['question']
    generated_docs = query_components['docs']

    question_embeddings = np.array(embeddings.embed_query(question))
    generated_docs_embeddings = np.array(embeddings.embed_documents(generated_docs))

    query_embeddings = np.vstack([question_embeddings, generated_docs_embeddings])
    calculated_query_embeddings = np.mean(query_embeddings, axis=0, keepdims=True)
    return calculated_query_embeddings

def get_relevant_documents(query_embeddings, vectorstore, search_kwargs):
    return vectorstore.similarity_search_by_vector(query_embeddings, **search_kwargs)

search_kwargs = {'k': 4}
get_relevant_documents = partial(get_relevant_documents, vectorstore=vectorstore, search_kwargs=search_kwargs)

rag_template = """Answer the following question based on this context:
{context}

Question: {question}
"""
rag_prompt = ChatPromptTemplate.from_template(rag_template)

model = ChatOpenAI(model='gpt-3.5-turbo-0125', temperature=0)

chain = (
    RunnableParallel(
        {
            'question': itemgetter('question'),
            'context':
                RunnableParallel({
                    'question': itemgetter('question'),
                    'docs': generate_docs
                })
                | calculate_query_embeddings
                | get_relevant_documents,
        }
    )
    | rag_prompt
    | model
    | StrOutputParser()
)

generation_template = """Please write a scientific paper passage to answer the question
Question: {question}
Passage:"""
question = "What is task decomposition for LLM agents?"
n = 4

response = chain.invoke({
    'question': question,
    'template': generation_template,
    'n': n,
})
print(response)

I decided to use the batch() method of the Runnable to generate multiple documents, because I found that the invoke() method always returns only the first generation regardless of the n argument of the ChatOpenAI model (even though all n generations are created and increase the cost of the invocation).

It would be great to get your feedback on the implementation details from the paper (using multiple documents plus the query itself for the embedding calculation), on the implementation I provided (maybe you can recommend a more efficient solution, since with batch() the prompt tokens are sent with every request), and on the invoke() behavior (why it returns only the first generation, and whether there is a more cost-effective option than batch() if invoke() can't be used for multiple generations).
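
For reference, one workaround I am considering (an untested sketch that reuses the generation_template and question defined above, and bypasses LCEL for the generation step) is to call generate() on the chat model directly, so the prompt is sent once and all n completions come back from a single request:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

n = 4
hyde_llm = ChatOpenAI(model='gpt-3.5-turbo-0125', temperature=0.7, n=n)
prompt_hyde = ChatPromptTemplate.from_template(generation_template)

# generate() keeps all n completions per prompt, unlike invoke(), which only surfaces the first one
messages = prompt_hyde.format_messages(question=question)
result = hyde_llm.generate([messages])
generated_docs = [generation.text for generation in result.generations[0]]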

Thank you.

Part 7: Decomposition (part one), why don't we answer the initial question?

Hey,

question = "What are the main components of an LLM-powered autonomous agent system?"
questions = generate_queries_decomposition.invoke({"question":question})

My question is, why don't we put the original question into the QA chain later?

questions.append(question) or questions + [question]
My understanding is that the task must ultimately answer the original question based on the several previous multi-step Q&As, and the original query could include instructions on how to parse or format the data.

BR

AttributeError: 'Client' object has no attribute 'pull_repo'

I am running into this error with the hub.pull method.

RETRIEVAL and GENERATION

Prompt

prompt = hub.pull("rlm/rag-prompt")

LLM

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

Post-processing

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

Chain

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Question

rag_chain.invoke("What is Task Decomposition?")

Does anyone have any suggestions? Thank you!

retriever.get_relevant_documents is broken. tutorial (PART 1-4)

Following the tutorial (PART 1-4), I noticed that the model answered with "not enough information to answer". Looking at "docs" I get this output:
[Document(page_content='Conversatin samples:\n[\n {\n "role": "system",', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'})]
It seems like docs = retriever.get_relevant_documents("What is Task Decomposition?") is not retrieving useful page_content.
If I invoke the chain with the whole context (i.e., splits instead of docs), I do get a meaningful answer.
chain.invoke({"context":splits,"question":"What is Task Decomposition?"})

Note: I'm using Google's API, but I would expect that to be irrelevant.

License file missing

Hi @rlancemartin,

Thanks a lot for the comprehensive walkthrough. I'd like to show some of the content in a talk I'm preparing at the moment, but I could not find any information on the license. Could you provide a license file or, failing that, some information about the conditions under which I may showcase content from the repository?

Multiple document retrievers

I've gone through the notebooks in the repo and noticed that you use a single blog post. My question is: what if we have around 4-5 different sources of information, which may be relevant or irrelevant, and a query has to be answered from the specific document it relates to? I checked the rag_from_scratch_5_to_9.ipynb notebook for this, but the flow diagram and the implementation don't seem to match; you still use just one blog post. If we have multiple docs from multiple sources, relevant or irrelevant, how do we do that with LangChain?
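
For context, here is a minimal sketch of the kind of setup I mean (the URLs are placeholders): index several sources into one vectorstore, tag each chunk with its source, and filter retrieval to the relevant document.

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# placeholder URLs standing in for the 4-5 different sources
urls = [
    "https://example.com/source-1",
    "https://example.com/source-2",
]

docs = []
for url in urls:
    docs.extend(WebBaseLoader(url).load())  # each Document keeps its source URL in metadata

splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# restrict retrieval to one source by filtering on the metadata
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4, "filter": {"source": "https://example.com/source-1"}}
)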

Is the implementation in Part 7: Decomposition really an implementation of the Least-To-Most prompting method?

Hi @rlancemartin,

I have a question about the implementation in Part 7, where you refer to the Least-To-Most Prompting paper from Google.

You mentioned that this method can be implemented by processing all generated queries independently, but in the paper the authors describe solving the sub-questions sequentially, each based on the question-answer pairs already solved in the previous stage (Figure 1 of the paper). Based on that, it seems the original version of this method can't be parallelized, because each subsequent step depends on all the previous sub-questions and answers.

At 2:00 you mention that we can answer sub-questions in isolation, but at 2:20 you say that the previous solution will be injected into the context of the next sub-question; the subsequent implementation, however, does not include the previous Q&A pairs in the next question's context.
Do these statements contradict each other?

Do I understand correctly that your implementation is closer to the multi-query approach, except that instead of collecting all unique documents for all sub-questions you generate Q&A pairs from those documents and use them as the context for the final answer, while the rest of the logic is exactly the same as in multi-query, so it can be parallelized (as in the multi-query section) rather than processed sequentially with a for loop?
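
For comparison, here is a rough sketch (my own, not taken from the notebook) of the sequential variant I have in mind, where each sub-question sees the Q&A pairs accumulated so far; it assumes a retriever and a list of decomposed sub_questions already exist:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo-0125', temperature=0)

prompt = ChatPromptTemplate.from_template(
    """Here are previously answered sub-questions and their answers:
{q_a_pairs}

Here is additional retrieved context:
{context}

Answer the next sub-question: {question}"""
)

chain = prompt | llm | StrOutputParser()

q_a_pairs = ""
for sub_q in sub_questions:
    docs = retriever.get_relevant_documents(sub_q)
    context = "\n\n".join(doc.page_content for doc in docs)
    answer = chain.invoke({"question": sub_q, "q_a_pairs": q_a_pairs, "context": context})
    # append each answer so the next sub-question can build on it
    q_a_pairs += f"\nQuestion: {sub_q}\nAnswer: {answer}\n"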

Also, as I understand it, the core part of both stages of the Least-To-Most prompting technique is the demonstration examples, while the other techniques are optional (from the caption of Figure 1 and Section 2 of the paper):

Figure 1: Least-to-most prompting solving a math word problem in two stages: (1) query the language model to decompose the problem into subproblems; (2) query the language model to sequentially solve the subproblems. The answer to the second subproblem is built on the answer to the first subproblem. The demonstration examples for each stage's prompt are omitted in this illustration.

Least-to-most prompting can be combined with other prompting techniques like chain-of-thought (Wei et al., 2022) and self-consistency (Wang et al., 2022b), but does not need to be. Also, for some tasks, the two stages in least-to-most prompting can be merged to form a single-pass prompt.

I also found the diagram of the Multi-Query RAG implementation in the README of the LCEL Teacher repository (top part of the screenshot), and as I understand it, that diagram describes exactly what you implemented in Part 7 here.

Could you please share your thoughts on this? I am a little confused by the terminology.

Thank you.

[Part2: Indexing] embed_query() or embed_documents()?

Hi @rlancemartin, thanks for your clear tutorial.
I have a simple question about Part 2: Indexing.

query_result = embd.embed_query(question)
document_result = embd.embed_query(document)

Why not use "embed_documents([document])" here? Forgive me, I'm not very familiar with these functions.
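
For clarity, the alternative I have in mind is the following (embed_documents takes a list of texts and returns a list of embeddings, so the single-document call would be):

# sketch of the alternative call: embed_documents returns a list, so take the first element
document_result = embd.embed_documents([document])[0]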

NotImplementedError: While using AzureChatOpenAI (RAG from scratch: Part 10 (Routing))

Hi team,

My organization requires using Azure services for building a RAG model, and while building the routing model I encountered an issue. Below is my code:

Importing libraries

import os
from dotenv import load_dotenv
import certifi
# loading .env file
load_dotenv('.env')
api_base = os.getenv("AZURE_OPENAI_ENDPOINT") 
api_type = os.getenv("OPENAI_API_TYPE")
api_version = os.getenv("OPENAI_API_VERSION")
api_key = os.getenv("OPENAI_API_KEY")
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()
os.environ["SSL_CERT_FILE"] = certifi.where()
tiktoken_cache_dir = "./tiktoken_cache/"
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir
os.environ['LANGCHAIN_TRACING_V2'] = os.getenv("LANGCHAIN_TRACING_V2")
os.environ['LANGCHAIN_ENDPOINT'] = os.getenv("LANGCHAIN_ENDPOINT")
os.environ['LANGCHAIN_API_KEY'] = os.getenv("LANGCHAIN_API_KEY")
from typing import Literal
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_community.chat_models import AzureChatOpenAI

Data model

class RouteQuery(BaseModel):
    """Route a user query to the most relevant datasource."""

    datasource: Literal["python_docs", "js_docs", "golang_docs"] = Field(
        ...,
        description="Given a user question choose which datasource would be most relevant for answering their question",
    )

LLM with function call

from langchain.schema import HumanMessage
llm = AzureChatOpenAI(azure_endpoint=api_base, api_key=api_key, model="gpt-35-turbo-0613", temperature=0)

# Test Azure connectivity with below code
# message = HumanMessage(
#     content="Translate this sentence from English to Spanish.MS Dhoni as the greatest finisher in the history of the sport"
# )    
# print(llm([message])) 

structured_llm = llm.with_structured_output(RouteQuery)

When I initialize structured_llm, I get the error below:

Traceback:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[19], line 1
----> 1 structured_llm = llm.with_structured_output(RouteQuery)

File c:\Users\M307900\Documents\rcr_catalogue\.venv\lib\site-packages\langchain_core\_api\beta_decorator.py:110, in beta.<locals>.beta.<locals>.warning_emitting_wrapper(*args, **kwargs)
    108     warned = True
    109     emit_warning()
--> 110 return wrapped(*args, **kwargs)

File c:\Users\M307900\Documents\rcr_catalogue\.venv\lib\site-packages\langchain_core\language_models\base.py:204, in BaseLanguageModel.with_structured_output(self, schema, **kwargs)
    199 @beta()
    200 def with_structured_output(
    201     self, schema: Union[Dict, Type[BaseModel]], **kwargs: Any
    202 ) -> Runnable[LanguageModelInput, Union[Dict, BaseModel]]:
    203     """Implement this if there is a way of steering the model to generate responses that match a given schema."""  # noqa: E501
--> 204     raise NotImplementedError()

NotImplementedError:

Is there something I can do to fix it?
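
One thing that might be worth checking (an assumption on my part, not verified in your environment): with_structured_output is implemented on the AzureChatOpenAI class in the langchain_openai partner package, whereas the langchain_community version appears to fall back to the base-class stub that raises NotImplementedError. A minimal sketch of the swap:

# sketch: use the partner-package class instead of the community one
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_endpoint=api_base,
    api_key=api_key,
    api_version=api_version,
    azure_deployment="gpt-35-turbo-0613",  # assumed to be your deployment name
    temperature=0,
)
structured_llm = llm.with_structured_output(RouteQuery)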

Duplicated records in Chroma vectorstore after multiple cell executions

Hi @rlancemartin,

First of all thanks a lot for this series of lessons!

This is probably a known fact, but it wasn't clear to me when I first ran into it: if we run the cell with this code from your Jupyter notebook for Lessons 1-4 multiple (say, k) times:

vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=embeddings)
retriever = vectorstore.as_retriever()

then there will be k duplicate records for each original record, because this method adds the documents even if the collection already exists.
We can check this, for example, with:

vectorstore_data = vectorstore.get()
print(len(vectorstore_data['documents']))

As I recall, I saw similar behavior with the langchain wrapper for the Weaviate database.

So as a quick workaround we can delete the default collection (named "langchain") before adding the documents:

collection_name = 'langchain'
Chroma(collection_name=collection_name).delete_collection()
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=embeddings)
retriever = vectorstore.as_retriever()

Since there are no warnings or errors about the existing collection, this behavior may not be immediately noticed, so I hope this is useful to someone.

P.S. I also noticed that in Part 4 (here) 4 documents are retrieved, 2 of which are duplicates of the other ones.

Thank you.

Notes about code and presentation for Parts 10-11

Hello @rlancemartin,

I made some notes while exploring Parts 10-11 that may be useful:

In[17]: type(out)

Tutorials are awesome again, thank you!

Query Construction generates the wrong filter on the publish_date metadata filter

https://github.com/langchain-ai/rag-from-scratch/blob/abafe332aea1b841989b830a494f97634ecbe5f0/rag_from_scratch_10_and_11.ipynb#L626

For videos published before 2024, the filter should be latest_publish_date, not earliest_publish_date.
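
In other words, for a query like "videos published before 2024" I would expect the date filter to look roughly like this (a sketch using the field names from the notebook's query schema):

import datetime

# expected (sketch): "before 2024" should bound the upper end of the date range
# latest_publish_date = datetime.date(2024, 1, 1)
# earliest_publish_date should be left unset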

Is there anything that could be done in LangChain to improve the performance and give the correct response? Or is it a case of trying a different LLM that might perform better?

Many thanks
