Hey @hamero,
I've started working on your request to add support for customizing the location where ChromaDB stores the index. The plan is to modify the constructors of the `EmbedChain` and `ChromaDB` classes to accept an additional parameter for the custom location. This will allow you to specify the location when creating an instance of the `EmbedChain` class. I'll also update the README with instructions on how to use this new feature.
Give me a minute!
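A minimal sketch of what the planned change could look like. Note that `ChromaDB` already accepts a `db_dir` argument (see the snippet below); the new piece would be threading a `db_dir` parameter through `EmbedChain`. The classes here are simplified stand-ins that only mirror the constructor wiring, not the real chromadb setup:

```python
# Hypothetical sketch of the planned change -- simplified stand-ins for the
# real classes, not the actual embedchain code.

class ChromaDB:
    def __init__(self, db_dir=None, ef=None):
        # The real class passes db_dir to chromadb.config.Settings as
        # persist_directory; it defaults to "db" when not given.
        self.db_dir = db_dir if db_dir is not None else "db"


class EmbedChain:
    def __init__(self, db=None, ef=None, db_dir=None):
        # New optional db_dir argument, forwarded to ChromaDB only when the
        # caller has not supplied a ready-made db instance.
        if db is None:
            db = ChromaDB(db_dir=db_dir, ef=ef)
        self.db = db


bot = EmbedChain(db_dir="/tmp/my_custom_index")
print(bot.db.db_dir)  # /tmp/my_custom_index
```

With this wiring, `App(db_dir=...)` could simply pass the argument down to `EmbedChain`, so existing code without the parameter keeps its current behavior.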
Some code snippets I looked at (click to expand). If a file is missing from here, you can mention its path in the ticket description.
embedchain/embedchain/vectordb/chroma_db.py
Lines 1 to 32 in 77c8a32
```python
import chromadb
import os

from chromadb.utils import embedding_functions

from embedchain.vectordb.base_vector_db import BaseVectorDB

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    organization_id=os.getenv("OPENAI_ORGANIZATION"),
    model_name="text-embedding-ada-002"
)


class ChromaDB(BaseVectorDB):
    def __init__(self, db_dir=None, ef=None):
        self.ef = ef if ef is not None else openai_ef
        if db_dir is None:
            db_dir = "db"
        self.client_settings = chromadb.config.Settings(
            chroma_db_impl="duckdb+parquet",
            persist_directory=db_dir,
            anonymized_telemetry=False
        )
        super().__init__()

    def _get_or_create_db(self):
        return chromadb.Client(self.client_settings)

    def _get_or_create_collection(self):
        return self.client.get_or_create_collection(
            'embedchain_store', embedding_function=self.ef,
        )
```
Lines 1 to 265 in 77c8a32
# embedchain

[![](https://dcbadge.vercel.app/api/server/nhvCbCtKV?style=flat)](https://discord.gg/nhvCbCtKV)
[![PyPI](https://img.shields.io/pypi/v/embedchain)](https://pypi.org/project/embedchain/)

embedchain is a framework to easily create LLM powered bots over any dataset. If you want a JavaScript version, check out [embedchain-js](https://github.com/embedchain/embedchainjs).

# Latest Updates

* Introduced a new app type called `OpenSourceApp`. It uses `gpt4all` as the LLM and `all-MiniLM-L6-v2` from `sentence-transformers` as the embedding model. If you use this app, you don't have to pay for anything.

# What is embedchain?

Embedchain abstracts the entire process of loading a dataset, chunking it, creating embeddings and then storing them in a vector database.

You can add one or more datasets using the `.add` and `.add_local` functions and then use the `.query` function to find an answer from the added datasets.

If you want to create a Naval Ravikant bot from 1 YouTube video, 1 book as a PDF, 2 of his blog posts and a question-and-answer pair you supply, all you need to do is add the links to the video, PDF and blog posts along with the QnA pair, and embedchain will create a bot for you.

```python
from embedchain import App

naval_chat_bot = App()

# Embed Online Resources
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
naval_chat_bot.add("web_page", "https://nav.al/feedback")
naval_chat_bot.add("web_page", "https://nav.al/agi")

# Embed Local Resources
naval_chat_bot.add_local("qna_pair", ("Who is Naval Ravikant?", "Naval Ravikant is an Indian-American entrepreneur and investor."))

naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?")
# answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.
```

# Getting Started

## Installation

First make sure that you have the package installed. If not, install it using `pip`:

```bash
pip install embedchain
```

## Usage

Creating a chatbot involves 3 steps:

- import the App instance
- add datasets
- query the datasets and get answers

### App Types

We have two types of App.

#### 1. App (uses OpenAI models, paid)

```python
from embedchain import App

naval_chat_bot = App()
```

* `App` uses OpenAI's models, so these are paid models. You will be charged for embedding model usage and LLM usage.
* `App` uses OpenAI's embedding model to create embeddings for chunks and the ChatGPT API as the LLM to get an answer given the relevant docs. Make sure that you have an OpenAI account and an API key. If you don't have an API key, you can create one by visiting [this link](https://platform.openai.com/account/api-keys).
* Once you have the API key, set it in an environment variable called `OPENAI_API_KEY`:

```python
import os
os.environ["OPENAI_API_KEY"] = "sk-xxxx"
```

#### 2. OpenSourceApp (uses open source models, free)

```python
from embedchain import OpenSourceApp

naval_chat_bot = OpenSourceApp()
```

* `OpenSourceApp` uses open source embedding and LLM models. It uses `all-MiniLM-L6-v2` from the Sentence Transformers library as the embedding model and `gpt4all` as the LLM.
* There is no need to set up any API keys here. You just need to install the embedchain package, and the models will be installed automatically.
* Once you have imported and instantiated the app, every functionality from here onwards is the same for either type of app.

### Add datasets and query

* This step assumes that you have already created an app instance using either `App` or `OpenSourceApp`. We call our app instance `naval_chat_bot`.
* Now use the `.add` function to add any dataset:

```python
# naval_chat_bot = App() or
# naval_chat_bot = OpenSourceApp()

# Embed Online Resources
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
naval_chat_bot.add("web_page", "https://nav.al/feedback")
naval_chat_bot.add("web_page", "https://nav.al/agi")

# Embed Local Resources
naval_chat_bot.add_local("qna_pair", ("Who is Naval Ravikant?", "Naval Ravikant is an Indian-American entrepreneur and investor."))
```

* If there is any other app instance in your script or app, you can change the import as:

```python
from embedchain import App as EmbedChainApp
from embedchain import OpenSourceApp as EmbedChainOSApp

# or

from embedchain import App as ECApp
from embedchain import OpenSourceApp as ECOSApp
```

* Now your app is created. You can use the `.query` function to get the answer for any query:

```python
print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
# answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.
```

## Formats supported

We support the following formats:

### YouTube Video

To add any YouTube video to your app, use the data_type (first argument to `.add`) `youtube_video`. Eg:

```python
app.add('youtube_video', 'a_valid_youtube_url_here')
```

### PDF File

To add any PDF file, use the data_type `pdf_file`. Eg:

```python
app.add('pdf_file', 'a_valid_url_where_pdf_file_can_be_accessed')
```

Note that we do not support password-protected PDFs.

### Web Page

To add any web page, use the data_type `web_page`. Eg:

```python
app.add('web_page', 'a_valid_web_page_url')
```

### Text

To supply your own text, use the data_type `text` and enter a string. The text is not processed, so this can be very versatile. Eg:

```python
app.add_local('text', 'Seek wealth, not money or status. Wealth is having assets that earn while you sleep. Money is how we transfer time and wealth. Status is your place in the social hierarchy.')
```

Note: This is not used in the examples because in most cases you will supply a whole paragraph or file, which would not fit here.

### QnA Pair

To supply your own QnA pair, use the data_type `qna_pair` and enter a tuple. Eg:

```python
app.add_local('qna_pair', ("Question", "Answer"))
```

### Reusing a Vector DB

The default behavior is to create a persistent vector DB in the directory **./db**. You can split your application into two Python scripts: one to create a local vector DB and the other to reuse this local persistent vector DB. This is useful when you want to index hundreds of documents and separately implement a chat interface.

Create a local index:

```python
from embedchain import App

naval_chat_bot = App()
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
```

You can reuse the local index with the same code, but without adding new documents:

```python
from embedchain import App

naval_chat_bot = App()
print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
```

### More Formats coming soon

* If you want to add any other format, please create an [issue](https://github.com/embedchain/embedchain/issues) and we will add it to the list of supported formats.

# How does it work?

Creating a chat bot over any dataset requires the following steps:

* load the data
* create meaningful chunks
* create embeddings for each chunk
* store the chunks in a vector database

Whenever a user asks a query, the following process happens to find the answer:

* create the embedding for the query
* find similar documents for the query in the vector database
* pass the similar documents as context to the LLM to get the final answer
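The ingestion and query steps above can be sketched end to end with toy stand-ins for the real components. Here a character-count "embedding" and brute-force dot-product similarity take the place of the real embedding model and vector database (embedchain uses OpenAI/sentence-transformers embeddings and Chroma for these):

```python
# Toy sketch of the load -> chunk -> embed -> store -> query pipeline.
# The "embedding" is a 26-bin character histogram; real apps use a model.

def embed(text):
    vec = [0] * 26
    for ch in text.lower():
        if ch.isalpha() and ord(ch) < 128:
            vec[ord(ch) - 97] += 1
    return vec

def similarity(a, b):
    # unnormalized dot product -- fine for a toy, biased toward long chunks
    return sum(x * y for x, y in zip(a, b))

store = []  # list of (chunk, vector) pairs, standing in for the vector DB

def add(document, chunk_size=40):
    # naive fixed-size chunking; real chunkers respect sentence boundaries
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    for c in chunks:
        store.append((c, embed(c)))

def query(q):
    qv = embed(q)
    best = max(store, key=lambda item: similarity(item[1], qv))
    return best[0]  # in embedchain, this chunk becomes context for the LLM

add("Wealth is having assets that earn while you sleep.")
add("Status is your place in the social hierarchy.")
print(query("what is wealth"))
```

The last step in the real pipeline, handing the retrieved chunk to the LLM inside a prompt, is what `generate_prompt` and `get_answer_from_llm` do in the `embedchain.py` snippet below.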
The process of loading a dataset and then querying involves multiple steps, and each step has nuances of its own:

* How should I chunk the data? What is a meaningful chunk size?
* How should I create embeddings for each chunk? Which embedding model should I use?
* How should I store the chunks in a vector database? Which vector database should I use?
* Should I store metadata along with the embeddings?
* How should I find similar documents for a query? Which ranking model should I use?

These questions may be trivial for some, but for a lot of us it takes research, experimentation and time to find accurate answers.

embedchain is a framework which takes care of all these nuances and provides a simple interface to create bots over any dataset.

In the first release, we are making it easier for anyone to get a chatbot over any dataset up and running in less than a minute. All you need to do is create an app instance, add the datasets using the `.add` function and then use the `.query` function to get the relevant answer.

# Tech Stack

embedchain is built on the following stack:

- [Langchain](https://github.com/hwchase17/langchain) as an LLM framework to load, chunk and index data
- [OpenAI's Ada embedding model](https://platform.openai.com/docs/guides/embeddings) to create embeddings
- [OpenAI's ChatGPT API](https://platform.openai.com/docs/guides/gpt/chat-completions-api) as the LLM to get answers given the context
- [Chroma](https://github.com/chroma-core/chroma) as the vector database to store embeddings
- [gpt4all](https://github.com/nomic-ai/gpt4all) as an open source LLM
- [sentence-transformers](https://huggingface.co/sentence-transformers) as an open source embedding model

# Author

* Taranjeet Singh ([@taranjeetio](https://twitter.com/taranjeetio))

## Citation

If you utilize this repository, please consider citing it with:

```
@misc{embedchain,
  author = {Taranjeet Singh},
  title = {Embedchain: Framework to easily create LLM powered bots over any dataset},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/embedchain/embedchain}},
}
```
embedchain/embedchain/embedchain.py
Lines 1 to 204 in 77c8a32
```python
import os

from chromadb.utils import embedding_functions
from dotenv import load_dotenv
from gpt4all import GPT4All
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings

from embedchain.loaders.youtube_video import YoutubeVideoLoader
from embedchain.loaders.pdf_file import PdfFileLoader
from embedchain.loaders.web_page import WebPageLoader
from embedchain.loaders.local_qna_pair import LocalQnaPairLoader
from embedchain.loaders.local_text import LocalTextLoader
from embedchain.chunkers.youtube_video import YoutubeVideoChunker
from embedchain.chunkers.pdf_file import PdfFileChunker
from embedchain.chunkers.web_page import WebPageChunker
from embedchain.chunkers.qna_pair import QnaPairChunker
from embedchain.chunkers.text import TextChunker
from embedchain.vectordb.chroma_db import ChromaDB

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    organization_id=os.getenv("OPENAI_ORGANIZATION"),
    model_name="text-embedding-ada-002"
)

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

gpt4all_model = None

load_dotenv()

ABS_PATH = os.getcwd()
DB_DIR = os.path.join(ABS_PATH, "db")


class EmbedChain:
    def __init__(self, db=None, ef=None):
        """
        Initializes the EmbedChain instance, sets up a vector DB client and
        creates a collection.

        :param db: The instance of the VectorDB subclass.
        """
        if db is None:
            db = ChromaDB(ef=ef)
        self.db_client = db.client
        self.collection = db.collection
        self.user_asks = []

    def _get_loader(self, data_type):
        """
        Returns the appropriate data loader for the given data type.

        :param data_type: The type of the data to load.
        :return: The loader for the given data type.
        :raises ValueError: If an unsupported data type is provided.
        """
        loaders = {
            'youtube_video': YoutubeVideoLoader(),
            'pdf_file': PdfFileLoader(),
            'web_page': WebPageLoader(),
            'qna_pair': LocalQnaPairLoader(),
            'text': LocalTextLoader(),
        }
        if data_type in loaders:
            return loaders[data_type]
        else:
            raise ValueError(f"Unsupported data type: {data_type}")

    def _get_chunker(self, data_type):
        """
        Returns the appropriate chunker for the given data type.

        :param data_type: The type of the data to chunk.
        :return: The chunker for the given data type.
        :raises ValueError: If an unsupported data type is provided.
        """
        chunkers = {
            'youtube_video': YoutubeVideoChunker(),
            'pdf_file': PdfFileChunker(),
            'web_page': WebPageChunker(),
            'qna_pair': QnaPairChunker(),
            'text': TextChunker(),
        }
        if data_type in chunkers:
            return chunkers[data_type]
        else:
            raise ValueError(f"Unsupported data type: {data_type}")

    def add(self, data_type, url):
        """
        Adds the data from the given URL to the vector db.
        Loads the data, chunks it, creates embeddings for each chunk
        and then stores the embeddings in the vector database.

        :param data_type: The type of the data to add.
        :param url: The URL where the data is located.
        """
        loader = self._get_loader(data_type)
        chunker = self._get_chunker(data_type)
        self.user_asks.append([data_type, url])
        self.load_and_embed(loader, chunker, url)

    def add_local(self, data_type, content):
        """
        Adds the data you supply to the vector db.
        Loads the data, chunks it, creates embeddings for each chunk
        and then stores the embeddings in the vector database.

        :param data_type: The type of the data to add.
        :param content: The local data. Refer to the `README` for formatting.
        """
        loader = self._get_loader(data_type)
        chunker = self._get_chunker(data_type)
        self.user_asks.append([data_type, content])
        self.load_and_embed(loader, chunker, content)

    def load_and_embed(self, loader, chunker, url):
        """
        Loads the data from the given URL, chunks it, and adds it to the database.

        :param loader: The loader to use to load the data.
        :param chunker: The chunker to use to chunk the data.
        :param url: The URL where the data is located.
        """
        embeddings_data = chunker.create_chunks(loader, url)
        documents = embeddings_data["documents"]
        metadatas = embeddings_data["metadatas"]
        ids = embeddings_data["ids"]
        # get existing ids, and discard doc if any common id exist.
        existing_docs = self.collection.get(
            ids=ids,
            # where={"url": url}
        )
        existing_ids = set(existing_docs["ids"])
        if len(existing_ids):
            data_dict = {id: (doc, meta) for id, doc, meta in zip(ids, documents, metadatas)}
            data_dict = {id: value for id, value in data_dict.items() if id not in existing_ids}
            if not data_dict:
                print(f"All data from {url} already exists in the database.")
                return
            ids = list(data_dict.keys())
            documents, metadatas = zip(*data_dict.values())
        self.collection.add(
            documents=documents,
            metadatas=metadatas,
            ids=ids
        )
        print(f"Successfully saved {url}. Total chunks count: {self.collection.count()}")

    def _format_result(self, results):
        return [
            (Document(page_content=result[0], metadata=result[1] or {}), result[2])
            for result in zip(
                results["documents"][0],
                results["metadatas"][0],
                results["distances"][0],
            )
        ]

    def get_llm_model_answer(self, prompt):
        raise NotImplementedError

    def retrieve_from_database(self, input_query):
        """
        Queries the vector database based on the given input query.
        Gets the relevant doc based on the query.

        :param input_query: The query to use.
        :return: The content of the document that matched your query.
        """
        result = self.collection.query(
            query_texts=[input_query,],
            n_results=1,
        )
        result_formatted = self._format_result(result)
        if result_formatted:
            content = result_formatted[0][0].page_content
        else:
            content = ""
        return content

    def generate_prompt(self, input_query, context):
        """
        Generates a prompt based on the given query and context, ready to be passed to an LLM.

        :param input_query: The query to use.
        :param context: Similar documents to the query used as context.
        :return: The prompt
        """
        prompt = f"""Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Query: {input_query}
Helpful Answer:
"""
        return prompt

    def get_answer_from_llm(self, prompt):
        """
        Gets an answer based on the given query and context by passing it
```
Lines 20 to 194 in 77c8a32
```
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [2023] [Taranjeet Singh]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
```
embedchain/embedchain/loaders/pdf_file.py
Lines 1 to 23 in 77c8a32
```python
from langchain.document_loaders import PyPDFLoader

from embedchain.utils import clean_string


class PdfFileLoader:
    def load_data(self, url):
        loader = PyPDFLoader(url)
        output = []
        pages = loader.load_and_split()
        if not len(pages):
            raise ValueError("No data found")
        for page in pages:
            content = page.page_content
            content = clean_string(content)
            meta_data = page.metadata
            meta_data["url"] = url
            output.append({
                "content": content,
                "meta_data": meta_data,
            })
        return output
```
I'm a bot that handles simple bugs and feature requests but I might make mistakes. Please be kind!