Hey @hamero,
I've started working on your request to add support for customizing the location where ChromaDB stores the index. The plan is to modify the constructors of the `EmbedChain` and `ChromaDB` classes to accept an additional parameter for the custom location. This will allow you to specify the location when creating an instance of the `EmbedChain` class. I'll also update the README with instructions on how to use this new feature.
Give me a minute!
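A minimal sketch of what the planned change could look like. Note that `ChromaDB` already accepts a `db_dir` argument (see the snippet below); the new piece would be threading a `db_dir` parameter through `EmbedChain`. The classes here are simplified stand-ins that only mirror the constructor wiring, not the real chromadb setup:

```python
# Hypothetical sketch of the planned change -- simplified stand-ins for the
# real classes, not the actual embedchain code.

class ChromaDB:
    def __init__(self, db_dir=None, ef=None):
        # The real class passes db_dir to chromadb.config.Settings as
        # persist_directory; it defaults to "db" when not given.
        self.db_dir = db_dir if db_dir is not None else "db"


class EmbedChain:
    def __init__(self, db=None, ef=None, db_dir=None):
        # New optional db_dir argument, forwarded to ChromaDB only when the
        # caller has not supplied a ready-made db instance.
        if db is None:
            db = ChromaDB(db_dir=db_dir, ef=ef)
        self.db = db


bot = EmbedChain(db_dir="/tmp/my_custom_index")
print(bot.db.db_dir)  # /tmp/my_custom_index
```

With this wiring, `App(db_dir=...)` could simply pass the argument down to `EmbedChain`, so existing code without the parameter keeps its current behavior.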
Some code snippets I looked at (click to expand). If a file is missing from here, you can mention its path in the ticket description.
embedchain/embedchain/vectordb/chroma_db.py
Lines 1 to 32 in 77c8a32
```python
import chromadb
import os

from chromadb.utils import embedding_functions

from embedchain.vectordb.base_vector_db import BaseVectorDB

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    organization_id=os.getenv("OPENAI_ORGANIZATION"),
    model_name="text-embedding-ada-002"
)


class ChromaDB(BaseVectorDB):
    def __init__(self, db_dir=None, ef=None):
        self.ef = ef if ef is not None else openai_ef
        if db_dir is None:
            db_dir = "db"
        self.client_settings = chromadb.config.Settings(
            chroma_db_impl="duckdb+parquet",
            persist_directory=db_dir,
            anonymized_telemetry=False
        )
        super().__init__()

    def _get_or_create_db(self):
        return chromadb.Client(self.client_settings)

    def _get_or_create_collection(self):
        return self.client.get_or_create_collection(
            'embedchain_store', embedding_function=self.ef,
        )
```
Lines 1 to 265 in 77c8a32
# embedchain

[![](https://dcbadge.vercel.app/api/server/nhvCbCtKV?style=flat)](https://discord.gg/nhvCbCtKV)
[![PyPI](https://img.shields.io/pypi/v/embedchain)](https://pypi.org/project/embedchain/)

embedchain is a framework to easily create LLM powered bots over any dataset. If you want a JavaScript version, check out [embedchain-js](https://github.com/embedchain/embedchainjs).

# Latest Updates

* Introduced a new app type called `OpenSourceApp`. It uses `gpt4all` as the LLM and `all-MiniLM-L6-v2` from `sentence-transformers` as the embedding model. If you use this app, you don't have to pay for anything.

# What is embedchain?

Embedchain abstracts the entire process of loading a dataset, chunking it, creating embeddings and then storing them in a vector database.

You can add one or more datasets using the `.add` and `.add_local` functions and then use the `.query` function to find an answer from the added datasets.

If you want to create a Naval Ravikant bot from 1 YouTube video, 1 book as a PDF, 2 of his blog posts and a question-and-answer pair you supply, all you need to do is add the links to the video, PDF and blog posts along with the QnA pair, and embedchain will create a bot for you.

```python
from embedchain import App

naval_chat_bot = App()

# Embed Online Resources
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
naval_chat_bot.add("web_page", "https://nav.al/feedback")
naval_chat_bot.add("web_page", "https://nav.al/agi")

# Embed Local Resources
naval_chat_bot.add_local("qna_pair", ("Who is Naval Ravikant?", "Naval Ravikant is an Indian-American entrepreneur and investor."))

naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?")
# answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.
```

# Getting Started

## Installation

First make sure that you have the package installed. If not, install it using `pip`:

```bash
pip install embedchain
```

## Usage

Creating a chatbot involves 3 steps:

- import the App instance
- add datasets
- query the datasets and get answers

### App Types

We have two types of App.

#### 1. App (uses OpenAI models, paid)

```python
from embedchain import App

naval_chat_bot = App()
```

* `App` uses OpenAI's models, so these are paid models. You will be charged for embedding model usage and LLM usage.
* `App` uses OpenAI's embedding model to create embeddings for chunks and the ChatGPT API as the LLM to get an answer given the relevant docs. Make sure that you have an OpenAI account and an API key. If you don't have an API key, you can create one by visiting [this link](https://platform.openai.com/account/api-keys).
* Once you have the API key, set it in an environment variable called `OPENAI_API_KEY`:

```python
import os
os.environ["OPENAI_API_KEY"] = "sk-xxxx"
```

#### 2. OpenSourceApp (uses open source models, free)

```python
from embedchain import OpenSourceApp

naval_chat_bot = OpenSourceApp()
```

* `OpenSourceApp` uses open source embedding and LLM models. It uses `all-MiniLM-L6-v2` from the Sentence Transformers library as the embedding model and `gpt4all` as the LLM.
* There is no need to set up any API keys here. You just need to install the embedchain package, and the models will be installed automatically.
* Once you have imported and instantiated the app, every functionality from here onwards is the same for either type of app.

### Add datasets and query

* This step assumes that you have already created an app instance using either `App` or `OpenSourceApp`. We call our app instance `naval_chat_bot`.
* Now use the `.add` function to add any dataset:

```python
# naval_chat_bot = App() or
# naval_chat_bot = OpenSourceApp()

# Embed Online Resources
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
naval_chat_bot.add("web_page", "https://nav.al/feedback")
naval_chat_bot.add("web_page", "https://nav.al/agi")

# Embed Local Resources
naval_chat_bot.add_local("qna_pair", ("Who is Naval Ravikant?", "Naval Ravikant is an Indian-American entrepreneur and investor."))
```

* If there is any other app instance in your script or app, you can change the import as:

```python
from embedchain import App as EmbedChainApp
from embedchain import OpenSourceApp as EmbedChainOSApp

# or

from embedchain import App as ECApp
from embedchain import OpenSourceApp as ECOSApp
```

* Now your app is created. You can use the `.query` function to get the answer for any query:

```python
print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
# answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.
```

## Formats supported

We support the following formats:

### YouTube Video

To add any YouTube video to your app, use the data_type (first argument to `.add`) `youtube_video`. Eg:

```python
app.add('youtube_video', 'a_valid_youtube_url_here')
```

### PDF File

To add any PDF file, use the data_type `pdf_file`. Eg:

```python
app.add('pdf_file', 'a_valid_url_where_pdf_file_can_be_accessed')
```

Note that we do not support password-protected PDFs.

### Web Page

To add any web page, use the data_type `web_page`. Eg:

```python
app.add('web_page', 'a_valid_web_page_url')
```

### Text

To supply your own text, use the data_type `text` and enter a string. The text is not processed, so this can be very versatile. Eg:

```python
app.add_local('text', 'Seek wealth, not money or status. Wealth is having assets that earn while you sleep. Money is how we transfer time and wealth. Status is your place in the social hierarchy.')
```

Note: This is not used in the examples because in most cases you will supply a whole paragraph or file, which would not fit here.

### QnA Pair

To supply your own QnA pair, use the data_type `qna_pair` and enter a tuple. Eg:

```python
app.add_local('qna_pair', ("Question", "Answer"))
```

### Reusing a Vector DB

The default behavior is to create a persistent vector DB in the directory **./db**. You can split your application into two Python scripts: one to create a local vector DB and the other to reuse this local persistent vector DB. This is useful when you want to index hundreds of documents and separately implement a chat interface.

Create a local index:

```python
from embedchain import App

naval_chat_bot = App()
naval_chat_bot.add("youtube_video", "https://www.youtube.com/watch?v=3qHkcs3kG44")
naval_chat_bot.add("pdf_file", "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
```

You can reuse the local index with the same code, but without adding new documents:

```python
from embedchain import App

naval_chat_bot = App()
print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
```

### More Formats coming soon

* If you want to add any other format, please create an [issue](https://github.com/embedchain/embedchain/issues) and we will add it to the list of supported formats.

# How does it work?

Creating a chat bot over any dataset requires the following steps:

* load the data
* create meaningful chunks
* create embeddings for each chunk
* store the chunks in a vector database

Whenever a user asks a query, the following process happens to find the answer:

* create the embedding for the query
* find similar documents for the query in the vector database
* pass the similar documents as context to the LLM to get the final answer
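The ingestion and query steps above can be sketched end to end with toy stand-ins for the real components. Here a character-count "embedding" and brute-force dot-product similarity take the place of the real embedding model and vector database (embedchain uses OpenAI/sentence-transformers embeddings and Chroma for these):

```python
# Toy sketch of the load -> chunk -> embed -> store -> query pipeline.
# The "embedding" is a 26-bin character histogram; real apps use a model.

def embed(text):
    vec = [0] * 26
    for ch in text.lower():
        if ch.isalpha() and ord(ch) < 128:
            vec[ord(ch) - 97] += 1
    return vec

def similarity(a, b):
    # unnormalized dot product -- fine for a toy, biased toward long chunks
    return sum(x * y for x, y in zip(a, b))

store = []  # list of (chunk, vector) pairs, standing in for the vector DB

def add(document, chunk_size=40):
    # naive fixed-size chunking; real chunkers respect sentence boundaries
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    for c in chunks:
        store.append((c, embed(c)))

def query(q):
    qv = embed(q)
    best = max(store, key=lambda item: similarity(item[1], qv))
    return best[0]  # in embedchain, this chunk becomes context for the LLM

add("Wealth is having assets that earn while you sleep.")
add("Status is your place in the social hierarchy.")
print(query("what is wealth"))
```

The last step in the real pipeline, handing the retrieved chunk to the LLM inside a prompt, is what `generate_prompt` and `get_answer_from_llm` do in the `embedchain.py` snippet below.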
The process of loading a dataset and then querying involves multiple steps, and each step has nuances of its own:

* How should I chunk the data? What is a meaningful chunk size?
* How should I create embeddings for each chunk? Which embedding model should I use?
* How should I store the chunks in a vector database? Which vector database should I use?
* Should I store metadata along with the embeddings?
* How should I find similar documents for a query? Which ranking model should I use?

These questions may be trivial for some, but for a lot of us it takes research, experimentation and time to find accurate answers.

embedchain is a framework which takes care of all these nuances and provides a simple interface to create bots over any dataset.

In the first release, we are making it easier for anyone to get a chatbot over any dataset up and running in less than a minute. All you need to do is create an app instance, add the datasets using the `.add` function and then use the `.query` function to get the relevant answer.

# Tech Stack

embedchain is built on the following stack:

- [Langchain](https://github.com/hwchase17/langchain) as an LLM framework to load, chunk and index data
- [OpenAI's Ada embedding model](https://platform.openai.com/docs/guides/embeddings) to create embeddings
- [OpenAI's ChatGPT API](https://platform.openai.com/docs/guides/gpt/chat-completions-api) as the LLM to get answers given the context
- [Chroma](https://github.com/chroma-core/chroma) as the vector database to store embeddings
- [gpt4all](https://github.com/nomic-ai/gpt4all) as an open source LLM
- [sentence-transformers](https://huggingface.co/sentence-transformers) as an open source embedding model

# Author

* Taranjeet Singh ([@taranjeetio](https://twitter.com/taranjeetio))

## Citation

If you utilize this repository, please consider citing it with:

```
@misc{embedchain,
  author = {Taranjeet Singh},
  title = {Embedchain: Framework to easily create LLM powered bots over any dataset},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/embedchain/embedchain}},
}
```
embedchain/embedchain/embedchain.py
Lines 1 to 204 in 77c8a32
```python
import os

from chromadb.utils import embedding_functions
from dotenv import load_dotenv
from gpt4all import GPT4All
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings

from embedchain.loaders.youtube_video import YoutubeVideoLoader
from embedchain.loaders.pdf_file import PdfFileLoader
from embedchain.loaders.web_page import WebPageLoader
from embedchain.loaders.local_qna_pair import LocalQnaPairLoader
from embedchain.loaders.local_text import LocalTextLoader
from embedchain.chunkers.youtube_video import YoutubeVideoChunker
from embedchain.chunkers.pdf_file import PdfFileChunker
from embedchain.chunkers.web_page import WebPageChunker
from embedchain.chunkers.qna_pair import QnaPairChunker
from embedchain.chunkers.text import TextChunker
from embedchain.vectordb.chroma_db import ChromaDB

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    organization_id=os.getenv("OPENAI_ORGANIZATION"),
    model_name="text-embedding-ada-002"
)

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

gpt4all_model = None

load_dotenv()

ABS_PATH = os.getcwd()
DB_DIR = os.path.join(ABS_PATH, "db")


class EmbedChain:
    def __init__(self, db=None, ef=None):
        """
        Initializes the EmbedChain instance, sets up a vector DB client and
        creates a collection.

        :param db: The instance of the VectorDB subclass.
        """
        if db is None:
            db = ChromaDB(ef=ef)
        self.db_client = db.client
        self.collection = db.collection
        self.user_asks = []

    def _get_loader(self, data_type):
        """
        Returns the appropriate data loader for the given data type.

        :param data_type: The type of the data to load.
        :return: The loader for the given data type.
        :raises ValueError: If an unsupported data type is provided.
        """
        loaders = {
            'youtube_video': YoutubeVideoLoader(),
            'pdf_file': PdfFileLoader(),
            'web_page': WebPageLoader(),
            'qna_pair': LocalQnaPairLoader(),
            'text': LocalTextLoader(),
        }
        if data_type in loaders:
            return loaders[data_type]
        else:
            raise ValueError(f"Unsupported data type: {data_type}")

    def _get_chunker(self, data_type):
        """
        Returns the appropriate chunker for the given data type.

        :param data_type: The type of the data to chunk.
        :return: The chunker for the given data type.
        :raises ValueError: If an unsupported data type is provided.
        """
        chunkers = {
            'youtube_video': YoutubeVideoChunker(),
            'pdf_file': PdfFileChunker(),
            'web_page': WebPageChunker(),
            'qna_pair': QnaPairChunker(),
            'text': TextChunker(),
        }
        if data_type in chunkers:
            return chunkers[data_type]
        else:
            raise ValueError(f"Unsupported data type: {data_type}")

    def add(self, data_type, url):
        """
        Adds the data from the given URL to the vector db.
        Loads the data, chunks it, creates embeddings for each chunk
        and then stores the embeddings in the vector database.

        :param data_type: The type of the data to add.
        :param url: The URL where the data is located.
        """
        loader = self._get_loader(data_type)
        chunker = self._get_chunker(data_type)
        self.user_asks.append([data_type, url])
        self.load_and_embed(loader, chunker, url)

    def add_local(self, data_type, content):
        """
        Adds the data you supply to the vector db.
        Loads the data, chunks it, creates embeddings for each chunk
        and then stores the embeddings in the vector database.

        :param data_type: The type of the data to add.
        :param content: The local data. Refer to the `README` for formatting.
        """
        loader = self._get_loader(data_type)
        chunker = self._get_chunker(data_type)
        self.user_asks.append([data_type, content])
        self.load_and_embed(loader, chunker, content)

    def load_and_embed(self, loader, chunker, url):
        """
        Loads the data from the given URL, chunks it, and adds it to the database.

        :param loader: The loader to use to load the data.
        :param chunker: The chunker to use to chunk the data.
        :param url: The URL where the data is located.
        """
        embeddings_data = chunker.create_chunks(loader, url)
        documents = embeddings_data["documents"]
        metadatas = embeddings_data["metadatas"]
        ids = embeddings_data["ids"]
        # get existing ids, and discard doc if any common id exist.
        existing_docs = self.collection.get(
            ids=ids,
            # where={"url": url}
        )
        existing_ids = set(existing_docs["ids"])
        if len(existing_ids):
            data_dict = {id: (doc, meta) for id, doc, meta in zip(ids, documents, metadatas)}
            data_dict = {id: value for id, value in data_dict.items() if id not in existing_ids}
            if not data_dict:
                print(f"All data from {url} already exists in the database.")
                return
            ids = list(data_dict.keys())
            documents, metadatas = zip(*data_dict.values())
        self.collection.add(
            documents=documents,
            metadatas=metadatas,
            ids=ids
        )
        print(f"Successfully saved {url}. Total chunks count: {self.collection.count()}")

    def _format_result(self, results):
        return [
            (Document(page_content=result[0], metadata=result[1] or {}), result[2])
            for result in zip(
                results["documents"][0],
                results["metadatas"][0],
                results["distances"][0],
            )
        ]

    def get_llm_model_answer(self, prompt):
        raise NotImplementedError

    def retrieve_from_database(self, input_query):
        """
        Queries the vector database based on the given input query.
        Gets the relevant doc based on the query.

        :param input_query: The query to use.
        :return: The content of the document that matched your query.
        """
        result = self.collection.query(
            query_texts=[input_query,],
            n_results=1,
        )
        result_formatted = self._format_result(result)
        if result_formatted:
            content = result_formatted[0][0].page_content
        else:
            content = ""
        return content

    def generate_prompt(self, input_query, context):
        """
        Generates a prompt based on the given query and context, ready to be passed to an LLM.

        :param input_query: The query to use.
        :param context: Similar documents to the query used as context.
        :return: The prompt
        """
        prompt = f"""Use the following pieces of context to answer the query at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Query: {input_query}
Helpful Answer:
"""
        return prompt

    def get_answer_from_llm(self, prompt):
        """
        Gets an answer based on the given query and context by passing it
```
Lines 20 to 194 in 77c8a32
```
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [2023] [Taranjeet Singh]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
```
embedchain/embedchain/loaders/pdf_file.py
Lines 1 to 23 in 77c8a32
```python
from langchain.document_loaders import PyPDFLoader

from embedchain.utils import clean_string


class PdfFileLoader:
    def load_data(self, url):
        loader = PyPDFLoader(url)
        output = []
        pages = loader.load_and_split()
        if not len(pages):
            raise ValueError("No data found")
        for page in pages:
            content = page.page_content
            content = clean_string(content)
            meta_data = page.metadata
            meta_data["url"] = url
            output.append({
                "content": content,
                "meta_data": meta_data,
            })
        return output
```
I'm a bot that handles simple bugs and feature requests but I might make mistakes. Please be kind!