Demo URL: https://huggingface.co/spaces/bhaskartripathi/pdfChatter
Demo Video:
Despite the many fancy RAG solutions now available in open source and enterprise apps, pdfGPT remains one of the most accurate applications for precise responses. The first version was developed back in 2021 as one of the world's earliest open-source RAG solutions. To this day (Dec 2024), it remains among the most accurate ones thanks to its very simple and unique architecture: it uses no third-party frameworks such as LangChain, and it uses embeddings but no vector DB and no indexing. Even so, it does not compromise on response accuracy, which is more critical than a fancy UI. The library documentation below is a bit outdated, as I do not get enough time to maintain it. However, if there is more demand, I am ready to build an enterprise-grade RAG with the more sophisticated retrieval techniques available these days.
- Improved error handling
- PDF GPT now supports Turbo models and GPT-4, including the 16K and 32K token models.
- Pre-defined questions for auto-filling the input.
- Implemented Chat History feature.
If the response to a specific question in the PDF is poor with Turbo models, keep in mind that models such as gpt-3.5-turbo are chat-completion models and can give weak answers in cases where the embedding similarity is low. Despite OpenAI's claims, the Turbo models are not the best for Q&A. In those specific cases, either use the good old text-davinci-003 or use GPT-4 and above; these models consistently give the most relevant output.
- Support for Falcon, Vicuna, Meta Llama
- OCR Support
- Multiple PDF file support
- Node.js based web application - no trial, no API fees. 100% open source.
- When you pass a large text to OpenAI, it runs into the 4K token limit; it cannot take an entire PDF file as input.
- OpenAI sometimes becomes overly chatty and returns irrelevant responses not directly related to your query, because it uses poor embeddings.
- ChatGPT cannot directly talk to external data. Some solutions use LangChain, but it is token-hungry if not implemented correctly.
- There are a number of solutions, such as https://www.chatpdf.com, https://www.bespacific.com/chat-with-any-pdf, and https://www.filechat.io, but they have poor content quality and are prone to hallucination. One good way to avoid hallucinations and improve truthfulness is to use improved embeddings. To solve this problem, I propose improving the embeddings with the Universal Sentence Encoder family of models (read more here: https://tfhub.dev/google/collections/universal-sentence-encoder/1).
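As a sketch of why better embeddings help: with sentence-level embeddings (such as those produced by the Universal Sentence Encoder), chunk relevance reduces to cosine similarity between vectors. The function below is illustrative, not pdfGPT's actual code, and works with vectors from any consistent embedding source:

```python
import numpy as np

def rank_chunks(query_vec, chunk_vecs, top_n=5):
    """Rank chunk embeddings by cosine similarity to a query embedding.

    In a pdfGPT-style pipeline the vectors would come from the Universal
    Sentence Encoder (512-dimensional); any consistent embedding works here.
    Returns (indices of the top_n chunks, their similarity scores)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per chunk
    order = np.argsort(-scores)[:top_n]  # best-first
    return order, scores[order]
```

Better embeddings shift these scores so that genuinely relevant chunks rank above superficially similar ones, which is the core of the hallucination-avoidance argument above.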
- PDF GPT allows you to chat with an uploaded PDF file using GPT functionalities.
- The application intelligently breaks the document into smaller chunks and employs a powerful Deep Averaging Network Encoder to generate embeddings.
- A semantic search is first performed on your PDF content, and the most relevant embeddings are passed to OpenAI.
- Custom logic generates precise responses. The returned response can even cite, in square brackets ([]), the page number where the information is located, adding credibility to the responses and helping you locate pertinent information quickly. The responses are much better than OpenAI's naive responses.
- Andrej Karpathy mentioned in this post that KNN algorithm is most appropriate for similar problems: https://twitter.com/karpathy/status/1647025230546886658
- Enables APIs on Production using langchain-serve.
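A minimal sketch of the KNN retrieval step referenced above, using scikit-learn's `NearestNeighbors` (the class and method names here are illustrative, not pdfGPT's exact code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class SemanticSearch:
    """Illustrative KNN retriever over precomputed chunk embeddings."""

    def __init__(self, n_neighbors=5):
        self.nn = NearestNeighbors(n_neighbors=n_neighbors)

    def fit(self, chunk_embeddings):
        # chunk_embeddings: one row per text chunk.
        self.nn.fit(chunk_embeddings)

    def query(self, query_embedding):
        # Returns the indices of the nearest chunks to the query vector.
        _, idx = self.nn.kneighbors(query_embedding.reshape(1, -1))
        return idx[0]
```

The returned indices select which text chunks get embedded into the OpenAI prompt.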
Run docker-compose -f docker-compose.yaml up to use it with Docker Compose.
sequenceDiagram
participant User
participant System
User->>System: Enter API Key
User->>System: Upload PDF/PDF URL
User->>System: Ask Question
User->>System: Submit Call to Action
System->>System: Blank field Validations
System->>System: Convert PDF to Text
System->>System: Decompose Text to Chunks (150 word length)
System->>System: Check if embeddings file exists
System->>System: If file exists, load embeddings and set the fitted attribute to True
System->>System: If file doesn't exist, generate embeddings, fit the recommender, save embeddings to file and set fitted attribute to True
System->>System: Perform Semantic Search and return Top 5 Chunks with KNN
System->>System: Load Open AI prompt
System->>System: Embed Top 5 Chunks in Open AI Prompt
System->>System: Generate Answer with Davinci
System-->>User: Return Answer
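The chunking and embedding-cache steps in the diagram above can be sketched roughly as follows; `embed_fn` and the file path are placeholders, not pdfGPT's actual names:

```python
import os
import numpy as np

def to_chunks(text, words_per_chunk=150):
    """Decompose text into chunks of ~150 words, as in the diagram."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def load_or_build_embeddings(chunks, embed_fn, path="embeddings.npy"):
    """Load cached embeddings if the file exists; otherwise compute,
    save, and return them. The boolean mirrors the 'fitted' attribute."""
    if os.path.exists(path):
        return np.load(path), True
    emb = np.array([embed_fn(c) for c in chunks])
    np.save(path, emb)
    return emb, True
```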
flowchart TB
A[Input] --> B[URL]
A -- Upload File manually --> C[Parse PDF]
B --> D[Parse PDF] -- Preprocess --> E[Dynamic Text Chunks]
C -- Preprocess --> E[Dynamic Text Chunks with citation history]
E --Fit-->F[Generate text embedding with Deep Averaging Network Encoder on each chunk]
F -- Query --> G[Get Top Results]
G -- K-Nearest Neighbour --> K[Get Nearest Neighbour - matching citation references]
K -- Generate Prompt --> H[Generate Answer]
H -- Output --> I[Output]
I am looking for more contributors from the open source community who can take up backlog items voluntarily and maintain the application jointly with me.
This project, graphGita, is the first modern re-interpretation of the Bhagavad Gita that utilizes Knowledge Graphs for accurate query retrieval and quantifies philosophical aspects to serve specific problem-solution needs. The ambitious goal is to incorporate over 200 versions of Gita interpretations written over time by different past and present scholars and integrate them into a sophisticated Knowledge Graph, aided by modern retrieval technologies such as Monte Carlo Tree Search and KG-RAG, to provide a seamless multi-modal experience (text, image, and video) to users. My primary goal is to increase readers' comprehension of philosophical ideas while offering pertinent perspectives for modern readers. Based on the literature reviews of each of the 18 chapters and how they relate to one another, the text is formatted into a graph structure. This structure may grow more sophisticated and complex as the project progresses. https://github.com/bhaskatripathi/graphGita
This project is licensed under the MIT License. See the LICENSE.txt file for details.
If you use PDF-GPT in your research or wish to refer to the examples in this repo, please cite with:
@misc{pdfgpt2023,
author = {Bhaskar Tripathi},
title = {PDF-GPT},
year = {2023},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/bhaskatripathi/pdfGPT}}
}
pdfGPT's Issues
Using same file/url results in having to reload entire document and reload chunks etc.
Use case: we want to ask multiple questions using the same file (i.e., the file is an FAQ and we have a number of different, unrelated questions to ask about it, so this isn't a conversation, just plain questions).
Using the same file/URL results in having to reload the entire document and reload the chunks, etc. It would be nice if, when the URL or file is unchanged from the last submission, this process didn't need to occur, since we have already generated the recommender.
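One possible approach (illustrative names, not existing pdfGPT code): cache the fitted recommender keyed by a hash of the file bytes, so an unchanged file skips re-chunking and re-embedding:

```python
import hashlib

_recommender_cache = {}

def file_fingerprint(data: bytes) -> str:
    """Stable key for a file's contents."""
    return hashlib.sha256(data).hexdigest()

def get_recommender(data: bytes, build_fn):
    """Return a cached recommender for these bytes, building it only once.

    build_fn is a placeholder for whatever chunks, embeds, and fits the
    recommender in the real application."""
    key = file_fingerprint(data)
    if key not in _recommender_cache:
        _recommender_cache[key] = build_fn(data)
    return _recommender_cache[key]
```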
Got error when using Chinese
I really like your app; however, I uploaded a PDF that is in Chinese and also asked a question in Chinese, but when I ran it, I got an error.
When will it support Chinese PDFs? Really looking forward to it.
More features: formatting, actual chat, and show PDF
I wanted to see if we could improve the current feature set. The following is what I have in mind:
- Add formatting to answers (a new line between each citation; make the citations go from [4] to [Page 4] and make them a different color so they stand out, making them easy to reference)
- Have an actual chat-like interface (think humata, chatpdf)
- Show the actual pdf side-by-side (and maybe the citations in the answers can link to the page in the pdf viewer making it easy to reference)
These features would be dope. I'm looking into working on it myself, but I know there are people more talented who could look into this.
so where is the index data?
When preparing questions and answers, we typically follow these steps:
- Extract text from various data sources, such as websites, PDFs, CSVs, or plain text files. The extracted text can be saved in different locations.
- Create embeddings, which produce an index file as output.
- Answer questions by referencing the index created in the previous step.
However, when I run the API and app locally for this project, I cannot see the data generated in steps (1) and (2). I am curious where the data is stored and how it works.
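For reference, a minimal sketch of how an embeddings index could be persisted between steps (2) and (3); the path and format here are illustrative, and pdfGPT may simply keep everything in memory rather than writing an index file:

```python
import numpy as np

def save_index(path, chunks, embeddings):
    """Persist chunks and their embeddings together as one .npz file."""
    np.savez(path, chunks=np.array(chunks, dtype=object), emb=embeddings)

def load_index(path):
    """Load a previously saved index; returns (chunks, embeddings)."""
    data = np.load(path, allow_pickle=True)
    return list(data["chunks"]), data["emb"]
```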
server disconnected error
No module named 'gradio'
Just a suggestion. Maybe gradio should be included in the requirements.
I had to install it manually.
cant lc serve
'lc-serve' is not recognized as an internal or external command,
operable program or batch file.
Fascinating - Can I run this locally with Vicuna? and any vector index?
Thanks Bhaskar - can the above be done? Happy to chat more
Hallucinating on the page numbers
I tried it and it did pretty well, but I found that the pages it gives me to refer to are not accurate.
Prompt Optimization by upto 50% more token input
By removing whitespace and trimming text to the most discernible sentences possible, we can fit up to a staggering 50% more tokens inside the prompt, and hence support even larger PDF sizes. The algorithm used will be taken from a reputable source, linked after the issue is completed.
Please allow me to be assigned to this issue, and I will make a stable release sometime in June. Thank you.
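The core of the idea might look like this minimal sketch (the actual algorithm the issue author has in mind may be more sophisticated, and real savings depend heavily on the PDF):

```python
import re

def compact(text: str) -> str:
    """Collapse runs of whitespace (spaces, newlines, tabs) to single
    spaces so more content fits within the model's token budget."""
    return re.sub(r"\s+", " ", text).strip()
```

PDF-extracted text is often riddled with layout whitespace, so this alone can reclaim a meaningful fraction of the prompt.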
Requirements.txt not complete
It was missing fitz and frontend.
about method of TopN
deleted
API key save in .env for one time key save
I have tried other GPT apps for PDFs. Your app is great, but it needs one option, I think (a suggestion):
If we could save the API key permanently, there would be no need to add it every time.
Can you implement it?
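A stdlib-only sketch of the suggestion; the variable name OPENAI_API_KEY and the KEY=VALUE file format are assumptions, not pdfGPT's current behavior:

```python
import os

def load_env(path=".env"):
    """Minimal .env parser: KEY=VALUE lines; '#' comment lines ignored.
    Does nothing if the file is absent. Existing env vars win."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                k, v = line.split("=", 1)
                os.environ.setdefault(k.strip(), v.strip())

def get_api_key(path=".env"):
    """Read the key from the environment, loading the .env file once."""
    load_env(path)
    return os.environ.get("OPENAI_API_KEY")
```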
LICENSE
Please insert license.
Thanks you,
Best regards.
pip install seems to take a long time
Did anyone see this issue? For me, it has been going on for a long time.
Create a docker
Please create a docker file for run pdfGPT.
Can't run the code
Hi.
I have tried running your code on both Windows and an Ubuntu VM. In both cases I had to pip install the fitz and frontend libs in addition to what requirements.txt contains. Again, in both cases I get this error:
File "/home/parallels/Documents/PythonScripts/PDFGPT/app.py", line 2, in <module>
import fitz
File "/home/parallels/Documents/PythonScripts/PDFGPT/PDFGPT/lib/python3.10/site-packages/fitz/__init__.py", line 1, in <module>
from frontend import *
File "/home/parallels/Documents/PythonScripts/PDFGPT/PDFGPT/lib/python3.10/site-packages/frontend/__init__.py", line 1, in <module>
from .events import *
File "/home/parallels/Documents/PythonScripts/PDFGPT/PDFGPT/lib/python3.10/site-packages/frontend/events/__init__.py", line 1, in <module>
from .clipboard import *
File "/home/parallels/Documents/PythonScripts/PDFGPT/PDFGPT/lib/python3.10/site-packages/frontend/events/clipboard.py", line 2, in <module>
from ..dom import Event
File "/home/parallels/Documents/PythonScripts/PDFGPT/PDFGPT/lib/python3.10/site-packages/frontend/dom.py", line 439, in <module>
from . import dispatcher
File "/home/parallels/Documents/PythonScripts/PDFGPT/PDFGPT/lib/python3.10/site-packages/frontend/dispatcher.py", line 15, in <module>
from . import config, server
File "/home/parallels/Documents/PythonScripts/PDFGPT/PDFGPT/lib/python3.10/site-packages/frontend/server.py", line 24, in <module>
app.mount(config.STATIC_ROUTE, StaticFiles(directory=config.STATIC_DIRECTORY), name=config.STATIC_NAME)
File "/home/parallels/Documents/PythonScripts/PDFGPT/PDFGPT/lib/python3.10/site-packages/starlette/staticfiles.py", line 57, in __init__
raise RuntimeError(f"Directory '{directory}' does not exist")
RuntimeError: Directory 'static/' does not exist
Any ideas?
Thank you.
gradio api not working
Attribute error
I'm getting this error message when I run the code:
It happens after putting in the API key and uploading a PDF.
AttributeError: 'SemanticSearch' object has no attribute 'nn'
And I can't find out why it's happening.
Too slow
After I downloaded and installed all the dependencies, I ran python app.py, but it just sits there and does nothing, even after 30 minutes. What could be going on?
more errors
It works for a while then:
AttributeError: 'SemanticSearch' object has no attribute 'nn'
How to run this locally?
This is by far the best chat-with-PDF kind of app. I would really appreciate a comprehensive guide on how to set it up locally, since I am getting a lot of errors.
Add ChatMemory to pdfGPT
It would be super valuable to have a chat memory / history for pdfGPT.
See here a langchain example:
https://python.langchain.com/en/latest/modules/chains/index_examples/chat_vector_db.html
Feature: Talk to youtube video transcriptions
Problem with class "SemanticSearch"
When running the application I get an error that says:
"AttributeError: 'SemanticSearch' object has no attribute 'nn'"
Any clues as to why, and how to fix the issue?
Anyway, I love the idea. Thank you for the repo!
Error loading model
Error: Trying to load a model of incompatible/unknown type. 'C:\Users\User\AppData\Local\Temp\tfhub_modules\063d866c066fd46003be952409c' contains neither 'saved_model.pb' nor 'saved_model.pbtxt'.
I thought this uses the OpenAI API. Why is it trying to load a local model?
BUG: The hugging space demo is not working.
Unable to pull a particular Docker layer of pdfchatter
Hi, I ran the docker pull command as suggested in the README, but I get the following output.
docker pull registry.hf.space/bhaskartripathi-pdfchatter:latest
latest: Pulling from bhaskartripathi-pdfchatter
bd8f6a7501cc: Pull complete
44718e6d535d: Pull complete
efe9738af0cb: Pull complete
f37aabde37b8: Pull complete
3923d444ed05: Pull complete
1ecef690e281: Pull complete
48673bbfd34d: Pull complete
b761c288f4b0: Pull complete
4ea6ac43d369: Pull complete
aa9e20aea25a: Extracting [==================================================>] 99.49MB/99.49MB
63248b4e37e2: Download complete
5806ef4fec33: Download complete
ec89491cf0cd: Download complete
e662a12eee66: Download complete
46995db4b389: Download complete
7d67ad956d91: Download complete
b025d72cdd42: Download complete
0bbbfa67eeab: Download complete
66aa17d0dc7e: Download complete
failed to register layer: Error processing tar file(exit status 1): archive/tar: invalid tar header
Is there maybe something wrong with the aa9e20aea25a layer?
Alternative to OpenAI?
Do you see any alternative to OpenAI? I mean, a free open source alternative?
Telegram PDF Chat
Can we have this in Telegram, with the PDF file placed in a directory folder?
May need a better way tokenize characters...
Hello,
I recently encountered an issue while using your open source project. When I tried to use the project with Chinese characters, I received the following error message:
openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 5803 tokens (1707 in your prompt; 4096 for the completion). Please reduce your prompt; or completion length.
I believe this issue might be due to a possible miscalculation in the token count for Chinese characters. I understand that the GPT model tokenizes text differently based on the language, and it is possible that the algorithm isn't accurately calculating the token count for Chinese text. This leads to an incorrect total token count and subsequently the InvalidRequestError.
To better diagnose and resolve this issue, I kindly request you to look into the algorithm's handling of Chinese characters, specifically in the tokenization process. It would be greatly appreciated if you could provide any guidance or potential fixes for this issue.
Thank you for your time and effort in maintaining this project. I'm looking forward to your response.
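One defensive fix, independent of the language: count tokens with the model's real tokenizer (e.g. tiktoken for OpenAI models, where CJK text often costs 1-3 tokens per character) rather than estimating from characters, and trim the chunks to a budget. The sketch below takes the tokenizer's encode function as a parameter, so any tokenizer can be plugged in:

```python
def fit_to_budget(chunks, encode, max_prompt_tokens):
    """Keep whole chunks while their combined token count fits the budget.

    encode: a function mapping text to a list of tokens (e.g. the encode
    method of a tiktoken encoding). Returns (kept_chunks, tokens_used)."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(encode(chunk))
        if used + n > max_prompt_tokens:
            break
        kept.append(chunk)
        used += n
    return kept, used
```

Applying this before building the prompt would prevent the InvalidRequestError above regardless of how many tokens each character expands into.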
Getting error when running
using python3:
File "/Users/m3kwong/PythonCode/LLM/pdfGPT-main/app.py", line 2, in <module>
import fitz
File "/Users/m3kwong/PythonCode/LLM/pdfGPT-main/new/lib/python3.10/site-packages/fitz/__init__.py", line 1, in <module>
from frontend import *
File "/Users/m3kwong/PythonCode/LLM/pdfGPT-main/new/lib/python3.10/site-packages/frontend/__init__.py", line 1, in <module>
from .events import *
File "/Users/m3kwong/PythonCode/LLM/pdfGPT-main/new/lib/python3.10/site-packages/frontend/events/__init__.py", line 1, in <module>
from .clipboard import *
File "/Users/m3kwong/PythonCode/LLM/pdfGPT-main/new/lib/python3.10/site-packages/frontend/events/clipboard.py", line 2, in <module>
from ..dom import Event
File "/Users/m3kwong/PythonCode/LLM/pdfGPT-main/new/lib/python3.10/site-packages/frontend/dom.py", line 439, in <module>
from . import dispatcher
File "/Users/m3kwong/PythonCode/LLM/pdfGPT-main/new/lib/python3.10/site-packages/frontend/dispatcher.py", line 15, in <module>
from . import config, server
File "/Users/m3kwong/PythonCode/LLM/pdfGPT-main/new/lib/python3.10/site-packages/frontend/server.py", line 24, in <module>
app.mount(config.STATIC_ROUTE, StaticFiles(directory=config.STATIC_DIRECTORY), name=config.STATIC_NAME)
File "/Users/m3kwong/PythonCode/LLM/pdfGPT-main/new/lib/python3.10/site-packages/starlette/staticfiles.py", line 57, in __init__
raise RuntimeError(f"Directory '{directory}' does not exist")
RuntimeError: Directory 'static/' does not exist
M1 Mac Tensorflow
Anyone able to get this to work on MacOS? I am trying to get this working in a virtual environment with Python 3.10 and have to use tensorflow-macos and tensorflow-metal.
Unsure what to change in requirements but I keep getting errors when trying to install all the dependencies.
On mobile right now but will update with full errors when I get back to my laptop.
Error using app webpage
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/routes.py", line 414, in run_predict
output = await app.get_blocks().process_api(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/blocks.py", line 1320, in process_api
result = await self.call_function(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gradio/blocks.py", line 1048, in call_function
prediction = await anyio.to_thread.run_sync(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/Users/taoruifu/work/projects/llm/pdfGPT/app.py", line 49, in ask_api
raise ValueError(f'[ERROR]: {r.text}')
Cannot connect to "0.0.0.0:8080" after running "lc-serve deploy local api"
I have issues with the local playground. When I deploy langchain-serve locally, it seems to work, it says "Flow is ready to serve!" on "0.0.0.0:8080", but when I tried to access one of the listed endpoints from the browser it says that "This site can't be reached".
Deploy `pdfGPT` as APIs locally/on cloud using `langchain-serve`
Repo - langchain-serve.
- Exposes APIs from function definitions locally as well as on the cloud.
- Very few lines of code changes and ease of development remain the same as local.
- Supports both REST & WebSocket endpoints
- Serverless/autoscaling endpoints with automatic tls certs on the cloud.
- Real-time streaming, human-in-the-loop support - which is crucial for chatbots.
We can extend the simple existing app pdf-qna on langchain-serve.
Disclaimer: I'm the primary author of langchain-serve.
error cant create file already exists
Every time I try to load a local PDF, it answers the first question, but after that it errors out with "cant create file that already exists".
docker-compose up error
[+] Running 2/0
✔ Container pdfgpt-pdf-gpt-1 Created 0.0s
✔ Container pdfgpt-langchain-serve-1 Created 0.0s
Attaching to pdfgpt-langchain-serve-1, pdfgpt-pdf-gpt-1
pdfgpt-langchain-serve-1 |
pdfgpt-pdf-gpt-1 | Traceback (most recent call last):
pdfgpt-pdf-gpt-1 | File "app.py", line 92, in <module>
pdfgpt-pdf-gpt-1 | demo.app.server.timeout = 60000 # Set the maximum return time for the results of accessing the upstream server
pdfgpt-pdf-gpt-1 | AttributeError: 'App' object has no attribute 'server'
pdfgpt-langchain-serve-1 | ────────────────── Flow is ready to serve! ──────────────────
pdfgpt-langchain-serve-1 | ╭────────────── Endpoint ──────────────╮
pdfgpt-langchain-serve-1 | │ Protocol HTTP │
pdfgpt-langchain-serve-1 | │ Local 0.0.0.0:8080 │
pdfgpt-langchain-serve-1 | │ Private 172.25.0.3:8080 │
pdfgpt-langchain-serve-1 | ╰──────────────────────────────────────╯
pdfgpt-langchain-serve-1 | ╭────────── HTTP extension ────────────╮
pdfgpt-langchain-serve-1 | │ Swagger UI .../docs │
pdfgpt-langchain-serve-1 | │ Redoc .../redoc │
pdfgpt-langchain-serve-1 | ╰──────────────────────────────────────╯
pdfgpt-langchain-serve-1 | Do you love open source? Help us improve Jina in just 1 minute and 30 seconds by
pdfgpt-langchain-serve-1 | taking our survey:
pdfgpt-langchain-serve-1 | https://10sw1tcpld4.typeform.com/jinasurveyfeb23?utm_source=jina (Set environment
pdfgpt-langchain-serve-1 | variable JINA_HIDE_SURVEY=1 to hide this message.)
pdfgpt-pdf-gpt-1 exited with code 1
It isn't working ... the seconds timer keeps going.
Am I missing something? How long should one wait for a response?
Demo not working
Hello,
Great work, only want to try.
Am I the only one who cannot run the demo? It loads the PDF, I paste the key, then ask the question.
The result is "Error", even though I tried several PDF files.
Any suggestion ?
Thank you
create api to use it
Docker?
I would like to run this in docker for unraid. Is this something you can try?
Multiple PDF Files
What would it take to support Multiple PDF File Import and providing search capability?
Text Not Found in PDF.
Waiting gateway...
Upload PDF Display Error
This seems to be due to the PDF file being too large? What is the supported file content size?
can I make it work with only local models (no API calls)?
For privacy and money constraints.
just errors
uploaded a test document, background.pdf, asked it to give feedback, returned an error
Key Error:'name'
Request: GPT-3.5-turbo
Is there any specific reason (like less hallucination) why we are using text-davinci-003 over gpt-3.5-turbo?
It would be nice if there were an option to switch to gpt-3.5-turbo.
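For illustration, adapting a completion-style prompt for the chat endpoint mostly means restructuring it into messages: the instructions move to a system message, and the retrieved chunks plus the question form the user message. The payload shape below follows the OpenAI chat format; the system-prompt wording is an assumption, not pdfGPT's actual prompt:

```python
def to_chat_payload(prompt, question, model="gpt-3.5-turbo"):
    """Build an OpenAI chat-format request body from a completion-style
    prompt (the retrieved chunks) and the user's question. Sending it
    still requires an API client and key."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer only from the provided search results and "
                        "cite page numbers in square brackets."},
            {"role": "user", "content": f"{prompt}\n\nQuery: {question}"},
        ],
    }
```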