QueryPDF AI

Overview

A simple project that uses the OpenAI API to answer questions about specific PDF files and summarize their content. It uses the LangChain library to integrate the OpenAI API with the Chroma vector database and handle queries.

Requirements

  • Docker and Docker Compose installed on the system.

Installation and Setup

  • Obtain an OpenAI API key from OpenAI Platform.
  • In the backend project directory, run cp .env.example .env.
  • Set OPENAI_API_KEY in the .env file to your OpenAI API key.
  • Set CHROMA_HOST in the .env file to chroma for container compatibility.
  • Run docker-compose up -d --build to start the services.

Note: if you are on Windows and get the error ./entrypoint.sh: no such file or directory, consider cloning the repo with the following config:

git clone git@github.com:Dawsoncodes/querypdf-ai.git --config core.autocrlf=false

This setup will start three services:

  • The Chroma vector database server.
  • The Flask backend API.
  • The Next.js frontend UI.

API Endpoints

  • POST /chat: Engage with the AI assistant. Use the query parameter file to specify a particular file.
  • POST /summarize: Summarize the contents of a file. Use the query parameter file to specify the file.
  • GET /_health: Checks backend readiness, crucial for the frontend.
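
A quick way to try these endpoints from Python (a minimal sketch: the base URL, port, and the shape of the JSON body are assumptions, not documented here; only the paths and the file query parameter come from this README):

    import requests

    BASE_URL = "http://localhost:5000"  # assumed backend address

    # Wait until the backend reports it is ready.
    requests.get(f"{BASE_URL}/_health", timeout=5).raise_for_status()

    # Ask a question about a specific file (the JSON body shape is assumed).
    chat = requests.post(
        f"{BASE_URL}/chat",
        params={"file": "example.pdf"},
        json={"question": "What is this document about?"},
    )
    print(chat.json())

    # Summarize the same file.
    summary = requests.post(f"{BASE_URL}/summarize", params={"file": "example.pdf"})
    print(summary.json())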

Usage

  • After setup, the UI will be accessible for interacting with the AI assistant and summarizing documents.
  • Use the API endpoints to interact programmatically with the services.

Important Notes

If you want to understand a bit about the code and how it works, read this section.

  • save_embeddings.py: This script saves the embeddings of the documents in the Chroma vector database.

    The script runs only once, when the container is built and the backend is started. The Flask server will not start immediately; it waits for the embeddings to be saved in the database. Because of the large number of documents and the response times of the OpenAI API, this process can take a long time, so please be patient.

    The script only adds new documents to the database, so you don't have to worry about duplicate documents even if you shut down the containers in the middle of the process (a minimal sketch of this idea appears after this list).

    The script uses another script, format_docs.py, to format the documents in a way that is suitable for the Chroma vector database. There are a lot of improvements that could be made here, but this is the minimal solution I could come up with.

    After format_docs.py is done, save_embeddings.py starts saving the embeddings in the database.

  • Reset the Chroma database: If you want to reset the Chroma database, edit the entrypoint.sh file and add --reset to the python3 save_embeddings.py command, then rebuild the containers with docker-compose up -d --build.

    python3 save_embeddings.py --reset
  • entrypoint.sh: This script runs save_embeddings.py and starts the Flask server. It does a few things:

    • The Chroma server doesn't start right away; it installs some dependencies after the image is built and run, so the script continuously checks whether the Chroma server is ready to accept connections. Once Chroma is ready, it runs save_embeddings.py.
    • After save_embeddings.py is done, it starts the Flask server using gunicorn.
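
To make the notes above more concrete, here is a minimal sketch of the ideas behind save_embeddings.py: wait for the Chroma server to accept connections, then add only chunks that are not already stored. This is not the project's actual code; the collection name, the text_segments.csv column name, and the langchain_openai import are assumptions.

    import hashlib
    import os
    import time

    import chromadb
    import pandas as pd
    from langchain_openai import OpenAIEmbeddings  # assumed LangChain package layout

    CHROMA_HOST = os.getenv("CHROMA_HOST", "chroma")
    COLLECTION_NAME = "documents"  # assumed collection name


    def wait_for_chroma(host, port=8000):
        """Poll the Chroma server until it accepts connections, like entrypoint.sh does."""
        while True:
            try:
                client = chromadb.HttpClient(host=host, port=port)
                client.heartbeat()
                return client
            except Exception:
                time.sleep(1)


    def save_embeddings():
        client = wait_for_chroma(CHROMA_HOST)
        collection = client.get_or_create_collection(COLLECTION_NAME)
        embedder = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment

        # Assume format_docs.py has produced cleaned text chunks in this CSV.
        df = pd.read_csv("text_segments.csv")
        texts = df["text"].astype(str).tolist()  # column name is assumed

        # Deterministic IDs make the script idempotent: re-running it only
        # adds chunks that are not already in the collection.
        ids = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
        existing = set(collection.get(ids=ids)["ids"])
        new = [(i, t) for i, t in zip(ids, texts) if i not in existing]

        if new:
            new_ids, new_texts = zip(*new)
            vectors = embedder.embed_documents(list(new_texts))
            collection.add(ids=list(new_ids), documents=list(new_texts), embeddings=vectors)


    if __name__ == "__main__":
        save_embeddings()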

The frontend UI

The frontend is written in Next.js. It is a simple project with 2 pages:

  • The home page: From this page the user will select the document that they want to ask questions about or summarize.
  • The chat page: From this page the user can ask questions about the document and get answers from the AI assistant; they can also summarize the document.

The APIs used by the frontend

There are 3 APIs used by the Next.js frontend:

  • GET /_health: Used to check whether the backend is ready to accept connections. The frontend pings the backend every second until it gets a success response, then shows the contents of the page (see the sketch after this list).
  • POST /chat: Used to ask questions about the document. It sends the question to the backend and gets the answer from the AI assistant.
  • POST /summarize: Used to summarize the document. It sends the document to the backend and gets the summary from the AI assistant.
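
The readiness check is just a polling loop. The real frontend does this in Next.js; the Python sketch below only illustrates the pattern (the URL is an assumption):

    import time

    import requests


    def wait_for_backend(url="http://localhost:5000/_health"):
        """Ping /_health once per second until the backend returns a success response."""
        while True:
            try:
                if requests.get(url, timeout=2).ok:
                    return
            except requests.ConnectionError:
                pass
            time.sleep(1)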

What can be improved?

There are a lot of things that can be improved in this project, but here are some of the most important ones:

  • Improve the formatting of the documents: The format_docs.py script can be improved a lot. After formatting, some documents still contain a lot of unnecessary text and meaningless numbers; with a better script, we could remove all of that and keep only the important text.

  • Making the AI bot answer more intelligently: Currently, the AI bot answers only from the document, but it doesn't understand the context of the question, so if you ask it a question that is not covered by the document, it will still try to answer it. We could make this better by having multiple models and choosing between them based on the question: for example, if the question is about a specific document, use the model trained on that document, and if the question is about a general topic, use a model trained for general topics.

  • Using streaming for better user experience: Currently, the Flask server waits for the OpenAI API to return the full response, which can take a long time, so the user has to wait. We could improve this with streaming, so the user sees the response as soon as it starts arriving.

  • Dynamic PDF file upload: Currently, there are only 3 documents, all of which are part of the text_segments.csv file. We could improve this by letting the user upload their own PDF files, then formatting them and generating embeddings for them so the AI assistant can answer questions about them and summarize them.

  • Improve the summarization: The summarization takes a lot of time. A better way to do this is to run it as a background job, notify the user when it's done, and cache the result so the same summary can be viewed again without having to be generated from scratch.

  • Save embeddings in parallel: Currently, the embeddings are saved one by one. We could improve this by saving them in parallel, which would reduce the time it takes to save the embeddings (a sketch follows below).
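
As a sketch of the last point, chunks could be embedded in batches on a thread pool instead of one by one. The batch size, worker count, and langchain_openai import are assumptions, and whether concurrent requests are appropriate depends on OpenAI rate limits:

    from concurrent.futures import ThreadPoolExecutor

    from langchain_openai import OpenAIEmbeddings


    def embed_in_parallel(texts, batch_size=50, workers=4):
        """Embed text chunks in parallel batches; results keep the input order."""
        embedder = OpenAIEmbeddings()
        batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Each batch becomes one embeddings request, executed concurrently.
            results = list(pool.map(embedder.embed_documents, batches))
        return [vector for batch in results for vector in batch]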

License

This project is licensed under the MIT License.
