KBVQA (Knowledge-based VQA)

Ask questions about an image and get answers that go beyond what's visible.

1. Introduction

Visual Question Answering (VQA) is the task of answering questions accurately based on the content of an image. However, most VQA approaches are limited to the information present in the image itself and cannot infer answers that require knowledge beyond what is immediately observable. This limitation is a significant challenge, and many researchers are working to improve the accuracy and reasoning capabilities of VQA models by giving them access to external knowledge.

In this project, I developed a simple web application called KBVQA (Knowledge-based Visual Question Answering). It uses the BLIP-2 model with the Flan T5-xxl large language model to provide answers that go beyond the visible details in an image.

2. Literature review

OK-VQA (Outside Knowledge VQA) is widely used to evaluate how well different methods handle questions that require outside knowledge. The image below shows the results of existing systems with OK-VQA as the evaluation dataset (from the paper PROMPTCAP: Prompt-Guided Image Captioning for VQA with GPT-3 by Hu et al.).

These systems use several types of image representation, including captions and visual features, and combine them with the knowledge stored in LLMs or even with Wikidata to improve retrieval of outside knowledge. Because I wanted to deploy on a lightweight machine (without GPUs), I chose BLIP-2 ViT-G Flan T5-xxl: it is supported by the LAVIS library and is easy to implement.

3. Technical Overview

3.1. Model

BLIP-2, introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Li et al., consists of three components: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model. The image encoder and language model are initialized from pre-trained checkpoints and kept frozen during training. The Q-Former, a BERT-like Transformer encoder, maps a set of "query tokens" to query embeddings, bridging the gap between the image encoder and the language model. The model's goal is to predict the next text token based on the query embeddings and previous text.
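As a rough illustration of how such a model can be queried, here is a minimal sketch using the LAVIS library. The `build_prompt` and `answer` helper names are hypothetical, and the "Question: ... Answer:" template follows the LAVIS BLIP-2 usage example rather than anything confirmed for this project. The heavy imports are deferred into the function so the module can be imported without loading the model.

```python
def build_prompt(question: str) -> str:
    """Wrap a user question in a VQA prompt template.

    Assumption: the "Question: ... Answer:" format from the LAVIS BLIP-2
    example; the project's actual prompt may differ.
    """
    return f"Question: {question.strip()} Answer:"


def answer(image_path: str, question: str) -> str:
    """Load BLIP-2 (Flan T5-xxl) through LAVIS and answer one question.

    Hypothetical helper. Loading the model is slow and memory-hungry, so a
    real service would load it once at startup instead of per call.
    """
    # Deferred imports: only needed when the model is actually used.
    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip2_t5",
        model_type="pretrain_flant5xxl",
        is_eval=True,
        device=device,
    )

    # Preprocess the image into a batched tensor of shape (1, 3, H, W).
    raw_image = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # generate() returns a list of strings, one per image in the batch.
    outputs = model.generate({"image": image, "prompt": build_prompt(question)})
    return outputs[0]
```

Calling `answer("photo.jpg", "Which city is this?")` would download the checkpoint on first use, so it is best tried on the machine that will host the backend.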

3.2. Backend

The backend of the KBVQA application is built with FastAPI, a modern web framework for building APIs with Python. FastAPI allows easy setup and deployment of API endpoints, handling image processing and interaction with the BLIP-2 model to generate answers based on user queries.
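A minimal sketch of what such an endpoint could look like is shown below. The `/vqa` route, the `image` and `question` field names, and the `handle_query` and `create_app` helpers are all assumptions made for illustration, not the project's actual API. The request logic is kept in a plain function so it can be exercised without a running server, and FastAPI is imported only when the app is built.

```python
from typing import Callable

# Hypothetical signature: raw image bytes plus a question in, answer text out.
AnswerFn = Callable[[bytes, str], str]


def handle_query(image_bytes: bytes, question: str, answer_fn: AnswerFn) -> dict:
    """Validate one request and delegate to the model wrapper."""
    question = question.strip()
    if not image_bytes:
        raise ValueError("empty image")
    if not question:
        raise ValueError("empty question")
    return {"question": question, "answer": answer_fn(image_bytes, question)}


def create_app(answer_fn: AnswerFn):
    """Build a FastAPI app exposing a single (hypothetical) /vqa route."""
    # Deferred import so handle_query can be used without FastAPI installed.
    from fastapi import FastAPI, File, Form, HTTPException, UploadFile

    app = FastAPI()

    @app.post("/vqa")
    async def vqa(image: UploadFile = File(...), question: str = Form(...)):
        try:
            return handle_query(await image.read(), question, answer_fn)
        except ValueError as exc:
            # Map validation failures to a 400 response.
            raise HTTPException(status_code=400, detail=str(exc))

    return app
```

Note that FastAPI needs the `python-multipart` package to parse form and file uploads. The resulting app could then be served with an ASGI server such as uvicorn on port 8001.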

3.3. Frontend

The frontend of the KBVQA application is developed using Vite and ReactJS. Vite is a build tool that provides a faster and leaner development experience for modern web projects, while ReactJS is a popular JavaScript library for building user interfaces. This combination allows for a responsive and interactive user experience, enabling users to upload images, ask questions, and receive answers seamlessly.

3.4. Packaging

The KBVQA application is packaged using Docker and Docker Compose. Docker automates the deployment of applications in containers, while Docker Compose manages multi-container applications. This setup ensures easy and consistent deployment across different environments.

4. Installation

There are 2 separate applications: one for backend and one for frontend. After cloning this repo, you have to cd into each directory (backend and frontend) to install each application.

Warning: Because the BLIP-2 Flan T5-xxl model consumes a large amount of memory (~50GB), it is recommended to install the backend app on a virtual machine (Google Cloud, AWS, DigitalOcean, etc.) with plenty of memory. The lightweight frontend app, on the other hand, can easily be installed on a personal computer.
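The ~50GB figure is roughly what a back-of-the-envelope estimate gives. The sketch below assumes about 12.1 billion total parameters for BLIP-2 ViT-g Flan T5-xxl (the figure I recall from the BLIP-2 paper, so treat it as an assumption) and counts only the weights, ignoring activations and framework overhead.

```python
def model_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: approximate total parameter count (image encoder + Q-Former + LLM).
TOTAL_PARAMS = 12.1e9

# Bytes per parameter for a few common numeric precisions.
for precision, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{precision}: ~{model_memory_gb(TOTAL_PARAMS, width):.1f} GB")
```

At 4 bytes per parameter (float32, the usual default on CPU) the weights alone come to about 48 GB, which is consistent with the warning above. Lower precisions would shrink this, but whether they work well on a CPU-only machine is not something this README covers.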

1. Install Docker and Docker Compose (both machines)

If they are already installed then skip this step.

2. Clone the repository (both machines)

   git clone https://github.com/minhquanBr2/kb-vqa.git

3. Build and run backend app

   cd backend
   docker-compose build --no-cache
   docker-compose up

The backend app will run on port 8001.

4. Build and run frontend app

   cd frontend

After this step, remember to add a .env file in the frontend directory, replacing <SERVER_IP> with your own server IP:

   VITE_REACT_APP_SERVER_URL=http://<SERVER_IP>:8001

Then build and run the app:

   docker-compose build --no-cache
   docker-compose up

The app is then available at http://localhost:8080. Please make sure that the machine hosting the frontend app can reach the backend app, either directly over the network or through SSH.
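One quick way to confirm that the frontend machine can reach the backend is to try opening a TCP connection to port 8001. The following is a small sketch using only the Python standard library; the `is_reachable` helper name is hypothetical.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `is_reachable("<SERVER_IP>", 8001)` (with your own server IP substituted) should return True once the backend container is up. A False result usually points to a firewall rule, a wrong IP, or the backend not running yet.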

Note: Prefix the shell commands with sudo if you receive a Permission denied error.

5. Usage

Below is my application's user interface.

The app supports 3 methods of sending images:

  • Drag-and-drop
  • Browse from file
  • Paste URL

After being uploaded, the image is displayed at full resolution, along with an editable text box for entering the question. Click Submit and wait until the sample answer (or previous answer) is replaced by the current one. Note that the Submit button is disabled while the query is being processed.

Here are some sample queries:

6. References
