shaikhmubin02 / recognize-ai

A Gemini-style demo built on the GPT-4 Vision API.

Home Page: https://recognize-ai.vercel.app

License: MIT License

JavaScript 80.60% CSS 2.53% TypeScript 16.87%

Recognize-AI

This project is an experimental exploration of real-time AI interactions, inspired by Google's "Gemini" demo video. The aim is a system where users can talk to an AI assistant in real time, much like the interactions shown in that video.

Demo

recognize-demo

Overview

The project leverages the GPT-4 Vision API to achieve real-time AI interactions. The primary goal is to enable users to stream video input to the assistant, allowing them to ask questions and receive responses without directly interacting with the UI.

Technical Constraints

  • Real-Time Interaction: The system must support real-time interactions between the user and the AI assistant.
  • Video Streaming: Users should be able to stream a video as input to the assistant.
  • Voice Interaction: Users must be able to communicate with the assistant verbally without interacting with the UI.
  • Video Analysis: The assistant should analyze the video input to understand the user's questions.
  • Speech Response: The assistant must respond verbally to the user's queries.

Implementation

Video Streaming

  • To address the challenge of streaming video input, a grid of screenshots is created from the video at regular intervals. This grid is then treated as a single image representing the video stream, allowing for easier processing.
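As a rough illustration of the grid idea, the sketch below computes where each captured frame would sit inside a single grid image. The function name `layoutFrameGrid`, the `Tile` shape, and the column count and tile size are assumptions made for this example, not values taken from the project.

```typescript
// Hypothetical helper: the position of one captured frame inside the grid image.
interface Tile {
  index: number; // capture order, 0 = oldest frame
  x: number;     // left edge in pixels
  y: number;     // top edge in pixels
  width: number;
  height: number;
}

interface GridLayout {
  width: number;  // full grid width in pixels
  height: number; // full grid height in pixels
  tiles: Tile[];
}

// Lays frames out left to right, top to bottom, so reading order matches time order.
function layoutFrameGrid(
  frameCount: number,
  columns: number,
  tileWidth: number,
  tileHeight: number
): GridLayout {
  if (frameCount < 1 || columns < 1) {
    throw new RangeError("frameCount and columns must be at least 1");
  }
  const rows = Math.ceil(frameCount / columns);
  const tiles: Tile[] = [];
  for (let i = 0; i < frameCount; i++) {
    tiles.push({
      index: i,
      x: (i % columns) * tileWidth,
      y: Math.floor(i / columns) * tileHeight,
      width: tileWidth,
      height: tileHeight,
    });
  }
  return { width: columns * tileWidth, height: rows * tileHeight, tiles };
}
```

In a browser, each tile could then be painted onto a canvas with `drawImage(video, x, y, width, height)` and the canvas exported as one image; that step is left out here so the sketch stays free of DOM dependencies.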

AI Integration

  • The GPT-4 Vision API is used to reason about user queries based on the video input. The system prompt is worded to emphasize the temporal nature of the images, so that the assistant reasons over them as a sequence rather than as unrelated pictures.
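A minimal sketch of how such a request might be assembled is shown below. The message shape follows the text plus `image_url` content parts used by OpenAI's chat API for image input; the function name `buildVisionMessages`, its parameters, and the system prompt wording are assumptions for illustration and are not the project's actual prompt.

```typescript
// Content parts in the shape used by OpenAI chat messages that carry images.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatMessage {
  role: "system" | "user";
  content: string | ContentPart[];
}

// Hypothetical builder: pairs the user's question with the frame grid and
// tells the model that the grid tiles form a time-ordered sequence.
function buildVisionMessages(
  question: string,
  gridDataUrl: string,
  frameCount: number,
  intervalMs: number
): ChatMessage[] {
  // Assumed wording; the real system prompt is not given in this README.
  const systemPrompt = [
    `The image is a grid of ${frameCount} frames captured from a video,`,
    `about ${intervalMs} ms apart, ordered left to right, top to bottom.`,
    "Treat the frames as a time sequence when answering.",
  ].join(" ");

  return [
    { role: "system", content: systemPrompt },
    {
      role: "user",
      content: [
        { type: "text", text: question },
        { type: "image_url", image_url: { url: gridDataUrl } },
      ],
    },
  ];
}
```

The returned array would be passed as the `messages` field of a chat completion request; the model name and the client call are omitted because they depend on the SDK version in use.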

Voice Interaction

  • User speech is detected and transcribed using the Whisper API. The transcript is sent, together with the image grid, to the GPT-4 Vision API, and the model's answer is then passed through a text-to-speech API to generate the assistant's verbal response.
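The flow of one spoken turn can be sketched as three chained steps. In the sketch below, the three services are injected as plain functions so that it does not depend on any particular SDK; `TurnDeps`, `runTurn`, and the field names are hypothetical and only illustrate the order of operations.

```typescript
// Hypothetical service interfaces; real implementations would call the
// speech-to-text, vision, and text-to-speech APIs.
interface TurnDeps {
  transcribe: (audio: Uint8Array) => Promise<string>;
  answer: (question: string, gridDataUrl: string) => Promise<string>;
  synthesize: (text: string) => Promise<Uint8Array>;
}

interface TurnResult {
  question: string;   // what the user said, as text
  reply: string;      // the assistant's text answer
  speech: Uint8Array; // the answer rendered as audio
}

// Runs one turn: speech to text, then text plus frame grid to the model,
// then the model's answer to audio. Returns null when nothing was heard.
async function runTurn(
  audio: Uint8Array,
  gridDataUrl: string,
  deps: TurnDeps
): Promise<TurnResult | null> {
  const question = (await deps.transcribe(audio)).trim();
  if (question.length === 0) {
    return null; // skip the model call for silence or empty transcripts
  }
  const reply = await deps.answer(question, gridDataUrl);
  const speech = await deps.synthesize(reply);
  return { question, reply, speech };
}
```

Because each step awaits the previous one, the latency of a turn is the sum of all three calls, which is consistent with the speed limitation noted in this README.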

Stack

  • Next.js with App Router.
  • Vercel AI npm module.
  • OpenAI's Whisper and GPT APIs.

Limitations

  • Speed: The system is slower than the Gemini demo because it has to wait for audio transcription and processing.
  • Accuracy: The assistant may struggle with complex questions and occasionally provide incorrect answers.
  • Turn-Based Interaction: Unlike the Gemini demo, interactions are turn-based, with the user speaking followed by the assistant's response.
  • Image Quality: Low-quality images may cause hallucinations or make it hard for the assistant to answer accurately.

Conclusion

While the experiment may not fully replicate the experience showcased in the Gemini video, it serves as a fun exploration of real-time AI interactions. There is room for improvement in terms of speed, accuracy, and naturalness of interactions.

Contributors

shaikhmubin02