
GLaMM: Pixel Grounding Large Multimodal Model [CVPR 2024]

Hanoona Rasheed*, Muhammad Maaz*, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan

* Equally contributing first authors

Mohamed bin Zayed University of AI, Australian National University, Aalto University, Carnegie Mellon University, University of California - Merced, Linköping University, Google Research

Links: Demo · Project Page (https://mbzuai-oryx.github.io/groundingLMM/) · Paper · Video


📢 Latest Updates

  • Feb-27-24: We're thrilled to share that GLaMM has been accepted to CVPR 2024! 🎊
  • Dec-27-23: GLaMM training and evaluation code, pretrained checkpoints, and the GranD-f dataset are released (click for details). 🔥🔥
  • Nov-29-23: The GLaMM online interactive demo is released (demo link). 🔥
  • Nov-07-23: The GLaMM paper is released (arXiv link). 🌟
  • 🌟 Featured: GLaMM is now highlighted at the top of AK's Daily Papers page on Hugging Face! 🌟

GLaMM Overview

Grounding Large Multimodal Model (GLaMM) is an end-to-end trained LMM that provides visual grounding capabilities with the flexibility to process both image and region inputs. This enables the new unified task of Grounded Conversation Generation (GCG), which combines phrase grounding, referring expression segmentation, and vision-language conversations. With detailed region understanding, pixel-level grounding, and conversational abilities, GLaMM can interact with visual inputs provided by the user at multiple levels of granularity.


🏆 Contributions

  • GLaMM Introduction. We present the Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

  • Novel Task & Evaluation. We propose a new task of Grounded Conversation Generation (GCG). We also introduce a comprehensive evaluation protocol for this task.

  • GranD Dataset Creation. We create GranD (the Grounding-anything Dataset), a large-scale, densely annotated dataset with 7.5M unique concepts grounded in 810M regions.


🚀 Dive Deeper: Inside GLaMM's Training and Evaluation

Delve into the core of GLaMM with our detailed guides on the model's Training and Evaluation methodologies.

  • Installation: Provides a guide to setting up the conda environment for running GLaMM training, evaluation, and the demo.

  • Datasets: Provides detailed instructions to download and arrange datasets required for training and evaluation.

  • Model Zoo: Provides downloadable links to all pretrained GLaMM checkpoints.

  • Training: Provides instructions on how to train the GLaMM model for its various capabilities, including Grounded Conversation Generation (GCG), Region-level captioning, and Referring Expression Segmentation.

  • Evaluation: Outlines the procedures for evaluating the GLaMM model using pretrained checkpoints, covering Grounded Conversation Generation (GCG), Region-level captioning, and Referring Expression Segmentation, as reported in our paper.

  • Demo: Guides you through setting up a local demo to showcase GLaMM's functionalities.
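
As a quick illustration of what a demo surface needs to do, the sketch below overlays per-phrase segmentation masks on an image. It is a minimal, hedged example assuming masks arrive as HxW boolean numpy arrays; it is not the repository's actual demo code.

  # Minimal sketch: overlay per-phrase segmentation masks on an image.
  # Assumes masks are HxW boolean numpy arrays; illustrative only,
  # not the repository's actual demo code.
  import numpy as np
  from PIL import Image

  def overlay_masks(image, masks, alpha=0.5):
      """Blend a distinct color over each mask region of the image."""
      colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]
      out = np.array(image.convert("RGB"), dtype=np.float32)
      for i, mask in enumerate(masks):
          color = np.array(colors[i % len(colors)], dtype=np.float32)
          out[mask] = (1 - alpha) * out[mask] + alpha * color
      return Image.fromarray(out.astype(np.uint8))

  # Usage with a dummy image and a single dummy mask.
  img = Image.new("RGB", (64, 64), "gray")
  mask = np.zeros((64, 64), dtype=bool)
  mask[16:48, 16:48] = True
  overlay_masks(img, [mask]).save("overlay.png")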

👁️💬 GLaMM: Grounding Large Multimodal Model

The components of GLaMM are cohesively designed to handle both textual and optional visual prompts (image-level and region-of-interest inputs), allowing interaction at multiple levels of granularity and generating grounded text responses.
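
Concretely, a request to such a model can be thought of as a text prompt plus an optional image and optional regions of interest. The structure below is a hypothetical sketch of that input shape, not GLaMM's actual API; the normalized (x1, y1, x2, y2) box convention is an assumption.

  # Hypothetical sketch of a multimodal request: text plus optional
  # image-level and region-of-interest inputs. Not GLaMM's actual API.
  from dataclasses import dataclass, field
  from typing import Optional

  @dataclass
  class GroundedRequest:
      prompt: str                       # e.g. "Can you describe this region?"
      image_path: Optional[str] = None  # optional image-level input
      # Optional region inputs as (x1, y1, x2, y2) boxes normalized to
      # [0, 1]; the coordinate convention here is an assumption.
      regions: list = field(default_factory=list)

  req = GroundedRequest(
      prompt="Can you describe this region in detail?",
      image_path="example.jpg",
      regions=[(0.1, 0.2, 0.5, 0.7)],
  )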

GLaMM Architectural Overview


🔍 Grounding-anything Dataset (GranD)

The GranD dataset is a large-scale dataset built with an automated annotation pipeline for detailed region-level understanding and segmentation masks. GranD comprises 7.5M unique concepts anchored in a total of 810M regions, each with a segmentation mask.
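
Datasets at this scale typically store masks as run-length encodings (RLE) rather than raw pixels. The snippet below is a minimal pure-numpy decoder for an uncompressed, COCO-style, column-major RLE; the format is an assumption for illustration, not the actual GranD loading code.

  # Minimal sketch: decode an uncompressed COCO-style RLE into a binary mask.
  # Assumes column-major (Fortran-order) runs that start with background,
  # as in COCO; illustrative only, not the actual GranD loader.
  import numpy as np

  def rle_decode(counts, height, width):
      flat = np.zeros(height * width, dtype=np.uint8)
      pos, val = 0, 0  # runs alternate: background, foreground, background, ...
      for run in counts:
          flat[pos:pos + run] = val
          pos += run
          val = 1 - val
      return flat.reshape((height, width), order="F")

  # A 4x4 mask: 5 background pixels, 6 foreground, 5 background (column-major).
  print(rle_decode([5, 6, 5], 4, 4))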

Dataset Annotation Pipeline


Below we present some examples of the GranD dataset.

GranD Dataset Sample

GranD Dataset Sample


📚 Building GranD-f for Grounded Conversation Generation

GranD-f is designed for the GCG task; it provides about 214K image-grounded text pairs of higher-quality data for the fine-tuning stage.

GranD-f Dataset Sample


🤖 Grounded Conversation Generation (GCG)

Introducing GCG, a task that generates image-level captions in which phrases are tied to segmentation masks, strengthening the model's visual grounding in natural language captioning.
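
Models producing grounded captions commonly tag groundable phrases inline and emit one segmentation token per phrase. The parser below sketches one plausible tagging scheme, "<p> ... </p> [SEG]", for pairing phrases with the masks a model emits in order; the exact tokens are an assumption, not a confirmed GLaMM output specification.

  # Minimal sketch: pair grounded phrases with their mask slots in a
  # GCG-style output string. The "<p> ... </p> [SEG]" scheme is an assumption.
  import re

  GROUNDED = re.compile(r"<p>\s*(.*?)\s*</p>\s*\[SEG\]")

  def parse_grounded_caption(text):
      """Return the plain caption and the grounded phrases, in the order
      their corresponding [SEG] masks would be emitted."""
      phrases = GROUNDED.findall(text)
      plain = GROUNDED.sub(r"\1", text)
      return plain, phrases

  out = "<p> A man </p> [SEG] is holding <p> a red umbrella </p> [SEG]."
  caption, phrases = parse_grounded_caption(out)
  print(caption)  # A man is holding a red umbrella.
  print(phrases)  # ['A man', 'a red umbrella']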

Results_GCG

GCG_Table


🚀 Downstream Applications

🎯 Referring Expression Segmentation

Our model excels in creating segmentation masks from text-based referring expressions.
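
Referring segmentation is usually scored with intersection-over-union between predicted and ground-truth masks. The snippet below is a generic sketch of per-mask IoU and the cumulative IoU (cIoU) commonly reported for this task; it is not the repository's evaluation code.

  # Generic sketch of referring-segmentation metrics: per-mask IoU and
  # cumulative IoU (cIoU). Not the repository's evaluation code.
  import numpy as np

  def mask_iou(pred, gt):
      union = np.logical_or(pred, gt).sum()
      return np.logical_and(pred, gt).sum() / union if union else 0.0

  def cumulative_iou(preds, gts):
      """cIoU: total intersection over total union across all samples."""
      inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
      union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
      return inter / union if union else 0.0

  pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
  gt = np.zeros((8, 8), dtype=bool); gt[3:7, 3:7] = True
  print(mask_iou(pred, gt))  # 9 / 23 ≈ 0.391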

Results_RefSeg

Table_RefSeg


🖼️ Region-Level Captioning

GLaMM generates detailed region-specific captions and answers reasoning-based visual questions.

Results_RegionCap

Table_RegionCap


📷 Image Captioning

GLaMM provides high-quality image captions, comparing favorably to specialized captioning models.

Results_Cap


💬 Conversational Style Question Answering

GLaMM demonstrates its prowess in engaging in detailed, region-specific, and grounded conversations. This highlights its adaptability in intricate visual-language interactions while robustly retaining the reasoning capabilities inherent to LLMs.

Results_Conv




📜 Citation

  @article{hanoona2023GLaMM,
          title={GLaMM: Pixel Grounding Large Multimodal Model},
          author={Rasheed, Hanoona and Maaz, Muhammad and Shaji, Sahal and Shaker, Abdelrahman and Khan, Salman and Cholakkal, Hisham and Anwer, Rao M. and Xing, Eric and Yang, Ming-Hsuan and Khan, Fahad S.},
          journal={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          year={2024}
  }

🙏 Acknowledgement

We are thankful to LLaVA, GPT4RoI, and LISA for releasing their models and code as open-source contributions.

