
FLAME: Learning to Navigate with Multimodal LLM in Urban Environments (arXiv:2408.11051)

Home Page: https://flame-sjtu.github.io

License: Apache License 2.0

large-multimodal-models multimodal-large-language-models vision-and-language-navigation vision-language-model embodied-agent

FLAME (Flamingo-Architected Embodied Agent)

This repository contains the code for reproducing our results (to be released later).

📖 Table of Contents

- 👋 Overview
- 🤖 Method Details
- 🛠️ Training and Evaluation
- 📊 Performance
- Citation

👋 Overview

Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel at general conversation, they struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach implements a three-phase tuning technique for effective adaptation to navigation tasks: single perception tuning for street view description, multiple perception tuning for trajectory summarization, and end-to-end training on VLN datasets, with the augmented datasets synthesized automatically. Experimental results demonstrate FLAME's superiority over existing methods, surpassing the state of the art in task completion rate by 7.3% on the Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, representing an advancement towards practical applications of MLLMs in embodied AI.

🤖 Method Details

Built on the Flamingo architecture, FLAME operates autoregressively and handles multiple perceptions without increasing the language context length, keeping end-to-end training and inference efficient (see the sketch below).
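To make the fixed-context claim concrete, here is a minimal sketch of a Flamingo-style gated cross-attention block, assuming PyTorch; the layer sizes, token counts, and gating details are illustrative and not taken from FLAME's released code.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style gated cross-attention: text tokens attend to visual
    features, and a tanh gate initialized at zero lets the pretrained LM
    start out unchanged. Dimensions here are illustrative only."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at init
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # text:   (batch, text_len, d_model) language hidden states
        # vision: (batch, num_obs * tokens_per_obs, d_model) perception tokens
        attended, _ = self.attn(self.norm(text), vision, vision)
        return text + torch.tanh(self.gate) * attended

# Observations enter through cross-attention, not the text sequence,
# so adding street views never lengthens the language context.
text = torch.randn(1, 32, 1024)        # 32 instruction tokens
vision = torch.randn(1, 4 * 64, 1024)  # 4 observations x 64 tokens each
print(GatedCrossAttentionBlock()(text, vision).shape)  # (1, 32, 1024)
```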

Adaptation proceeds in three phases: (1) single perception tuning, where the agent learns to describe individual street views; (2) multiple perception tuning, where it learns to summarize trajectories; and (3) end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. An outline of this schedule is sketched below.
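The following restates the curriculum in code form; the phase labels, dataset names, and the fine_tune helper are hypothetical placeholders, not FLAME's actual training API.

```python
# Hypothetical outline of the three-phase tuning schedule described above.
PHASES = [
    ("single_perception", "street-view captions", "describe one street view"),
    ("multiple_perception", "trajectory summaries", "summarize an observation sequence"),
    ("end_to_end", "VLN data (Touchdown / Map2seq)", "follow navigation instructions"),
]

def fine_tune(model, data: str) -> None:
    """Placeholder for one tuning phase (data loading, optimization, ...)."""
    print(f"  tuning on {data}")

def train(model) -> None:
    # Each phase initializes from the previous phase's checkpoint.
    for name, data, goal in PHASES:
        print(f"phase {name}: learn to {goal}")
        fine_tune(model, data)

train(model=None)
```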

๐Ÿ› ๏ธ Training and Evaluation

FLAME is implemented on top of Otter and OpenFlamingo, and training uses DeepSpeed; a minimal setup sketch is given below. Detailed code modules will be released later.
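Since the repository states only that training is based on DeepSpeed, here is a generic DeepSpeed initialization sketch; the stand-in model and every config value are assumptions, not FLAME's actual settings.

```python
# Launch with: deepspeed train.py
import deepspeed
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for the actual FLAME model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# A training step then uses the engine's fused API:
#   loss = engine(batch); engine.backward(loss); engine.step()
```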

📊 Performance

FLAME achieves state-of-the-art results on both the Touchdown and Map2seq datasets. The tables below compare FLAME with previous models on task completion (TC, %, higher is better), shortest-path distance (SPD, lower is better), and normalized dynamic time warping (nDTW, higher is better); a sketch of the nDTW computation is given at the end of this section.

Touchdown Dataset

| Model | TC↑ (Dev) | SPD↓ (Dev) | nDTW↑ (Dev) | TC↑ (Test) | SPD↓ (Test) | nDTW↑ (Test) |
|-------|-----------|------------|-------------|------------|-------------|--------------|
| RCONCAT (2019) | 10.60 | 20.40 | 22.50 | 11.80 | 20.40 | 22.90 |
| GA (2019) | 12.00 | 18.70 | 25.20 | 11.90 | 19.00 | 24.90 |
| VLN-Trans (2021) | 15.00 | 20.30 | 27.00 | 16.20 | 20.80 | 27.80 |
| ARC+L2S (2020) | 19.48 | 17.05 | - | 16.68 | 18.84 | - |
| ORAR (2022) | 30.05 | 11.12 | 45.50 | 29.60 | 11.79 | 45.30 |
| VELMA (2023) | 29.83 | 14.67 | 43.44 | 27.38 | 15.03 | 41.93 |
| PM-VLN (2023) | 33.00 | 23.60 | - | 33.40 | 23.80 | - |
| VLN-Video (2024) | 34.50 | 9.60 | - | 31.70 | 11.20 | - |
| Loc4Plan (2024) | 34.50 | 10.50 | - | 32.90 | 11.50 | - |
| FLAME | 41.28 | 9.14 | 55.96 | 40.20 | 9.53 | 54.56 |

Map2seq Dataset

| Model | TC↑ (Dev) | SPD↓ (Dev) | nDTW↑ (Dev) | TC↑ (Test) | SPD↓ (Test) | nDTW↑ (Test) |
|-------|-----------|------------|-------------|------------|-------------|--------------|
| RCONCAT (2019) | 17.10 | - | 30.70 | 14.70 | - | 27.70 |
| GA (2019) | 18.20 | - | 33.00 | 17.00 | - | 30.10 |
| VLN-Trans (2021) | 18.60 | - | 31.10 | 17.00 | - | 29.50 |
| ORAR (2022) | 49.88 | 5.87 | 62.70 | 47.75 | 6.53 | 62.10 |
| VELMA (2023) | 52.75 | 6.78 | 66.45 | 48.70 | 6.80 | 62.37 |
| Loc4Plan (2024) | 48.00 | 7.00 | - | 45.30 | 7.20 | - |
| FLAME | 56.95 | 5.95 | 71.36 | 52.44 | 5.91 | 67.72 |

FLAME consistently outperforms prior models, demonstrating that MLLMs can significantly surpass specialized VLN models.
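For reference, here is a minimal sketch of the nDTW metric reported above, following the standard definition nDTW = exp(-DTW(pred, gold) / (|gold| * d_th)); the node-distance function and the success threshold d_th are illustrative assumptions, not values from this repository.

```python
import math
import numpy as np

def ndtw(pred_path, gold_path, dist, threshold: float = 3.0) -> float:
    """Normalized DTW: exp(-DTW(pred, gold) / (len(gold) * threshold))."""
    n, m = len(pred_path), len(gold_path)
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(pred_path[i - 1], gold_path[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return math.exp(-dtw[n, m] / (m * threshold))

# Example on 2D coordinates with Euclidean node distance:
euclid = lambda a, b: math.dist(a, b)
print(ndtw([(0, 0), (1, 0), (2, 1)], [(0, 0), (1, 0), (2, 0)], euclid))
```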

Citation

If you find our research useful, please cite our paper:

@article{xu2024flame,
  title={FLAME: Learning to Navigate with Multimodal LLM in Urban Environments},
  author={Xu, Yunzhe and Pan, Yiyuan and Liu, Zhe and Wang, Hesheng},
  journal={arXiv preprint arXiv:2408.11051},
  year={2024}
}
