EMIFF: Enhanced Multi-scale Image Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

Project page | Paper | VIMI

EMIFF: Enhanced Multi-scale Image Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection. Zhe Wang, Siqi Fan, Xiaoliang Huo, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang. ICRA 2024.

This repository contains the official PyTorch implementation of the training and evaluation code, along with the pretrained models, for EMIFF/VIMI.

Abstract

In autonomous driving, cooperative perception makes use of multi-view cameras from both vehicles and infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Currently, two major challenges persist in vehicle-infrastructure cooperative 3D (VIC3D) object detection: 1) inherent pose errors when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss during transmission, resulting from limited communication bandwidth. To address these issues, we propose a novel camera-based 3D detection framework for the VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF). To fully exploit holistic perspectives from both vehicles and infrastructure, we propose Multi-scale Cross Attention (MCA) and Camera-aware Channel Masking (CCM) modules that enhance infrastructure and vehicle features at the scale, spatial, and channel levels to correct the pose error introduced by camera asynchrony. We also introduce a Feature Compression (FC) module with channel and spatial compression blocks for transmission efficiency. Experiments show that EMIFF achieves SOTA on the DAIR-V2X-C dataset, significantly outperforming previous early-fusion and late-fusion methods with comparable transmission costs.
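To make the transmission-cost trade-off behind the FC module concrete, the sketch below estimates how channel and spatial compression shrink the payload of a single feature map. It is only back-of-the-envelope arithmetic, not the module itself: the function name `feature_payload_bytes`, the feature shape, and the compression ratios are all hypothetical and are not taken from the paper or this repository.

```python
def feature_payload_bytes(channels, height, width,
                          channel_ratio=1, spatial_stride=1,
                          bytes_per_value=4):
    """Estimate the size in bytes of a feature map after compression.

    channel_ratio   -- factor by which the channel count is reduced (assumed)
    spatial_stride  -- downsampling stride applied to height and width (assumed)
    bytes_per_value -- 4 for float32 values
    """
    if channel_ratio < 1 or spatial_stride < 1:
        raise ValueError("compression factors must be >= 1")
    # Round up so that a partial block still counts as one output element.
    c = -(-channels // channel_ratio)
    h = -(-height // spatial_stride)
    w = -(-width // spatial_stride)
    return c * h * w * bytes_per_value


# Hypothetical infrastructure feature map: 64 channels, 96 x 160, float32.
raw = feature_payload_bytes(64, 96, 160)
compressed = feature_payload_bytes(64, 96, 160,
                                   channel_ratio=4, spatial_stride=2)
print(raw, compressed, raw / compressed)
```

With these assumed settings, a 4x channel reduction combined with a stride-2 spatial reduction cuts the payload by 4 x 2 x 2 = 16 times, which illustrates why compressing along both axes matters when bandwidth is limited.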

Methods

Get Started

Benchmark and Model Zoo

Modality: Image

Fusion	Method	Dataset	AP-3D (IoU=0.5)	AP-BEV (IoU=0.5)	Config	Download
Only-Veh	ImVoxelNet	VIC-Sync	7.29	8.85	config	\
Only-Inf	ImVoxelNet	VIC-Sync	8.66	14.41	config	\
Late-Fusion	ImVoxelNet	VIC-Sync	11.08	14.76	\	\
Early-Fusion	BEVFormer_S	VIC-Sync	8.80	13.45	config	model/log
Early-Fusion	ImVoxelNet	VIC-Sync	12.72	18.17	config	model/log
Intermediate-Fusion	EMIFF	VIC-Sync	15.61	21.44	config	model/log

We evaluate the Only-Veh, Only-Inf, and Late-Fusion models following OpenDAIRV2X.
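For a quick read of the table, the short script below computes EMIFF's absolute gain over each baseline. The AP values are copied from the table above; the dictionary keys and the helper name `margins` are introduced here purely for illustration.

```python
# AP values (IoU=0.5) on VIC-Sync, copied from the table above.
results = {
    "Only-Veh / ImVoxelNet": (7.29, 8.85),
    "Only-Inf / ImVoxelNet": (8.66, 14.41),
    "Late-Fusion / ImVoxelNet": (11.08, 14.76),
    "Early-Fusion / BEVFormer_S": (8.80, 13.45),
    "Early-Fusion / ImVoxelNet": (12.72, 18.17),
}
emiff = (15.61, 21.44)  # (AP-3D, AP-BEV)


def margins(baseline, ours=emiff):
    """Return (AP-3D gain, AP-BEV gain) in absolute points, rounded to 2 decimals."""
    return tuple(round(o - b, 2) for o, b in zip(ours, baseline))


gains = {name: margins(ap) for name, ap in results.items()}
for name, (d3d, dbev) in gains.items():
    print(f"{name}: +{d3d:.2f} AP-3D, +{dbev:.2f} AP-BEV")
```

The smallest margin is over the strongest baseline, Early-Fusion ImVoxelNet (+2.89 AP-3D, +3.27 AP-BEV), and the margin over Late-Fusion is +4.53 AP-3D and +6.68 AP-BEV.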

Acknowledgement

This project would not be possible without the following codebases.

Citation

If you find our work useful in your research, please consider citing:

@misc{wang2023vimi,
      title={VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection}, 
      author={Zhe Wang and Siqi Fan and Xiaoliang Huo and Tongda Xu and Yan Wang and Jingjing Liu and Yilun Chen and Ya-Qin Zhang},
      year={2023},
      eprint={2303.10975},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@inproceedings{wang2024emiff,
      title={EMIFF: Enhanced Multi-scale Image Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection}, 
      author={Zhe Wang and Siqi Fan and Xiaoliang Huo and Tongda Xu and Yan Wang and Jingjing Liu and Yilun Chen and Ya-Qin Zhang},
      booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
      year = {2024}
}
