derrickxunu / cobevt

[CoRL2022] CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers

License: Apache License 2.0

Python 99.11% Cython 0.89%
autonomous-driving autonomous-vehicles bev-perception collaborative-perception computer-vision multi-agent-perception nuscenes segmentation semantic semantic-segmentation

cobevt's Introduction

I am a Research Scientist at Waymo, working on some of the most exciting research projects in Autonomous Driving. I received my PhD from UCLA in 2.5 years, with publications in CVPR/ECCV/ICCV/TPAMI/CoRL/ICRA. I was previously a senior deep learning engineer at Mercedes-Benz R&D North America (MBRDNA) and a computer vision engineer at OPPO R&D US from 2018 to 2020.

🔭 Research-wise, I mainly focus on:

  • Autonomous Driving topics related to Perception, Simulation, and Cooperative Driving Automation
  • Generative AI topics related to LLMs, Diffusion, and World Models
  • Computer Vision topics related to Vision Transformers

😄 I am open to:

  • collaboration opportunities (anytime & anywhere & any type) and
  • research internships

📫 Contact me by:


cobevt's People

Contributors

derrickxunu, vztu


cobevt's Issues

Evaluating IoU on NuScenes Dataset

I am trying to produce IoU values on the nuScenes dataset. I am new to machine learning and was hoping you could provide a script to test the model on the nuScenes track. I have been able to set up the dataset and run the benchmark.py file to measure the inference speed, but I am unsure how to produce the IoU score reported in your paper. Thank you.
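For reference, below is a minimal sketch of how a per-class IoU could be computed from predicted and ground-truth BEV class maps. The function name, array shapes, and class encoding are assumptions for illustration only; this is not the repository's evaluation code.

    # Minimal sketch: per-class IoU for BEV semantic segmentation.
    # Assumes `pred` and `gt` are integer class maps of shape (H, W);
    # this is NOT the repository's official evaluation script.
    import numpy as np

    def bev_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
        ious = np.full(num_classes, np.nan)
        for c in range(num_classes):
            pred_c = pred == c
            gt_c = gt == c
            union = np.logical_or(pred_c, gt_c).sum()
            if union == 0:
                continue  # class absent in both prediction and ground truth
            ious[c] = np.logical_and(pred_c, gt_c).sum() / union
        return ious

    # Example with random maps
    pred = np.random.randint(0, 2, size=(200, 200))
    gt = np.random.randint(0, 2, size=(200, 200))
    print(bev_iou(pred, gt, num_classes=2))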

Result of opv2v lidar track

Thanks for the work. Can you share how to extract the results of the OPV2V lidar track (is there code to extract the bounding boxes)?

Attempt at transferring to a detection task

Thank you for sharing the code. I tried to replace the segmentation head of CoBEVT with a simple one-stage detection head, but it hardly works: the regression branch does not converge. Have you ever tried using CoBEVT for a detection task, and do you think it is reasonable to expect it to work?
I would really appreciate a reply. Thank you!
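For context, here is a minimal sketch of what a simple one-stage, anchor-free detection head on a BEV feature map might look like. The channel sizes, class count, and regression target encoding are assumptions, not the head tried in this issue. In practice the regression branch usually only converges when targets are normalized (e.g. offsets in cell units, log-scaled sizes) and the regression loss is masked to positive cells.

    # Minimal sketch of a one-stage detection head on BEV features
    # (hypothetical, not the head referenced in this issue).
    import torch
    import torch.nn as nn

    class SimpleBEVDetHead(nn.Module):
        def __init__(self, in_channels: int = 128, num_classes: int = 1):
            super().__init__()
            self.cls_head = nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, num_classes, 1))
            # per-cell regression of (dx, dy, log_w, log_l, sin_yaw, cos_yaw)
            self.reg_head = nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, 6, 1))

        def forward(self, bev_feat: torch.Tensor):
            # bev_feat: (B, C, H, W) fused BEV feature map
            return self.cls_head(bev_feat), self.reg_head(bev_feat)

    head = SimpleBEVDetHead()
    cls_logits, box_reg = head(torch.rand(1, 128, 256, 256))
    print(cls_logits.shape, box_reg.shape)  # (1, 1, 256, 256) (1, 6, 256, 256)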

How to get the result for different numbers of dropped cameras?

Hello! I want to do the ablation study on dropped cameras, but I do not know how to change the code to drop them. Could you please tell me how to randomly drop cameras? Thanks for your patience. <( _ _ )>
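For reference, a minimal sketch of one way to randomly drop cameras, assuming the images are batched as (B, N_cam, C, H, W). The tensor layout and function name are assumptions for illustration, not the repository's data pipeline.

    # Minimal sketch: randomly drop cameras for an ablation (hypothetical layout).
    import torch

    def drop_cameras(images: torch.Tensor, num_drop: int) -> torch.Tensor:
        """Zero out `num_drop` randomly chosen camera views per sample."""
        b, n_cam = images.shape[:2]
        out = images.clone()
        for i in range(b):
            drop_idx = torch.randperm(n_cam)[:num_drop]
            out[i, drop_idx] = 0.0  # alternatively, remove these views entirely
        return out

    images = torch.rand(2, 4, 3, 512, 512)   # B, N_cam, C, H, W
    print(drop_cameras(images, num_drop=1).shape)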

the calculation of camera_extrinsic

Hi, I noticed that the function reform_camera_param in the file BaseDataset involves three variables related to extrinsic parameters: camera_extrinsic, camera_extrinsic_to_ego_lidar, and camera_extrinsic_to_ego.

I have the following questions:

  1. What's the difference between them?
  2. If I want to fuse lidar and image features, which extrinsic should I use, and do I need to convert the coordinates from UE4's coordinate system to the standard camera coordinate system, as project_3d_to_camera in camera_utils does?

Thank you~
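As general background, below is a minimal sketch of projecting lidar points into an image given a 4×4 lidar-to-camera extrinsic and a 3×3 intrinsic. The coordinate conventions and values here are assumptions for illustration and do not reproduce the repository's reform_camera_param / project_3d_to_camera logic.

    # Minimal sketch: project lidar points into a camera image (illustrative only).
    import numpy as np

    def project_points(points_lidar: np.ndarray,
                       T_cam_from_lidar: np.ndarray,
                       K: np.ndarray) -> np.ndarray:
        # points_lidar: (N, 3) -> homogeneous (N, 4)
        pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
        pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]  # (N, 3) in camera frame
        pts_cam = pts_cam[pts_cam[:, 2] > 0]             # keep points in front of the camera
        uv = (K @ pts_cam.T).T
        return uv[:, :2] / uv[:, 2:3]                    # pixel coordinates

    pts = np.random.rand(10, 3) * 10
    T = np.eye(4)                                        # hypothetical extrinsic
    K = np.array([[800., 0., 400.], [0., 800., 300.], [0., 0., 1.]])
    print(project_points(pts, T, K).shape)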

Clarification Needed on Fused Axial Attention in FAX module

In the local-global attention block of the CrossViewSwapAttention class, I noticed that there are two rearrange operations applied to the key tensor. From my understanding, these two operations seem to cancel each other out: they appear to reshape the key tensor into a global feature map and then back into the original window-partitioned shape. Could you explain the purpose of these operations? Why does the key tensor need to be reshaped twice in this way?

    # local-to-local cross-attention
    query = rearrange(query, 'b n d (x w1) (y w2) -> b n x y w1 w2 d',
                      w1=self.q_win_size[0], w2=self.q_win_size[1])  # window partition
    key = rearrange(key, 'b n d (x w1) (y w2) -> b n x y w1 w2 d',
                      w1=self.feat_win_size[0], w2=self.feat_win_size[1])  # window partition
    val = rearrange(val, 'b n d (x w1) (y w2) -> b n x y w1 w2 d',
                      w1=self.feat_win_size[0], w2=self.feat_win_size[1])  # window partition
    query = rearrange(self.cross_win_attend_1(query, key, val,
                                            skip=rearrange(x,
                                                        'b d (x w1) (y w2) -> b x y w1 w2 d',
                                                         w1=self.q_win_size[0], w2=self.q_win_size[1]) if self.skip else None),
                   'b x y w1 w2 d  -> b (x w1) (y w2) d')    # reverse window to feature (restore the original shape)

    query = query + self.mlp_1(self.prenorm_1(query))

    x_skip = query
    query = repeat(query, 'b x y d -> b n x y d', n=n)              # b n x y d

    # local-to-global cross-attention
    query = rearrange(query, 'b n (x w1) (y w2) d -> b n x y w1 w2 d',
                      w1=self.q_win_size[0], w2=self.q_win_size[1])  # window partition
    # Todo: aren't these two rearranges canceling each other out?
    key = rearrange(key, 'b n x y w1 w2 d -> b n (x w1) (y w2) d')  # reverse window to feature
    key = rearrange(key, 'b n (w1 x) (w2 y) d -> b n x y w1 w2 d',
                    w1=self.feat_win_size[0], w2=self.feat_win_size[1])  # grid partition
    val = rearrange(val, 'b n x y w1 w2 d -> b n (x w1) (y w2) d')  # reverse window to feature
    val = rearrange(val, 'b n (w1 x) (w2 y) d -> b n x y w1 w2 d',
                    w1=self.feat_win_size[0], w2=self.feat_win_size[1])  # grid partition
    query = rearrange(self.cross_win_attend_2(query,
                                              key,
                                              val,
                                              skip=rearrange(x_skip,
                                                        'b (x w1) (y w2) d -> b x y w1 w2 d',
                                                        w1=self.q_win_size[0],
                                                        w2=self.q_win_size[1])
                                              if self.skip else None),
                   'b x y w1 w2 d  -> b (x w1) (y w2) d')  # reverse grid to feature
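As a side note, the two reshapes of the key tensor are probably not a no-op: the first uses the window partition pattern '(x w1) (y w2)' (contiguous blocks), while the second uses the grid partition pattern '(w1 x) (w2 y)' (strided sampling). A minimal einops sketch, independent of the repository's code, shows that the two groupings select different elements:

    # Minimal sketch: window partition '(x w1)' vs grid partition '(w1 x)' in einops.
    import torch
    from einops import rearrange

    feat = torch.arange(16).reshape(1, 1, 1, 4, 4)  # b n d H W

    windows = rearrange(feat, 'b n d (x w1) (y w2) -> b n x y w1 w2 d', w1=2, w2=2)
    restored = rearrange(windows, 'b n x y w1 w2 d -> b n (x w1) (y w2) d')
    grid = rearrange(restored, 'b n (w1 x) (w2 y) d -> b n x y w1 w2 d', w1=2, w2=2)

    # The first 2x2 "window" differs between the two partitions:
    print(windows[0, 0, 0, 0, :, :, 0])  # contiguous block: [[0, 1], [4, 5]]
    print(grid[0, 0, 0, 0, :, :, 0])     # strided block:    [[0, 2], [8, 10]]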

the function of com_mask

Hi, thanks for your work. I wonder what the function of 'com_mask' is in the opv2v track task, and is this variable required?

The related code is:

    com_mask = mask.unsqueeze(1).unsqueeze(2).unsqueeze(3) \
        if not self.use_roi_mask \
        else get_roi_and_cav_mask(x.shape,
                                  mask,
                                  transformation_matrix,
                                  self.discrete_ratio,
                                  self.downsample_rate)

What is the range of detection performance evaluation for OPV2V Camera Track?

Dear Runsheng,
Thank you for your amazing work! I have a minor question about the experiments. Could you help me with the following issue?
I know the evaluation range is [±140, ±40] m for the lidar track, but I am not sure whether the same range applies to the camera case in CoBEVT.
For nuScenes, the range is a 100 m × 100 m area.
Thank you for your attention!

FAX local global

Why, when computing local attention, does the feature map need to be reshaped into (H/P × W/P, N × P², C), i.e. with P² placed in the second-to-last dimension?

Whereas for global attention, the feature map is first reshaped into (N × G², H/G × W/G, C) and then the second-to-last and third-to-last dimensions are swapped, giving (H/G × W/G, N × G², C). Since this final form is the same as the local one, why not apply the same reshape directly instead of performing an extra dimension swap?

Issue with reproducing the results

Thanks for sharing the code. I trained the model using the command: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --use_env opencood/tools/train_camera.py --hypes_yaml opencood/hypes_yaml/opcamera/corpbevt.yaml.
The resulting IoU is about 46.2%, while the paper reports about 60.4%. Do you know why my reproduced result is so much lower?

How to modify the model for a smaller input size

I changed the input image size from 512×512 to 256×256, but this causes a mismatch in the qkv sizes. How should I modify the code?
    Traceback (most recent call last):
      File "opencood/tools/train_camera.py", line 241, in <module>
        main()
      File "opencood/tools/train_camera.py", line 152, in main
        ouput_dict = model(batch_data['ego'])
      File "/home/cylunbu/anaconda3/envs/cobevt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/cylunbu/CoBEVT/opv2v/opencood/models/corpbevt.py", line 114, in forward
        x = self.fax(batch_dict)
      File "/home/cylunbu/anaconda3/envs/cobevt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/cylunbu/CoBEVT/opv2v/opencood/models/sub_modules/fax_modules.py", line 513, in forward
        x = cross_view(i, x, self.bev_embedding, feature, I_inv, E_inv)
      File "/home/cylunbu/anaconda3/envs/cobevt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/cylunbu/CoBEVT/opv2v/opencood/models/sub_modules/fax_modules.py", line 408, in forward
        w1=self.q_win_size[0], w2=self.q_win_size[1]) if self.skip else None),
      File "/home/cylunbu/anaconda3/envs/cobevt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/cylunbu/CoBEVT/opv2v/opencood/models/sub_modules/fax_modules.py", line 208, in forward
        assert q_height * q_width == kv_height * kv_width
    AssertionError
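For what it's worth, the assertion at fax_modules.py line 208 requires the number of query (BEV) windows to equal the number of key/value (image-feature) windows. Halving the input image presumably halves the image feature map but not the BEV grid, so the window counts diverge unless the window sizes in the config (e.g. q_win_size / feat_win_size) are rescaled as well. A minimal arithmetic sketch, with purely hypothetical sizes:

    # Minimal sketch of the window-count bookkeeping behind the assertion
    # (all concrete sizes below are hypothetical).
    def num_windows(h, w, win):
        return (h // win[0]) * (w // win[1])

    bev_h = bev_w = 256          # query (BEV) grid, unchanged by the image size
    feat_h = feat_w = 64         # image feature map at 512x512 input
    q_win, feat_win = (8, 8), (2, 2)

    # 512x512 input: window counts match, so the assertion passes
    assert num_windows(bev_h, bev_w, q_win) == num_windows(feat_h, feat_w, feat_win)

    # 256x256 input halves only the image feature map, so the counts diverge
    feat_h = feat_w = 32
    print(num_windows(bev_h, bev_w, q_win),       # 1024
          num_windows(feat_h, feat_w, feat_win))  # 256 -> AssertionError in the model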
