
LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction using LiDAR and Camera

1Zhejiang University, 2Shanghai Artificial Intelligence Laboratory, 3Technical University of Munich
Code (Updated) arXiv

Abstract

Semantic Scene Completion (SSC) is pivotal in autonomous driving perception and is frequently challenged by changes in weather and illumination. A long-term strategy is to fuse multi-modal information to bolster the system's robustness. Radar, increasingly used for 3D object detection, is gradually replacing LiDAR in some autonomous driving applications, offering a robust sensing alternative. In this paper, we explore the potential of 3D radar for semantic scene completion, pioneering cross-modal distillation to achieve balanced performance across all metrics. On the architecture side, we build upon our radar-based baseline and propose a three-stage tight fusion approach in the BEV space, yielding a unified fusion framework for point clouds and images. On this basis, we design three cross-modal distillation modules (CMRD, BRD, and PDD) that transfer the rich semantic and structural information of the LiDAR-camera fusion features to the radar-only and radar-camera settings, producing our R-LiCROcc and RC-LiCROcc models. Finally, our LC-Fusion (teacher model), R-LiCROcc, and RC-LiCROcc achieve the best performance on the nuScenes-Occupancy dataset, exceeding the baseline mIoU by 22.9%, 44.1%, and 15.5%, respectively.
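The distillation setup above, where a frozen LiDAR-camera teacher supervises a radar (or radar-camera) student on BEV features, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, flat-list feature representation, and loss weights are hypothetical, and the actual CMRD, BRD, and PDD modules distill at the feature, relation, and prediction levels with richer losses.

```python
def feature_distill_loss(student_bev, teacher_bev):
    """Mean-squared-error distillation between the student's radar BEV
    features and the frozen LiDAR-camera teacher's BEV features.

    Both inputs are flat lists of floats of equal length (illustrative;
    real BEV features are dense 2-D feature maps).
    """
    if len(student_bev) != len(teacher_bev):
        raise ValueError("feature maps must have the same size")
    n = len(student_bev)
    return sum((s - t) ** 2 for s, t in zip(student_bev, teacher_bev)) / n


def total_loss(task_loss, distill_losses, weights):
    """Combine the occupancy task loss with one weighted distillation term
    per module (weights here are hypothetical tuning parameters)."""
    return task_loss + sum(w * d for w, d in zip(weights, distill_losses))
```

In this sketch the teacher's features are treated as fixed targets (no gradient flows into the teacher), so the student is pulled toward the richer LiDAR-camera representation while still being trained on its own occupancy objective.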

Method

Experiments


Comparison

BibTeX


@ARTICLE{10777549,
  author={Ma, Yukai and Mei, Jianbiao and Yang, Xuemeng and Wen, Licheng and Xu, Weihua and Zhang, Jiangning and Zuo, Xingxing and Shi, Botian and Liu, Yong},
  journal={IEEE Robotics and Automation Letters},
  title={LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction Using LiDAR and Camera},
  year={2025},
  volume={10},
  number={1},
  pages={852--859},
  keywords={Radar;Semantics;Radar imaging;Three-dimensional displays;Laser radar;Feature extraction;Cameras;Sensors;Meteorology;Point cloud compression;Sensor fusion;semantic scene completion;knowledge distillation},
  doi={10.1109/LRA.2024.3511427}}