MonoLS: Multi-Scale Feature Fusion and Spatially-Aware Attention for Monocular 3D Object Detection
3D object detection plays a pivotal role in comprehensive scene understanding for autonomous driving systems, and one of its key challenges is accurate perception in complex environments. Compared with LiDAR and stereo-vision approaches, monocular camera-based solutions are more cost-effective and easier to deploy; however, the absence of explicit depth cues in monocular images hinders accurate 3D bounding-box localization. This work proposes MonoLS, a monocular 3D object detection framework that incorporates lightweight multi-scale feature fusion and spatially-aware attention, aiming to address missing depth information while achieving precise object localization. First, the lightweight multi-scale feature fusion module combines deep and shallow features, enabling effective multi-scale feature extraction without compromising real-time detection. Second, the spatially-aware attention module employs a dual-branch structure: the spatial branch uses triplet attention to capture spatial details, while the context branch aggregates global context through global attention. The two branches are then fused to produce enhanced feature representations that preserve both spatial distribution and semantic richness. Finally, experiments on the KITTI dataset demonstrate that our method outperforms the baseline while achieving real-time inference at up to 67 FPS.
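The dual-branch design described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the channel pooling in the spatial branch follows the Z-pool (max + mean) idea from triplet attention, the context branch uses a softmax-weighted global descriptor as a stand-in for global attention, and fusing the branches by summation is an assumption.

```python
import numpy as np

def spatial_branch(x):
    # x: (C, H, W). Triplet-attention-style spatial gate: pool across
    # channels with max + mean (the "Z-pool"), then apply a sigmoid
    # gate that reweights every spatial position.
    z = np.stack([x.max(axis=0), x.mean(axis=0)], axis=0)  # (2, H, W)
    gate = 1.0 / (1.0 + np.exp(-z.mean(axis=0)))           # (H, W)
    return x * gate                                        # broadcast over C

def context_branch(x):
    # Global-context aggregation: softmax over spatial positions yields
    # per-position weights; the weighted sum is a global context vector
    # added back to every position.
    C, H, W = x.shape
    flat = x.reshape(C, H * W)
    logits = flat.mean(axis=0)            # (H*W,) position scores
    w = np.exp(logits - logits.max())
    w /= w.sum()                          # softmax attention weights
    ctx = flat @ w                        # (C,) global context vector
    return x + ctx[:, None, None]         # broadcast add over H, W

def dual_branch_attention(x):
    # Fuse the spatial and context branches; summation here is an
    # assumption, not the fusion scheme specified in the paper.
    return spatial_branch(x) + context_branch(x)

feat = np.random.rand(8, 16, 16).astype(np.float32)
out = dual_branch_attention(feat)
print(out.shape)  # (8, 16, 16) -- shape is preserved
```

Both branches preserve the input shape, so the enhanced features can be dropped into an existing detection head without changing downstream dimensions.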
Added 2026-04-21