Multimodal-time-series System for Off-road Freespace Efficient-detection

doi:10.3901/JME.2025.10.276

Abstract

Abstract: Freespace detection provides essential support for trajectory prediction and path planning in autonomous driving, but the complexity of off-road environments and the irregularity of drivable-area boundaries limit gains in detection accuracy, real-time performance, and generalization. To address these challenges, an efficient multimodal multi-sequence freespace detection network is proposed. The network combines the strengths of Transformers and CNNs: adaptive spatial and temporal gating units fuse and enhance multimodal multi-sequence information from RGB images and LiDAR data, and an M-IDA module refines the output features to further improve detection accuracy and generalization. In addition, linear Transformer encoding and depthwise separable convolutions reduce computational complexity and enable efficient inference. Four sets of comparative experiments on the ORFD dataset show that, relative to the baseline model, the proposed network improves accuracy, F1-score, and IoU by 2.3%, 1.2%, and 2%, respectively, and reduces inference time by 40.8%. Ablation experiments validate the effectiveness of each component, and evaluations on the ORFD test set, the KITTI Road dataset, and real-vehicle data further confirm the network's accuracy, efficiency, and generalization.
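The efficiency claim rests on two well-known substitutions: depthwise separable convolutions in place of standard convolutions, and linear attention in place of softmax attention. The following is a minimal back-of-the-envelope sketch of why each reduces cost. The layer sizes (`k`, `c_in`, `c_out`, `n`, `d`) are hypothetical values chosen for illustration and are not taken from the paper, and the attention counts are rough multiply estimates rather than the authors' exact formulation.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def softmax_attention_cost(n, d):
    """Rough multiply count for Q K^T and (scores) V: quadratic in token count n."""
    return 2 * n * n * d


def linear_attention_cost(n, d):
    """Rough multiply count when K^T V (a d x d matrix) is formed first: linear in n."""
    return 2 * n * d * d


# Hypothetical sizes, assumed only for this illustration.
k, c_in, c_out = 3, 64, 128   # kernel size, input channels, output channels
n, d = 4096, 64               # number of tokens, feature dimension per token

conv_ratio = depthwise_separable_params(k, c_in, c_out) / standard_conv_params(k, c_in, c_out)
attn_ratio = linear_attention_cost(n, d) / softmax_attention_cost(n, d)

print(f"separable / standard conv weights: {conv_ratio:.3f}")
print(f"linear / softmax attention multiplies: {attn_ratio:.4f}")
```

Under these assumptions the convolution weight ratio equals 1/c_out + 1/k², and the attention cost ratio equals d/n, so the savings grow with the number of output channels and with the number of tokens, respectively.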

Key words: deep learning, off-road, freespace detection, multi-modal, multi-sequence

CLC number: U495

Li Luxing, Wei Chao, Hu Leyun, Sui Shuxin, Xu Yang, Qian Xinhao. Multimodal-time-series System for Off-road Freespace Efficient-detection[J]. Journal of Mechanical Engineering, 2025, 61(10): 276-287.
