• CN:11-2187/TH
  • ISSN:0577-6686

Journal of Mechanical Engineering, 2025, Vol. 61, Issue (15): 285-296. doi: 10.3901/JME.2025.15.285

• Human Factors and Embodied Intelligence •


Embodied Augmented Reality Disassembly System for Human-Robot-Environment Integration

LÜ Jianhao, SI Jiahui, BAO Jingsong

  1. College of Mechanical Engineering, Donghua University, Shanghai 201620, China
  • Received: 2024-09-30  Revised: 2024-12-21  Published: 2025-09-28
  • About the authors: LÜ Jianhao, male, born in 1995, is a doctoral candidate. His main research interests include human-robot collaboration, augmented reality, and digital twins. E-mail: c953749@163.com. BAO Jingsong (corresponding author), male, born in 1973, holds a PhD and is a professor and doctoral supervisor. His main research interests include intelligent manufacturing and virtual reality/human-computer interaction technologies. E-mail: bao@dhu.edu.cn
  • Funding:
    Supported by the National Natural Science Foundation of China (Grant No. 52475513).

Abstract: In human-robot collaborative disassembly, manufacturing systems predominantly rely on fixed perception-cognition paradigms governed by pre-established algorithms. This reliance makes it difficult to accommodate operators' flexible, experience-based requirements and the dynamics of the collaborative environment; as a result, robotic path planning often fails and decision-making stalls. To address this, an embodied augmented reality disassembly system for human-robot-environment integration is proposed. The system is grounded in embodied intelligence theory and centers on a "perception-cognition-execution" mechanism, which is combined with augmented reality technology to strengthen the embodied agent's environmental perception and cognitive reasoning. An embodied augmented reality collaborative disassembly strategy is designed; a local image attention model with a context-reinforcement mechanism is proposed for adaptive image captioning; a self-optimizing cognitive reasoning method is developed based on a large language model tuning and inference mechanism; and a robot manipulation method is realized through augmented reality-based human-robot-environment data interaction. Three similarity metrics are constructed to evaluate the performance of embodied perception and cognition. Quantitative and qualitative experiments demonstrate the system's feasibility and effectiveness in enhancing the efficiency and adaptability of human-robot collaborative disassembly.
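The page does not describe the authors' actual models or interfaces, so the Python sketch below only illustrates how a "perception-cognition-execution" loop of the kind summarized in the abstract might be wired together. Every function (perceive, reason, execute), data class, and value is a hypothetical stub, not the paper's implementation.

```python
# Minimal sketch of a "perception-cognition-execution" loop as described in the
# abstract. All function names and return values are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class DisassemblyAction:
    target_part: str
    tool: str


def perceive(rgb_frame) -> str:
    """Stand-in for the context-reinforced local image attention model:
    returns an adaptive caption describing the disassembly scene."""
    return "battery pack cover with four hex screws, operator hand near screw 2"


def reason(caption: str, operator_request: str) -> DisassemblyAction:
    """Stand-in for LLM-based cognitive reasoning: maps the scene caption and
    the operator's request to the next robot action."""
    return DisassemblyAction(target_part="hex screw 2", tool="electric screwdriver")


def execute(action: DisassemblyAction) -> None:
    """Stand-in for AR-mediated manipulation: the action would be visualized in
    the AR headset and dispatched to the robot controller."""
    print(f"Robot: remove {action.target_part} using {action.tool}")


if __name__ == "__main__":
    frame = None  # an RGB image from the workspace camera would go here
    caption = perceive(frame)
    action = reason(caption, operator_request="loosen the remaining screws first")
    execute(action)
```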

Key words: human-robot collaborative disassembly, embodied intelligence, augmented reality, image captioning, large language model
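The three similarity metrics used to evaluate embodied perception and cognition are not named on this page. As a purely illustrative assumption, the sketch below computes one generic candidate, a bag-of-words cosine similarity between a generated caption and a reference description, using only NumPy and the standard library.

```python
# Illustrative caption-to-reference similarity; not one of the paper's metrics.
from collections import Counter

import numpy as np


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between simple bag-of-words vectors of two texts."""
    tokens_a, tokens_b = text_a.lower().split(), text_b.lower().split()
    vocab = sorted(set(tokens_a) | set(tokens_b))
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    vec_a = np.array([count_a[w] for w in vocab], dtype=float)
    vec_b = np.array([count_b[w] for w in vocab], dtype=float)
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(vec_a @ vec_b / denom) if denom else 0.0


generated = "robot gripper approaches the battery cover screws"
reference = "the gripper moves toward the screws on the battery cover"
print(f"caption-reference similarity: {cosine_similarity(generated, reference):.3f}")
```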
