• CN: 11-2187/TH
  • ISSN: 0577-6686

Journal of Mechanical Engineering ›› 2025, Vol. 61 ›› Issue (23): 58-74. doi: 10.3901/JME.2025.23.058


Chain-of-thought Paradigm Text-based Multimodal Intelligent Agent for Equipment Operations and Maintenance

HUANG Jinfeng1, WANG Chengcheng2, HE Hongliang1,3, WANG Xu4, LI Qi1, YANG Kangding3, WANG Kai2, ZHANG Feibin1, QIN Zhaoye1, CHU Fulei1   

    1. Department of Mechanical Engineering, Tsinghua University, Beijing 100084;
    2. Standards and Testing Center, Instrumentation Technology and Economy Institute, Beijing 100055;
    3. FreqX Intelligence Technology Co., Ltd., Changzhou 213162;
    4. Intelligent Game and Decision Lab, Beijing 100091
  • Received: 2025-01-09 Revised: 2025-09-06 Online: 2025-12-05 Published: 2026-01-22

Abstract: An artificial intelligence architecture for the operation and maintenance (O&M) of mechanical equipment is proposed: a chain-of-thought (CoT) paradigm, text-based multimodal intelligent agent. First, because high-quality, large-scale datasets that map monitoring data to fault modes are difficult to construct in real-world engineering applications, a CoT dataset construction strategy is proposed that links monitoring signals, mathematical features, text descriptions, and fault modes. On this basis, a signal-to-text (Sig2Txt) model driven by a signal-text data generator is developed. Next, a high-quality, specialized text dataset for O&M in the mechanical equipment domain is created, and an O&M-specialized large language model is established by instruction fine-tuning a general large language model. Finally, these models are integrated using large-model intelligent agent technology, guided by the way human experts reason through equipment maintenance tasks, to form a CoT paradigm text-based multimodal intelligent agent for intelligent O&M. Test results indicate that the agent can perform chain-of-thought parsing and mapping from multimodal input to decision-making, with an accuracy exceeding 70% on ISO Category III vibration analyst test questions, thus reaching expert-level performance. In evaluations on engineering cases and publicly available multimodal datasets, the proposed model outperforms existing mainstream large models. More importantly, owing to the proposed low-cost, high-quality framework for constructing large-scale multimodal CoT datasets, and to the advantages of a “text-based” approach in knowledge coverage, high-level abstraction, and interpretability, the model shows considerable scalability and development potential.
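
The chain described in the abstract, from monitoring signal to mathematical features to text description to fault mode, can be illustrated with a minimal sketch. This is not the authors' Sig2Txt model or data generator: the function name `describe_signal`, the choice of features (RMS, peak, crest factor, kurtosis), and the record layout are all assumptions made for illustration only.

```python
import math


def describe_signal(samples, fault_mode=None):
    """Build one hypothetical signal-to-text record.

    Illustrative only: the features and wording are assumptions,
    not the dataset schema used in the paper.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("samples must not be empty")

    # Step 1: monitoring signal -> mathematical features.
    mean = sum(samples) / n
    rms = math.sqrt(sum(x * x for x in samples) / n)
    peak = max(abs(x) for x in samples)
    variance = sum((x - mean) ** 2 for x in samples) / n
    # Non-excess kurtosis (fourth central moment over squared variance).
    kurtosis = (
        sum((x - mean) ** 4 for x in samples) / (n * variance ** 2)
        if variance > 0
        else 0.0
    )
    crest_factor = peak / rms if rms > 0 else 0.0
    features = {
        "rms": rms,
        "peak": peak,
        "crest_factor": crest_factor,
        "kurtosis": kurtosis,
    }

    # Step 2: mathematical features -> text description.
    description = (
        f"RMS amplitude {rms:.3f}, peak {peak:.3f}, "
        f"crest factor {crest_factor:.2f}, kurtosis {kurtosis:.2f}."
    )

    # Step 3: attach the fault-mode label to complete the chain.
    return {
        "features": features,
        "description": description,
        "fault_mode": fault_mode,
    }
```

A record of this kind pairs a numeric summary with a sentence a language model can read, which is the general idea behind a "text-based" multimodal input; the actual features and descriptions used by the authors are not stated in the abstract.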

Key words: intelligent operation and maintenance, multimodal model, large language model, fault diagnosis, intelligent agent
