Article

Temporal-Aware Chain-of-Thought Reasoning for Vibration-Based Pump Fault Diagnosis

1 College of Mechanical and Electrical Engineering, Wenzhou University, Wenzhou 325035, China
2 Zhejiang TONGLI Transmission Technology Co., Ltd., Ruian 325205, China
* Author to whom correspondence should be addressed.
Processes 2025, 13(8), 2624; https://doi.org/10.3390/pr13082624
Submission received: 23 July 2025 / Revised: 8 August 2025 / Accepted: 13 August 2025 / Published: 19 August 2025
(This article belongs to the Section Automation Control Systems)

Abstract

Industrial pump systems require real-time fault diagnosis for predictive maintenance, but conventional Chain-of-Thought (CoT) reasoning faces computational bottlenecks when processing high-frequency vibration data. This paper proposes Vibration-Aware CoT (VA-COT), a novel framework that integrates multi-domain feature fusion (time, frequency, and time–frequency) with adaptive reasoning depth control. Key innovations include expert prior-guided dynamic feature selection to optimize edge-device inputs, complexity-aware reasoning chains that reduce computational steps by 40–65% through confidence-based early termination, and lightweight deployment on industrial ARM-based single-board computers (SBCs). Evaluated on a 12-class pump fault dataset (5400 samples from centrifugal and gear pumps), VA-COT achieves 93.2% accuracy, surpassing standard CoT (89.3%) and a CNN–LSTM (Convolutional Neural Network–Long Short-Term Memory) baseline (91.2%), while cutting latency to under 1.1 s and memory usage by 65%. A six-month validation at pump manufacturing facilities demonstrated a 35% reduction in maintenance costs and 98% faster diagnostics versus manual methods, proving its viability for Industrial Internet of Things (IIoT) deployment.

1. Introduction

Industrial pumps are foundational components of the modern economy, serving as the workhorses for fluid transport in critical sectors such as petrochemical processing, water/wastewater treatment, power-generation plants, and manufacturing facilities. The global industrial pump market, valued at tens of billions of dollars and projected to grow steadily, underscores their economic significance. However, this ubiquity comes with a significant operational challenge: unexpected pump failures. Such failures can trigger catastrophic consequences, including costly production halts, which can amount to hundreds of thousands of dollars per hour in lost revenue, and severe safety and environmental hazards. The complexity of pump failure modes—ranging from bearing wear and impeller erosion to shaft misalignment and cavitation—further complicates maintenance efforts [1]. The incipient signs of these faults are often subtle and embedded within noisy operational data, making early detection a formidable task. Consequently, industries are increasingly shifting from reactive or scheduled maintenance to predictive maintenance (PdM) strategies, which aim to forecast failures before they occur, thereby optimizing maintenance schedules and enhancing equipment reliability [2,3]. Central to modern PdM is the use of non-intrusive condition monitoring, with vibration analysis emerging as one of the most effective techniques due to its ability to capture rich, dynamic information about the health status of rotating machinery [4].
The field of vibration-based fault diagnosis has evolved significantly over the past decades [5]. Early approaches relied heavily on signal processing techniques to extract fault-sensitive features from raw vibration data. These techniques are broadly categorized into time-domain, frequency-domain, and time–frequency domain analyses. Time-domain analysis utilizes statistical metrics such as root mean square (RMS), peak value, and kurtosis to track changes in signal energy and impulsiveness. While computationally simple, these methods often lack the sensitivity to distinguish between different fault types, as they do not capture frequency-specific information. Frequency-domain analysis, primarily using the Fast Fourier Transform (FFT), addresses this by revealing characteristic frequencies associated with specific faults, such as bearing defect frequencies or blade pass frequencies. However, its primary limitation is its unsuitability for non-stationary signals, which are common in industrial applications with varying speeds and loads, as FFT assumes signal periodicity and loses temporal information. To overcome this, time–frequency analysis methods such as the Wavelet Transform (WT), Wavelet Packet Transform (WPT), and Empirical Mode Decomposition (EMD) were introduced, offering a simultaneous view of a signal’s temporal and spectral characteristics [6]. Despite their power in handling non-stationary signals, these methods often introduce high computational complexity and require significant expert knowledge for parameter tuning, such as the selection of an appropriate mother wavelet, which can hinder their application in real-time systems [7].
To automate the diagnostic process and reduce reliance on manual feature engineering, the research community has progressively adopted data-driven methods. This evolution began with knowledge-based expert systems in the 1980s and 1990s, which used predefined if–then rules to mimic the reasoning of a human expert. While interpretable, these systems were brittle, difficult to maintain, and struggled with knowledge acquisition bottlenecks. The 2000s saw the rise of classical machine learning models, such as Support Vector Machines (SVMs), which demonstrated strong classification performance on non-linear data but still relied on meticulously handcrafted features extracted by domain experts [8]. The current era is dominated by deep learning (DL) models, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), which have revolutionized fault diagnosis by enabling end-to-end learning directly from raw or minimally processed data [9]. These models can automatically learn hierarchical and discriminative features, achieving state-of-the-art accuracy [10]. However, their primary drawback is a profound lack of interpretability. DL models often function as “black boxes,” providing a diagnostic conclusion without a transparent or verifiable reasoning process. This opacity erodes trust, a critical factor in high-stakes industrial environments where maintenance decisions have significant financial and safety implications, thus creating a major barrier to the widespread adoption of otherwise powerful AI solutions [11].
This critical trade-off between the accuracy of black-box models and the need for trustworthy, explainable AI (XAI) presents a significant opportunity for innovation [12]. Recently, Chain-of-Thought (CoT) reasoning, a technique originating from the field of large language models (LLMs), has shown remarkable potential in eliciting complex, step-by-step reasoning processes [13]. CoT prompting guides a model to break down a complex problem into a sequence of intermediate, logical steps, ultimately leading to a final answer. This approach not only improves performance on complex reasoning tasks but also provides an interpretable window into the model’s decision-making process, as the generated chain of thought serves as a natural explanation. We posit that this reasoning paradigm is exceptionally well-suited for industrial fault diagnosis. The diagnostic process undertaken by a human expert—observing symptoms, forming hypotheses, and sequentially verifying evidence (e.g., checking for specific frequency harmonics after detecting a temporal impulse)—is fundamentally a chain of thought. By adapting CoT to the domain of vibration analysis, we can potentially create a system that combines the automatic feature learning power of deep learning with the transparent, step-by-step logic of an expert system, thereby bridging the gap between accuracy and interpretability.
The primary motivation for this research is to develop an intelligent fault diagnosis framework that is not only highly accurate but also transparent, efficient, and practical for deployment in real-world Industrial IoT (IIoT) environments [14]. However, adapting CoT reasoning to pump vibration data presents unique technical challenges. First, CoT was originally designed for natural language tasks, and its application to high-frequency, numerical time-series data from sensors is a non-trivial problem that requires a new way of representing and processing information. Second, a deep, multi-step reasoning process can be computationally intensive, potentially conflicting with the real-time requirements of industrial monitoring. This necessitates a mechanism for adaptive computation, where the reasoning complexity is dynamically adjusted based on the difficulty of the diagnostic task. Third, to ensure the reasoning process is robust, the model must be supplied with a comprehensive set of features that capture the full spectrum of fault characteristics. This requires an effective strategy for multi-domain feature fusion, integrating information from the time, frequency, and time–frequency domains to form a holistic view of the machine’s health [15].
To address these challenges, we propose a novel framework named Vibration-Aware Chain-of-Thought (VA-COT) for intelligent pump fault diagnosis. This paper makes the following principal contributions:
  • We introduce the VA-COT framework, a new intelligent fault diagnosis architecture that, for the first time, adapts the Chain-of-Thought reasoning paradigm to interpret and diagnose faults from pump vibration signals.
  • We design a multi-domain feature fusion mechanism that synergistically combines temporal, spectral, and time–frequency characteristics to provide a rich informational basis for the reasoning engine.
  • We develop a novel fault-feature-guided adaptive reasoning depth control strategy. This mechanism dynamically adjusts the computational complexity of the CoT process, ensuring both high accuracy for complex faults and low latency for simple cases, making it suitable for real-time applications.
  • We implement and validate an end-to-end system architecture optimized for lightweight deployment in resource-constrained IIoT environments, demonstrating its practical viability.
  • We construct and will make publicly available a comprehensive, real-world pump vibration dataset encompassing 12 distinct fault categories, which serves as a robust benchmark for evaluating the proposed method and facilitating future research in the field.

2. Related Work

2.1. Evolution of Vibration-Based Fault Diagnosis Techniques

The development of pump fault diagnosis has undergone several paradigm shifts, with each generation of technology addressing the fundamental limitations of its predecessors. The expert-driven era (1980s–2000s) was dominated by rule-based systems [16]. These systems aimed to codify the diagnostic logic of human experts, such as linking vibration RMS velocity exceeding a threshold like 4.5 mm/s to specific bearing defects, achieving accuracies of 68–72% in controlled settings [17]. However, these methods relied heavily on manual interpretation of vibration features and struggled to generalize across variable operating conditions. While techniques like Statistical Process Control (SPC) introduced anomaly detection, they were less effective in handling the nonlinear vibration patterns present during transient states, such as pump startups [18].
The machine learning wave (2000s–mid-2010s) marked a shift toward data-driven diagnostics. Support Vector Machines (SVMs) became a prominent tool; for instance, research combining time-domain features with frequency harmonics on centrifugal pump datasets achieved 85.3% accuracy [19]. The primary constraints of these models were their dependence on handcrafted features, which required significant domain expertise, and their limited capacity for representing compound faults. Ensemble methods like Random Forests partially addressed these issues, demonstrating 82.7% precision in diagnosing concurrent impeller wear and misalignment, though this was still less accurate than human experts in noisy environments [20].
The ongoing deep learning revolution (mid-2010s–present) has redefined diagnostic capabilities [21]. Convolutional Neural Networks (CNNs), when applied to time–frequency images generated by methods like the Continuous Wavelet Transform (CWT), achieved breakthrough performance [22]. Wang et al., for example, attained 93.1% accuracy in detecting incipient bearing spalls (<1 mm²) using a Scalogram–CNN architecture [23]. Recurrent Neural Networks (RNNs), particularly LSTM variants, further leveraged temporal dependencies in vibration sequences [24]. A hybrid CNN–LSTM framework set a new benchmark with 95.4% accuracy across 12 fault classes, although its high computational intensity (3.8 s inference time) hindered real-time deployment [25]. While these deep learning models excel at automatically learning discriminative features and achieving state-of-the-art accuracy, their primary drawback is a profound lack of interpretability. They often function as “black boxes,” eroding the trust necessary for adoption in high-stakes industrial settings.

2.2. Interpretable AI and Chain-of-Thought Reasoning

The trade-off between the accuracy of black-box models and the need for trustworthy, Explainable AI (XAI) has created a significant opportunity for innovation. In industrial environments, where maintenance decisions carry substantial financial and safety implications, an AI’s ability to explain its reasoning is critical for building user trust and facilitating adoption.
Recently, Chain-of-Thought (CoT) reasoning, a technique originating from the field of large language models (LLMs), has shown remarkable potential in eliciting complex, step-by-step reasoning processes [26]. The core idea of CoT is to prompt a model to break down a complex problem into a sequence of intermediate, logical steps that ultimately lead to a final answer. This approach not only improves performance on complex tasks but also provides an interpretable window into the model’s decision-making process, as the generated chain of thought serves as a natural explanation.
We posit that this reasoning paradigm is exceptionally well-suited for industrial fault diagnosis. The diagnostic process undertaken by a human expert—observing symptoms, forming hypotheses, and sequentially verifying evidence (e.g., checking for specific frequency harmonics after detecting a temporal impulse)—is fundamentally a chain of thought. While CoT has proven revolutionary in natural language processing, its application to numerical, non-stationary time-series data from machine vibrations is a novel and underexplored research avenue [27]. This paper aims to bridge that gap by adapting the CoT paradigm to create a diagnostic framework that combines the feature-learning power of deep learning with the transparent, logical reasoning of an expert system.

3. Methodology: The Temporal-Aware CoT Framework

3.1. System Architecture and Problem Formulation

The fundamental problem in industrial pump monitoring is the conversion of complex, high-frequency triaxial sensor data into clear, actionable maintenance insights. The VA-COT framework is designed as an end-to-end system that processes axial, radial, and tangential vibration streams, which are sampled at 25.6 kHz and pre-processed with a bandpass filter (0.1–8000 Hz). The system specializes in diagnosing critical failure modes, including bearing defects that generate characteristic frequency spikes, impeller faults manifesting as harmonic clusters, and shaft misalignment, which produces dominant rotational frequency components.
For each detected anomaly, the framework generates three key outputs: a fault classification (from 12 distinct fault types), a continuous severity score (ranging from 0.0 to 1.0), and a corresponding maintenance action (e.g., Monitor, Alert, Shutdown). For example, a score below 0.3 may indicate minor bearing spalling and trigger routine monitoring, while a score exceeding 0.7 signals a critical condition like severe misalignment that requires immediate intervention.
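As a concrete illustration of this mapping, the short Python sketch below converts a severity score into one of the three maintenance actions using the example thresholds quoted above (0.3 and 0.7); the function name and the intermediate “Alert” band are illustrative assumptions rather than the exact implementation.

```python
def severity_to_action(severity: float) -> str:
    """Map a continuous severity score in [0.0, 1.0] to a maintenance action.

    Thresholds follow the example in the text: scores below 0.3 trigger
    routine monitoring, scores of 0.7 or above require immediate intervention.
    """
    if severity < 0.3:
        return "Monitor"   # e.g., minor bearing spalling
    elif severity < 0.7:
        return "Alert"     # degradation worth a scheduled inspection
    else:
        return "Shutdown"  # e.g., severe misalignment

print(severity_to_action(0.82))  # -> "Shutdown"
```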
The VA-COT architecture implements a four-layer processing cascade (Figure 1):
  • Data Acquisition and Preprocessing: Raw triaxial vibration data is acquired, filtered, normalized, and segmented.
  • Multi-Domain Feature Extraction: Time-domain, frequency-domain, and time–frequency features are extracted in parallel to form a comprehensive representation of the machine’s health.
  • Adaptive Feature Fusion: The extracted features are dynamically weighted and fused into an optimized feature vector that serves as the input for the reasoning engine.
  • VA-COT Reasoning Engine: The core of the framework executes knowledge-guided inference chains with an adaptive depth mechanism to balance accuracy and efficiency, culminating in a diagnostic report.

3.2. Data Preprocessing and Multi-Domain Feature Engineering

To ensure the reliability of the diagnostic process, raw signals must undergo meticulous preprocessing before feature extraction. This workflow begins with the removal of any DC component to eliminate baseline offset. Subsequently, a band-pass filter is applied with a passband of 0.1 Hz to 10 kHz. The low cutoff frequency (0.1 Hz) removes signal drift, while the high cutoff frequency (10 kHz) suppresses random noise while preserving the high-frequency impact signals characteristic of incipient bearing faults. Finally, the filtered data is standardized and normalized to mitigate the influence of amplitude variations, ensuring stability during model training.
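The following Python sketch outlines this preprocessing chain (DC removal, 0.1 Hz–10 kHz band-pass filtering, and standardization) using SciPy; the filter order, example sampling rate, and function name are assumptions made for illustration, not the exact implementation used in the paper.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess(signal: np.ndarray, fs: float = 25_600.0,
               low: float = 0.1, high: float = 10_000.0) -> np.ndarray:
    """Sketch of the preprocessing chain: DC removal, band-pass filtering,
    and standardization (illustrative parameters)."""
    x = signal - np.mean(signal)                    # remove DC component
    sos = butter(4, [low, high], btype="bandpass",  # 4th-order Butterworth
                 fs=fs, output="sos")
    x = sosfiltfilt(sos, x)                         # zero-phase filtering
    return (x - x.mean()) / (x.std() + 1e-12)       # zero mean, unit variance
```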
Following preprocessing, a multi-domain feature extraction strategy is employed to construct a holistic and robust characterization of the pump’s health status.
Time-Domain Analysis: This method quantifies a signal’s amplitude distribution and waveform morphology over time. We extract key statistical metrics, including Root Mean Square (RMS), Peak, Kurtosis, Skewness, and Crest Factor. Kurtosis, in particular, is highly sensitive to impulsive faults, like early-stage bearing pitting. The formula for kurtosis is as follows [28]:
K = \frac{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^4}{\left(\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2\right)^2}        (1)
where xi represents the discrete sample points and x̄ is the mean. The RMS value reflects the signal’s energy, calculated as follows [29]:
\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^2}        (2)
where xi is a discrete sample point and N is the total number of samples.
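A minimal NumPy implementation of these time-domain statistics, matching Equations (1) and (2), might look as follows; the returned feature names and the small epsilon guard are illustrative choices.

```python
import numpy as np

def time_domain_features(x: np.ndarray) -> dict:
    """Time-domain statistics corresponding to Equations (1) and (2)."""
    mean = x.mean()
    rms = np.sqrt(np.mean(x ** 2))                    # Equation (2)
    var = np.mean((x - mean) ** 2)
    kurtosis = np.mean((x - mean) ** 4) / (var ** 2)  # Equation (1)
    skewness = np.mean((x - mean) ** 3) / (var ** 1.5)
    peak = np.max(np.abs(x))
    return {
        "RMS": rms,
        "Peak": peak,
        "Kurtosis": kurtosis,
        "Skewness": skewness,
        "CrestFactor": peak / (rms + 1e-12),
    }
```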
Frequency-Domain Analysis: Using the Fast Fourier Transform (FFT), the time-domain signal is converted into its frequency spectrum to identify characteristic frequencies associated with specific components. We focus on identifying the rotating frequency (fr), blade pass frequency (fbp = Z × fr), and bearing characteristic fault frequencies (BPFI, BPFO, BSF, FTF) to diagnose issues like unbalance, impeller damage, and bearing defects.
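As a sketch of this step, the snippet below computes a one-sided FFT amplitude spectrum and reads off the amplitude near a characteristic frequency such as fbp = Z × fr; the shaft speed, blade count, and tolerance band are placeholder assumptions for illustration.

```python
import numpy as np

def spectrum(x: np.ndarray, fs: float):
    """One-sided amplitude spectrum via the FFT."""
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    amps = np.abs(np.fft.rfft(x)) * 2.0 / n
    return freqs, amps

def amplitude_at(freqs, amps, target_hz, tol_hz=1.0):
    """Peak amplitude within +/- tol_hz of a characteristic frequency."""
    band = (freqs >= target_hz - tol_hz) & (freqs <= target_hz + tol_hz)
    return amps[band].max() if band.any() else 0.0

# Example: rotating frequency fr and blade pass frequency fbp = Z * fr
fr = 1490 / 60.0   # assumed 1490 RPM shaft speed (illustrative)
Z = 6              # assumed number of impeller blades (illustrative)
# freqs, amps = spectrum(x, fs=25_600.0)
# fbp_amp = amplitude_at(freqs, amps, Z * fr)
```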
Time–Frequency Analysis: To handle non-stationary signals common in industrial applications, time–frequency methods are used to capture how frequency components evolve over time. We utilize the following:
Continuous Wavelet Transform (CWT): This method provides multi-resolution analysis ideal for detecting transient impacts. It is defined as follows [30]:
W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt        (3)
where a is the scale factor, b is the translation factor, ψ(t) is the mother wavelet function, and ψ* denotes its complex conjugate.
Empirical Mode Decomposition (EMD): This adaptive method decomposes a signal into a series of Intrinsic Mode Functions (IMFs), each representing a distinct oscillatory mode. The signal x(t) is represented as follows [31]:
x(t) = \sum_{i=1}^{n} \mathrm{IMF}_i(t) + r_n(t)        (4)
where rn(t) is the final residual component.
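Both transforms are available in common open-source libraries; the sketch below assumes PyWavelets for the CWT of Equation (3) and the PyEMD package for the decomposition of Equation (4), with the Morlet wavelet and scale range chosen purely for illustration.

```python
import numpy as np
import pywt            # PyWavelets (assumed available)
from PyEMD import EMD  # PyEMD / EMD-signal package (assumed available)

def cwt_scalogram(x: np.ndarray, fs: float, scales=np.arange(1, 128)):
    """Continuous Wavelet Transform (Equation (3)) with a Morlet mother wavelet."""
    coef, freqs = pywt.cwt(x, scales, "morl", sampling_period=1.0 / fs)
    return np.abs(coef), freqs   # |W(a, b)| scalogram and scale-to-frequency map

def emd_decompose(x: np.ndarray):
    """Empirical Mode Decomposition (Equation (4)): x(t) = sum(IMF_i) + r_n(t)."""
    emd = EMD()
    emd.emd(x)
    imfs, residual = emd.get_imfs_and_residue()
    return imfs, residual
```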
Finally, these multi-domain features are integrated using an adaptive fusion algorithm. The algorithm calculates feature importance scores to derive domain-specific weights (α, β, γ), which are then used to combine the standardized feature sets into a final, optimized fused feature vector for the VA-COT engine.
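A minimal sketch of this fusion step is shown below; it assumes the three domain feature vectors are already standardized and that the importance scores are supplied externally, since the exact weighting algorithm is not reproduced here.

```python
import numpy as np

def fuse_features(time_f, freq_f, tf_f, importance=(1.0, 1.0, 1.0)):
    """Derive domain weights (alpha, beta, gamma) from importance scores and
    concatenate the weighted, standardized feature sets into one fused vector."""
    w = np.asarray(importance, dtype=float)
    alpha, beta, gamma = w / w.sum()   # normalize weights so they sum to 1
    return np.concatenate([alpha * np.asarray(time_f, dtype=float),
                           beta  * np.asarray(freq_f, dtype=float),
                           gamma * np.asarray(tf_f, dtype=float)])
```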

3.3. Fault Expert Prior Modeling

To overcome the “black-box” limitations of purely data-driven models and enhance interpretability, the VA-COT framework integrates a systematic model of expert domain knowledge. This model is structured into two primary components: an Intelligent Alarm Mechanism for anomaly detection and a Fault Diagnosis Model for root cause identification.

3.3.1. Intelligent Alarm Mechanism

The system employs a dual-mode alarm strategy to ensure both reliability and sensitivity.
Hard Threshold Alarms: This classic approach uses fixed “warning” and “danger” thresholds for key indicators (e.g., overall vibration values). These thresholds are set based on industry standards and historical data, providing a simple and reliable baseline for anomaly detection [32].
Intelligent Alarms: This advanced mechanism uses a comprehensive suite of fault indicators derived from time-domain, frequency-domain, and envelope spectrum analysis [33]. Instead of fixed limits, it leverages statistical regression and time-series forecasting to compute adaptive thresholds based on the equipment’s historical performance. This data-driven approach maximizes the detection rate of slowly deteriorating faults while minimizing false alarms [34]. It also grades alarms by severity and deterioration speed, helping engineers prioritize responses.
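To make the adaptive-threshold idea concrete, the sketch below fits a simple linear trend to an indicator's recent history and adds a k-sigma margin; this is only one plausible realization of the statistical regression and forecasting described above, with k and the regression model chosen for illustration.

```python
import numpy as np

def adaptive_threshold(history: np.ndarray, k: float = 3.0, horizon: int = 1) -> float:
    """Forecast the indicator one step ahead from its recent history and add a
    k-sigma margin around the residual scatter; alarm if the live value exceeds it."""
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, 1)          # simple trend regression
    forecast = intercept + slope * (len(history) - 1 + horizon)
    residual_std = np.std(history - (intercept + slope * t))
    return forecast + k * residual_std

# usage sketch: alarm = current_rms > adaptive_threshold(rms_history)
```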

3.3.2. Fault Diagnosis Models

Upon triggering an alarm, the VA-COT engine invokes specific diagnostic models built upon an expert knowledge base. This knowledge explicitly links fault types to their characteristic vibration signatures. The diagnostic logic is organized by component, as detailed in Table 1.
This formalized knowledge, combining if–then rules and a structured knowledge graph, provides the foundational logic for the VA-COT reasoning chains, ensuring that the diagnostic process is both accurate and interpretable [35].

3.4. VA-COT: An Interpretable Reasoning Framework

3.4.1. Adaptive Reasoning Depth Control

A key innovation of the VA-COT framework is its ability to dynamically balance diagnostic precision and computational efficiency [36]. To achieve this, we developed an Adaptive Reasoning Depth Control mechanism. This mechanism can assess online the complexity of the current vibration signal and dynamically allocate appropriate computational resources to the VA-COT engine, thereby maximizing reasoning efficiency while ensuring diagnostic accuracy.
The logical flow of this mechanism is illustrated in detail in Figure 2.
The mechanism first assesses the complexity of the input signal using a comprehensive scoring function, Scomplexity. This score integrates four key indicators [37]:
S_{complexity} = w_1 H(f) + w_2 S_{fault} + w_3 N_{level} + w_4 I_{compound}        (5)
where H(f) is the spectral entropy measuring spectral disorder, Sfault is a fault severity index, Nlevel is the estimated noise level, and Icompound is an indicator for the presence of multiple concurrent faults.
Based on the calculated complexity score, the system selects one of three reasoning depths:
Shallow Reasoning (e.g., score < 0.3): For clear, simple faults, a 2–3 step process is used for rapid diagnosis.
Medium Reasoning (e.g., 0.3 ≤ score < 0.7): For cases with moderate noise or minor compound faults, a 4–6 step analysis involving more advanced tools is activated.
Deep Reasoning (e.g., score ≥ 0.7): For highly complex or severe compound faults, a comprehensive analysis of 7 or more steps is invoked, integrating multi-source information and knowledge graph-based reasoning.
To further enhance efficiency, the reasoning chain itself is dynamically optimized. This includes on-the-fly feature selection based on preliminary judgments, path optimization using reinforcement learning from historical cases, and an early stopping mechanism that terminates the reasoning process once a diagnostic conclusion reaches a high confidence threshold (e.g., 95%), preventing unnecessary computation.
To ensure the effectiveness and logical integrity of the Adaptive Reasoning Depth Control mechanism, this section elaborates on its triggering mechanism, specifically the calculation of the complexity score, Scomplexity [38]. This scoring process is a critical pre-processing step designed to triage the diagnostic difficulty of a signal before committing to the formal, multi-step VA-COT reasoning chain.
Crucially, it must be clarified that the indicators used in Equation (5) are all fast, preliminary estimations derived using computationally inexpensive heuristics. Their sole purpose is to select an appropriate reasoning path, not to serve as the final diagnostic output, thereby resolving any suspicion of circular reasoning. The rapid estimation methods for each indicator are as follows:
Fault Severity Index (Sfault): This is a heuristic preliminary index, not the final precise score. It is rapidly estimated by comparing the overall energy (RMS value) of the current vibration signal to a historical baseline established during the equipment’s healthy operation. For instance, a signal with energy 50% above its baseline will yield a significantly higher preliminary severity index than one that is 10% above.
Spectral Entropy (H(f)): This metric is directly calculated from the Fast Fourier Transform (FFT) spectrum, which is already generated during the preprocessing stage, to measure spectral disorder. It is a standard and computationally cheap calculation.
Noise Level (Nlevel): The noise level is estimated by analyzing the signal’s spectral floor or the average energy in non-dominant frequency bands. This serves as a quick proxy for the signal-to-noise ratio.
Compound Fault Indicator (Icompound): This indicator is estimated via a fast spectral analysis algorithm that identifies and counts the number of independent, prominent harmonic families present in the spectrum (e.g., a rotational frequency harmonic family, a specific bearing fault frequency harmonic family, etc.). The presence of multiple independent harmonic families is a strong indicator of a more complex compound fault.
Regarding the weights (w1, w2, w3, w4) in Equation (5), they are not set empirically. Instead, they are treated as critical hyperparameters and are determined through a rigorous optimization process. Using a validation subset of our benchmark dataset, a Grid Search methodology was employed to systematically test various weight combinations [39]. The objective function for evaluating each combination was a weighted balance between diagnostic accuracy (F1-Score) and average inference latency. The set of weights that yielded the optimal performance on this objective function was ultimately selected, ensuring that the mechanism allocates computational resources most effectively while maintaining high diagnostic precision [40].
Through this two-stage strategy—a rapid preliminary assessment followed by a selected-depth reasoning—the VA-COT framework can effectively adapt its behavior to the task’s difficulty, resolving the potential for circular logic and achieving a balance between diagnostic accuracy and computational efficiency.
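A compact sketch of this two-stage triage, combining Equation (5) with the three depth bands, is given below; the weights and the exact step budgets are placeholders, since in the paper the weights come from the grid search described above.

```python
def complexity_score(h_f: float, s_fault: float, n_level: float, i_compound: float,
                     w=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Equation (5): weighted sum of the four fast, preliminary indicators.
    The weights shown are placeholders; the paper selects them via a grid
    search balancing F1-score against inference latency."""
    return w[0] * h_f + w[1] * s_fault + w[2] * n_level + w[3] * i_compound

def select_max_depth(score: float) -> int:
    """Map the complexity score to a reasoning-depth budget."""
    if score < 0.3:
        return 3   # shallow reasoning: 2-3 steps
    elif score < 0.7:
        return 6   # medium reasoning: 4-6 steps
    else:
        return 9   # deep reasoning: 7+ steps (illustrative cap)
```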

3.4.2. The VA-COT Hybrid Reasoning Engine: Concepts and Mechanisms

To address the challenge of applying Chain-of-Thought (CoT) reasoning to numerical vibration data, the VA-COT engine is designed not as a natural language generator, but as a knowledge-guided hybrid reasoning engine. It operationalizes the diagnostic process of a human expert by creating a sequence of structured, verifiable steps [41]. The chain of thought is an internal, structured data representation, not a textual output from the core engine [42]. This hybrid approach synergizes a rule-based knowledge base for interpretability with a lightweight neural network for data-driven accuracy [43].
A. The Structured Thought Node
The fundamental unit of the reasoning chain is the Thought Node, a structured object that captures the state of a single diagnostic step. The entire chain is a chronologically ordered list of these nodes. Each node is defined as follows:
ThoughtNode = {
    step_id: Integer,             // The sequence number of the step (e.g., 1, 2, 3)
    current_hypothesis: String,   // The primary hypothesis being tested in this step (e.g., “Investigating potential bearing fault”)
    supporting_evidence: Dict,    // Quantitative evidence from the feature vector supporting the hypothesis (e.g., {Kurtosis: 4.5, BPFO_Amplitude: 0.8})
    confidence_score: Float,      // The calculated confidence (0.0 to 1.0) in the current hypothesis
    status: Enum                  // The outcome of the investigation for this node: [Confirmed, Rejected, Requires_Further_Analysis]
}
This structure transforms the abstract CoT concept into a concrete computational entity, allowing the diagnostic path to be explicitly tracked and audited.
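For readers implementing the framework, the pseudocode above maps naturally onto a small Python data class; the rendering below is a minimal sketch of that mapping, not the authors' exact code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict

class Status(Enum):
    CONFIRMED = "Confirmed"
    REJECTED = "Rejected"
    REQUIRES_FURTHER_ANALYSIS = "Requires_Further_Analysis"

@dataclass
class ThoughtNode:
    step_id: int                        # sequence number of the step
    current_hypothesis: str             # hypothesis under test
    supporting_evidence: Dict[str, float] = field(default_factory=dict)
    confidence_score: float = 0.0       # confidence in [0.0, 1.0]
    status: Status = Status.REQUIRES_FURTHER_ANALYSIS
```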
B. The Iterative Reasoning Process
The VA-COT engine constructs the chain of Thought Nodes through a Hypothesis–Verification–Iteration loop, which begins after the complexity score Scomplexity has determined the required reasoning depth (Shallow, Medium, or Deep).
  • Initiation (Step 0: Anomaly Detection): The process starts with the fused multi-domain feature vector. The engine first checks macro-level health indicators (e.g., overall RMS, Peak value). If these indicators breach the baseline established by the intelligent alarm mechanism (Section 3.3.1), the first node, ThoughtNode_0, is generated with the hypothesis “Anomaly Detected.”
  • Hypothesis Generation and Selection (The Hybrid Core): For each subsequent step, the engine generates and selects a new hypothesis using its hybrid core. This process is detailed further in the following section.
  • Evidence Focusing and Confidence Update: Once a hypothesis is selected, the engine focuses its analysis on the most relevant features from the input vector. This targeted evidence is used to calculate a new confidence_score for the hypothesis.
  • Iteration and Termination: The engine evaluates the confidence_score. If it exceeds the predefined threshold (e.g., 95%), the hypothesis is marked as “Confirmed,” and the reasoning chain may terminate early to prevent unnecessary computation. If the confidence is low, the loop continues to the next step until the maximum depth is reached or a conclusion is found.
To ensure the auditability and reproducibility of the reasoning process, the components of the hybrid engine are precisely defined. Our knowledge base consists of 92 rules derived from domain expertise and maintenance manuals, covering the primary failure modes of the target equipment. In cases where multiple rules are triggered simultaneously by the evidence from a ThoughtNode, all resulting candidate hypotheses are passed to the Hypothesis Scorer. This neural network thus acts as the conflict resolution mechanism, using data-driven evidence to arbitrate between competing, rule-based suggestions. The Hypothesis Scorer itself is a lightweight Multi-Layer Perceptron (MLP) with two hidden layers (64 and 32 neurons, respectively) using ReLU activation functions, optimized for high-speed inference on edge devices. Further details on the reasoning engine’s specifications are provided in Appendix C.
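Based on the specification above (two hidden layers of 64 and 32 neurons with ReLU activations), a minimal PyTorch sketch of the Hypothesis Scorer could look as follows; the input dimension and the sigmoid output head are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class HypothesisScorer(nn.Module):
    """Lightweight MLP: two hidden layers (64 and 32 neurons) with ReLU,
    producing one relevance score per candidate hypothesis. The input is
    assumed to be the fused feature vector combined with a candidate-hypothesis
    encoding (dimension 80 is illustrative)."""
    def __init__(self, input_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),          # single relevance score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x)).squeeze(-1)

# scores = HypothesisScorer()(torch.randn(3, 80))  # one row per candidate
# selected = scores.argmax()
```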
C. Connecting the Reasoning Steps: The Decision and Transition Mechanism
The transition from one reasoning step (N) to the next (N + 1) is a structured process that ensures the diagnostic path is logical and efficient. This mechanism directly answers how the output of one step informs the input for the next.
Let us use an example: after the engine confirms an impact in the time domain (ThoughtNode_N), it decides to check for specific frequency harmonics in ThoughtNode_N + 1 as follows:
  • Knowledge-Guided Interrogation Targeting: The confirmed hypothesis from ThoughtNode_N (e.g., “Impulsive Fault Signature Detected”) is used to query the expert knowledge base. The knowledge base contains explicit mappings between fault signatures and subsequent analytical procedures. A rule such as IF signature IS “Impulsive” THEN next_target IS “Analyze_Frequency_Spectrum_for_Harmonics” dictates the next logical action. This provides a transparent, rule-based justification for the transition.
  • Scoping the Hypothesis Space: The determined next_target (“Analyze_Frequency_Spectrum”) is then used to constrain the search space for the next set of hypotheses. The engine queries the knowledge base again with this new context (e.g., GIVEN context IS “Impulsive_Fault” AND target IS “Analyze_Spectrum”, retrieve plausible root_causes). This results in a refined, focused list of candidates, such as [“Outer Race Defect (BPFO)”, “Inner Race Defect (BPFI)”, “Rolling Element Defect (BSF)”], preventing the consideration of irrelevant faults.
  • Data-Driven Selection: Finally, the lightweight neural “Hypothesis Scorer” takes this refined list of candidates and the full feature vector as input. Guided by the interrogation target, it prioritizes the frequency-domain features and assigns a relevance score to each candidate. The hypothesis with the highest score (e.g., “Outer Race Bearing Defect,” if a strong BPFO signal is present) is selected as the current_hypothesis for ThoughtNode_N + 1.
This structured transition ensures the entire Chain-of-Thought proceeds in a manner that is both mechanistically sound and interpretable to a human expert.
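The sketch below illustrates how such a rule-guided transition can be encoded as simple lookups; the two entries shown are hypothetical examples in the spirit of the rules quoted above, not excerpts from the actual 92-rule knowledge base.

```python
# Hypothetical rule entries illustrating the transition mechanism.
TRANSITION_RULES = {
    "Impulsive Fault Signature Detected": "Analyze_Frequency_Spectrum_for_Harmonics",
}

CANDIDATES_BY_TARGET = {
    "Analyze_Frequency_Spectrum_for_Harmonics": [
        "Outer Race Defect (BPFO)",
        "Inner Race Defect (BPFI)",
        "Rolling Element Defect (BSF)",
    ],
}

def next_candidates(confirmed_hypothesis: str):
    """Query the knowledge base for the next analysis target, then scope the
    hypothesis space to the plausible root causes for that target."""
    target = TRANSITION_RULES.get(confirmed_hypothesis)
    return target, CANDIDATES_BY_TARGET.get(target, [])
```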
D. Final Report Generation
The output of the core VA-COT engine is the final, structured list of Thought Nodes. This list is then passed to a separate, post-processing Report Generation Module. This module uses a template-based system to translate the structured findings into the human-readable diagnostic text, such as that presented in the industrial case study. For example, a confirmed ThoughtNode with hypothesis = rubbing characteristics and evidence pointing to the free-end bearing is mapped to the corresponding sentence in a predefined report template. This decouples the core numerical reasoning from complex natural language generation, ensuring efficiency and reliability.

3.4.3. Execution of the Reasoning Chain: An Algorithmic View

To provide an unambiguous and transparent view of the VA-COT engine’s internal workflow, the conceptual mechanisms described in the previous sections are formalized in Algorithm 1. This pseudocode illustrates the end-to-end process, from receiving the fused feature vector to outputting a final diagnostic report. It integrates the adaptive depth control, the iterative hybrid reasoning loop, and the final report generation into a single, cohesive algorithm.
Algorithm 1: The VA-COT Diagnostic Process
Function VA_COT_Diagnose(Input: fused_feature_vector)
  // Phase 1: Adaptive Depth Selection
  complexity_score = Calculate_Complexity_Score(fused_feature_vector)
  max_depth = Select_Max_Depth(complexity_score)
  // Phase 2: Initialization of the Reasoning Chain
  reasoning_chain = new List<ThoughtNode>()
  initial_node = Create_Initial_Node(fused_feature_vector)   // e.g., hypothesis: “Anomaly Detected”
  reasoning_chain.add(initial_node)
  // Phase 3: Iterative Reasoning Loop
  for step from 1 to max_depth:
    last_confirmed_node = reasoning_chain.get_last_confirmed()
    // Knowledge-guided hypothesis proposal using the expert prior
    candidate_hypotheses = Propose_Hypotheses(knowledge_base, last_confirmed_node)
    // Data-driven hypothesis selection
    selected_hypothesis = Score_and_Select_Hypothesis(neural_scorer, fused_feature_vector, candidate_hypotheses)
    // Evidence gathering and confidence update
    evidence, confidence = Gather_Evidence(fused_feature_vector, selected_hypothesis)
    // Create and append the new node to the chain
    new_node = new ThoughtNode(step, selected_hypothesis, evidence, confidence)
    reasoning_chain.add(new_node)
    // Check for early stopping based on the confidence threshold
    if confidence > CONFIDENCE_THRESHOLD:
      new_node.status = “Confirmed”
      break   // exit the loop once the conclusion is confident
    end if
  end for
  // Phase 4: Final Report Generation
  diagnostic_report = Generate_Report_From_Chain(template_module, reasoning_chain)   // fault type, severity score, maintenance action
  Return diagnostic_report
End Function

4. Experiment Setup and Evaluation

4.1. Experimental Platform and Datasets

4.1.1. Benchmark Dataset

To evaluate the core performance of the VA-COT framework in a controlled environment, a comprehensive benchmark dataset was utilized. This dataset comprises 5400 labeled vibration samples collected from both centrifugal and gear pumps, encompassing 12 distinct fault categories. A detailed description of the equipment, operating parameters, data acquisition system, and a full list of all 12 fault categories are provided in Appendix A.

4.1.2. Industrial Validation Platform

For real-world validation, the VA-COT framework was deployed on a commercial-grade remote O&M cloud platform, with the edge-side processing handled by an industrial single-board computer (SBC) featuring an ARM Cortex-A72 processor with 1 GB of RAM [44]. This hardware was chosen to reflect typical processing capabilities available in modern IIoT gateways. While optimization techniques such as INT8 quantization with frameworks like CMSIS-NN were evaluated, the reported performance results utilize the full-precision (FP32) model to ensure maximum diagnostic accuracy, as the available memory was sufficient. A detailed breakdown of the on-device memory footprint is provided in Appendix E.

4.2. Baseline Models and Evaluation Metrics

To benchmark the performance of the proposed VA-COT framework, we compared it against several baseline models:
Standard CoT: A version of the Chain-of-Thought model without the temporal-aware adaptations.
CNN–LSTM: A state-of-the-art deep learning model commonly used for time-series classification, which can automatically learn features from raw or minimally processed data.
Support Vector Machine (SVM): A classic and powerful machine learning model often used in fault diagnosis, trained on the same fused feature set as VA-COT.
PatchTST [X]: A state-of-the-art, Transformer-based model for time-series forecasting and classification, representing modern attention-based architectures.
TinyVGG-T: An edge-optimized Convolutional Neural Network, adapted for spectral analysis, representing lightweight deep learning solutions designed for efficiency.
Performance was evaluated using standard classification metrics, including Accuracy, Precision, Recall, and F1-Score. To assess the framework’s suitability for real-time industrial deployment, we also measured average Inference Latency and Memory Usage on the target edge device. The trade-off between accuracy and latency is visualized using a Pareto front analysis.

5. Results

5.1. Performance on Benchmark Dataset

The VA-COT framework was rigorously evaluated on the comprehensive 12-class benchmark dataset. As detailed in Table 2, the framework achieved a superior overall accuracy of 93.2% and an F1-Score of 92.9%. This performance significantly surpasses that of established baseline models, including the state-of-the-art CNN–LSTM (91.2% accuracy, 90.6% F1-Score) and a classic SVM model (88.1% accuracy, 87.7% F1-Score).
An ablation study (Table 3) confirmed the critical contributions of the framework’s core components. Removing the multi-domain feature fusion module led to a notable drop in accuracy, underscoring the importance of a holistic feature set. Similarly, disabling the adaptive reasoning mechanism resulted in a nearly threefold increase in latency (from 1.05 s to 2.85 s) with only a marginal gain in accuracy, validating its effectiveness in optimizing computational resources.
The accuracy–latency trade-off, visualized in the Pareto curve in Figure 3, further illustrates VA-COT’s practical superiority for edge deployment. While models like PatchTST achieve a slightly higher F1-Score (93.1%), their inference latency is over four times greater (4.52 s). Conversely, faster models like TinyVGG-T and SVM sacrifice significant diagnostic accuracy. VA-COT is positioned at the optimal point on the Pareto front, delivering high accuracy with an average latency of just 1.08 s on the target edge device, making it the most balanced solution for real-world industrial applications.

5.2. Industrial Validation and Practical Utility

To validate its real-world efficacy, the VA-COT framework was deployed for six months at the Huafeng Thermoelectric facility. The system demonstrated significant operational value, contributing to a 35% reduction in maintenance costs and accelerating diagnostic processes by 98% compared to manual methods.
A compelling case study involved a critical boiler feedwater pump. The VA-COT system automatically flagged the pump with a “Warning” status and generated a detailed, interpretable diagnostic conclusion: “1. Rubbing characteristics are present at the pump’s free end, which is presumed to be related to bearing damage… 2. Pitting damage and potential minor spalling exist on the inner race of the free-end bearing…”. This AI-driven recommendation prompted a scheduled inspection. Subsequent physical teardown confirmed the diagnosis, revealing “severe wear on the end face between the balance sleeve and the balance drum” and “deep wear marks on the inner surface of the casing seal ring,” which directly corresponded to the identified rubbing and bearing damage. This successful closed-loop validation—from automated online diagnosis to confirmed physical evidence—powerfully demonstrates the framework’s accuracy and practical value in a complex industrial setting.
The framework’s robustness and efficiency were further substantiated by the following:
Cross-Machine Generalization: In a zero-shot transfer task, the model, trained only on centrifugal and gear pump data, retained a high Macro-F1 score of 86.3% when applied to an unseen dataset from industrial screw pumps, demonstrating strong foundational generalization capabilities (Appendix D, Table A2).
Edge Deployment Efficiency: Operating on a resource-constrained ARM Cortex-A72 based single-board computer, the VA-COT framework maintained a minimal memory footprint of just 4.4 MB, confirming its suitability for widespread deployment in IIoT edge computing environments (Appendix E, Table A3).

5.3. Quantitative Evaluation of Explainability

To objectively measure the framework’s interpretability, we conducted a two-pronged evaluation protocol, moving beyond qualitative assessment.
Automated Faithfulness Analysis: We measured how accurately the generated “thought” steps reflected the model’s internal decision-making. By systematically perturbing key features cited as evidence in the reasoning chain, we observed a corresponding drop in the model’s output confidence. The results yielded a high average Faithfulness Score of 0.89 (±0.06), indicating a strong correlation between the explanation provided and the model’s internal logic (Table 4).
Human-Centered Blind Study: Three independent domain experts, blind to the model’s predictions, evaluated the reasoning chains for 30 randomly selected cases. The explanations received high average scores for technical Correctness (4.6/5) and Clarity (4.7/5), with strong inter-rater reliability (ICC of 0.86). A statistically significant positive correlation (p < 0.01) was found between the model’s confidence and the experts’ correctness ratings, confirming that the framework’s reasoning aligns with human diagnostic logic.
These quantitative results confirm that VA-COT successfully balances high diagnostic accuracy with transparent, trustworthy, and actionable insights, directly addressing the critical interpretability gap in conventional “black-box” AI systems. Visual examples of the step-by-step diagnostic reasoning chains for healthy, simple, and complex fault signals are provided in Appendix B.

6. Discussion

6.1. Interpretation of Key Findings

The experimental results robustly demonstrate that the VA-COT framework’s superior performance stems from its unique synthesis of data-driven feature representation and knowledge-guided reasoning. Unlike purely data-driven models like CNN–LSTM that learn opaque correlations, VA-COT leverages a structured expert knowledge base to guide its diagnostic process, effectively mimicking an expert’s logical flow. This synergy allows it not only to identify anomalies but also to interpret their context, leading to higher diagnostic accuracy, as evidenced by its 93.2% accuracy on the 12-class benchmark dataset (Table 2). The ablation study further validates this, showing that removing the multi-domain feature fusion module leads to a significant drop in accuracy, confirming the necessity of a comprehensive feature set for robust diagnosis (Table 3).
The framework’s efficiency gains are primarily explained by its adaptive reasoning depth control. This mechanism dynamically allocates computational resources based on signal complexity, achieving real-time performance on edge devices without sacrificing the depth of analysis required for complex faults. The ablation study highlights this trade-off: disabling this mechanism nearly triples the inference latency (from 1.05 s to 2.85 s) for only a marginal accuracy improvement (Table 3). This efficiency is not merely an academic benefit but a critical requirement for scalable deployment. The accuracy–latency Pareto curve (Figure 3) clearly positions VA-COT as the optimal solution for edge computing, delivering high accuracy at a fraction of the computational cost of powerful but slow models, like PatchTST. However, the 86.3% Macro-F1 score observed in the zero-shot cross-machine generalization experiment (Appendix D, Table A2) suggests that, while the foundational reasoning is robust, the expert priors may still require tuning for novel equipment types not seen during training.

6.2. The Significance of Interpretability in Industrial Practice

A core contribution of this work lies in bridging the critical gap between AI-driven diagnostics and operational trust. The chain of thought generated by VA-COT provides a transparent, step-by-step explanation for its conclusions, which is a stark contrast to the opaque outputs of typical deep learning models. This was powerfully demonstrated in the Huafeng Thermoelectric case study. The system’s output was not a simple label like “Fault Class 11,” but a logical narrative: “1. Rubbing characteristics are present at the pump’s free end… 2. Pitting damage and potential minor spalling exist on the inner race of the free-end bearing…”.
This transparent reasoning process is what builds confidence and empowers plant engineers to make informed, high-stakes decisions, such as scheduling a costly but necessary maintenance intervention. The subsequent physical teardown, which confirmed the AI’s findings, provides a closed-loop validation of this trust. This transforms the AI from a “black-box” tool into a trustworthy collaborative partner, a crucial step for its adoption in safety-critical industrial environments. The quantitative evaluation of explainability further substantiates this, with the framework achieving a high average Faithfulness Score of 0.89 and receiving excellent ratings from domain experts on Correctness (4.6/5) and Clarity (4.7/5) (Table 4). This alignment with human diagnostic logic is directly linked to practical benefits, such as the 35% reduction in maintenance costs observed during the industrial validation.

6.3. Limitations and Future Work

Despite its promising results, this study has several limitations that open clear avenues for future research.
First, the performance of the VA-COT framework is partly dependent on the comprehensiveness of its initial expert prior base. The creation of this knowledge base can be labor-intensive and requires significant domain expertise. Future work will explore the use of Large Language Models (LLMs) to semi-automatically mine maintenance logs and technical manuals to automate and expand this knowledge base. To ensure the knowledge base remains a dynamic and evolving asset, a life-cycle management approach is necessary, incorporating continuous validation against field maintenance reports and a modular architecture for efficient updates.
Second, while the framework was validated in a real-world setting, its generalization capability requires further investigation, as highlighted by the 6.6% performance drop in the zero-shot transfer experiment. To enhance stability across diverse equipment and operating conditions, future work must integrate Deep Transfer Learning (DTL) methodologies. This would involve strategies such as model fine-tuning or domain adaptation to adapt the framework to new machine types with minimal labeled data, thereby improving its robustness.
Third, the current evaluation relies on standard machine learning metrics (e.g., accuracy, F1-score), which do not fully capture the framework’s real-world value. A crucial next step is to conduct a longitudinal field study to evaluate the system using maintenance-centric KPIs. This would involve quantifying its economic and operational impact by measuring metrics such as the average lead-time of early warnings, the impact of false-alarm rates on maintenance costs, and the overall reduction in unplanned downtime and associated expenses.
Fourth, the long-term reliability of the framework necessitates a more comprehensive robustness evaluation. The current pre-processing is described simply as a band-pass filter, without validation under adverse conditions. Future work must include rigorous stress tests to substantiate the framework’s reliability in long-term industrial use. This evaluation should simulate real-world data imperfections, including performance under systematic SNR (Signal-to-Noise Ratio) degradation, resilience to sensor drift, and stability against changes in the machine’s baseline signature due to hardware aging.
Finally, looking beyond diagnostics, a logical and high-value extension is to adapt the chain-of-thought reasoning for prognosis, specifically enabling the prediction of Remaining Useful Life (RUL). This future work will focus on developing models that output probabilistic RUL forecasts with confidence intervals. Such outputs are designed to directly interface with Enterprise Resource Planning (ERP) systems, enabling the automation of procurement of parts and the optimization of maintenance scheduling, thus providing a complete, strategic solution for predictive maintenance.

7. Conclusions

This paper introduced the Vibration-Aware Chain-of-Thought (VA-COT) framework, a novel approach for intelligent pump fault diagnosis that prioritizes both accuracy and interpretability. By integrating multi-domain feature fusion with a knowledge-guided, adaptive reasoning engine, VA-COT effectively addresses the “black-box” problem inherent in many conventional deep learning models.
Our extensive experiments demonstrated the superiority of the VA-COT framework. On a 12-class benchmark dataset, it achieved 93.2% accuracy, outperforming standard CoT and CNN–LSTM models while drastically reducing latency and memory usage. More significantly, its practical efficacy was validated through a six-month deployment in an industrial setting, highlighted by a case study where an online automated diagnosis was subsequently confirmed by physical teardown inspection.
By successfully adapting the Chain-of-Thought paradigm for complex, temporal vibration data, this work provides a robust and trustworthy solution for real-world predictive maintenance. The VA-COT framework represents a significant step toward creating more transparent, reliable, and collaborative AI systems for critical industrial applications, fostering greater confidence in the adoption of Industry 4.0 technologies.

Author Contributions

Conceptualization, Q.L. and Z.Z.; methodology, J.Z. and Q.L.; software, Z.L.; validation, J.Z., Z.L. and Q.L.; formal analysis, J.Z., Z.L. and Q.L.; investigation, Z.L. and Q.L.; resources, J.Z. and Z.Z.; data curation, J.Z., Z.L. and Q.L.; writing—original draft preparation, J.Z. and Q.L.; writing—review and editing, Q.L.; visualization, Z.L. and Z.Z.; supervision, Q.L.; project administration, Q.L.; funding acquisition, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Industry–University–Research Cooperation Project of Ruian City, Zhejiang Province (project name: Research on adaptive finite element analysis and error optimization technology for gear design based on AI Agent; contract no. 2025110143HX).

Data Availability Statement

The 12-class pump vibration dataset generated and analyzed during the current study is scheduled for public release in a Zenodo repository upon acceptance of this manuscript. The source code for the VA-COT framework, the trained Hypothesis Scorer model (scorer.onnx), and a representative, anonymized subset of the knowledge base (rules.json) are openly available in the project’s GitHub. The complete knowledge base contains proprietary information derived from our industrial partner and is therefore not publicly available at this time, pending their final approval for release. It may be available from the corresponding author upon reasonable request and with the express permission of the industrial partner.

Conflicts of Interest

Author Zuopeng Zheng is an employee of Zhejiang TONGLI Transmission Technology Co., Ltd. The other authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Benchmark Dataset Specifications

This appendix provides detailed specifications for the comprehensive 12-class pump fault diagnosis dataset created and used for the evaluation of the VA-COT framework. The dataset was collected from operational high-end pump units to ensure the data reflects real-world industrial conditions.

Appendix A.1. Equipment and Operating Conditions

Pump Types: The dataset includes vibration data collected from two primary types of industrial pumps: centrifugal pumps (e.g., boiler feedwater pumps) and gear pumps. A representative model included in the data acquisition is the Boiler Feedwater Pump, Model 150 × 100(F)DF11M.
Operating Parameters: Data were collected under a variety of operating conditions to ensure model robustness.
Rotational Speed: The dataset covers a range of speeds, from low-speed applications (approx. 600 RPM) to medium-speed operations (approx. 1200 RPM and higher).
Load Conditions: Both steady-state (continuous operation) and non-steady-state (variable load) conditions are included in the dataset, reflecting diverse industrial scenarios.

Appendix A.2. Data Acquisition System

Sensors: Data were captured using industrial-grade, wireless vibration sensors. Both piezoelectric and MEMS sensor types were utilized, integrated into a single unit with temperature sensing capabilities. Key sensor specifications include the following:
  • Frequency Response: 2 Hz–20 kHz (±3 dB).
  • Measurement Range: ≥±50 g.
  • Protection Rating: IP67.
Data Collection: Triaxial vibration signals (axial, radial, tangential) were sampled continuously using a 24-bit A/D converter at a sampling frequency of 51,200 Hz to capture a wide range of fault signatures, from low-frequency unbalance to high-frequency bearing impacts.

Appendix A.3. Fault Categories

The dataset is structured into 12 distinct classes, including a healthy state and 11 common fault types distributed across the motor, coupling, and pump components. The fault conditions were induced and/or validated during scheduled maintenance and teardown inspections. The detailed fault categories are listed in Table A1.
Table A1. Description of the 12 Fault Categories in the Dataset.
Class ID | Component | Fault Category | Description
1 | – | Healthy/Normal | The pump unit is operating under normal conditions without detectable faults.
2 | Motor | Bearing Inner Race Fault | Damage, such as pitting or spalling, on the inner raceway of the motor bearing.
3 | Motor | Bearing Outer Race Fault | Damage on the outer raceway of the motor bearing.
4 | Motor | Bearing Roller Fault | Damage to the rolling elements (e.g., balls, rollers) of the motor bearing.
5 | Motor | Rotor Imbalance | Asymmetrical mass distribution on the motor rotor, often due to fan blade wear or dust accumulation.
6 | Coupling | Misalignment | Angular or parallel misalignment between the motor and pump shafts.
7 | Pump | Bearing Fault | Damage to one of the pump-side bearings (inner/outer race, roller, or cage).
8 | Pump | Impeller Imbalance | Asymmetrical mass distribution on the pump impeller due to wear, corrosion, or coking.
9 | Pump | Structural Looseness | Looseness of components such as bearing seats or anchor bolts on the pump side.
10 | Pump | Cavitation | Formation and collapse of vapor bubbles, typically causing broadband, random vibration signals. (Inferred as a standard pump fault.)
11 | Pump | Rubbing/Friction | Unintended contact between rotating and stationary parts, such as between a balance sleeve and drum.
12 | System | Foundation Looseness | Looseness of the entire pump–motor skid from its foundation, causing distinct low-frequency vibration.

Appendix B. Explainability Metrics and Visual Reasoning Chains

Appendix B.1. Faithfulness Score

Faithfulness quantifies how well the evidence cited in a Thought Node reflects the model’s actual decision. For each sample i, let conf_orig,i denote the model’s confidence on the original feature vector and conf_perturb,i its confidence after zeroing the features listed in supporting_evidence. The per-sample score is then
Faithfulness_i = (conf_orig,i − conf_perturb,i) / conf_orig,i.
The reported average Faithfulness of 0.89 (±0.06) is the mean over all 30 test samples.
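The following sketch illustrates this computation; predict_confidence (a callable returning the model's confidence in its original prediction) and the representation of supporting_evidence as feature indices are assumptions for illustration, not the released implementation.

```python
import numpy as np

def faithfulness(predict_confidence, x: np.ndarray, evidence_idx) -> float:
    """Relative confidence drop after zeroing the cited evidence features.

    predict_confidence: callable mapping a feature vector to the model's
        confidence in its original prediction (assumed interface).
    evidence_idx: indices of the features listed in supporting_evidence.
    """
    conf_orig = float(predict_confidence(x))
    x_perturbed = x.copy()
    x_perturbed[list(evidence_idx)] = 0.0       # remove the cited evidence
    conf_perturb = float(predict_confidence(x_perturbed))
    return (conf_orig - conf_perturb) / conf_orig

# Dataset-level value: the mean over the 30 evaluated samples, e.g.
# scores = [faithfulness(model_conf, x_i, ev_i) for x_i, ev_i in eval_samples]
# avg_faithfulness = float(np.mean(scores))
```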

Appendix B.2. Human-Evaluation Sample Selection

Thirty signals were selected by stratified sampling: 10 healthy samples, 10 simple single-fault samples, and 10 complex or compound-fault samples, ensuring balanced representation across diagnostic difficulty.

Appendix B.3. Visual Reasoning Chains

This subsection presents step-by-step reasoning for three cases: a healthy signal, a simple single-fault signal (Rotor Imbalance), and a complex compound-fault signal (Bearing Outer Race Fault with Looseness). Each step in the chain is accompanied by plots of the corresponding signal domain (e.g., time-domain waveform, frequency spectrum), with the specific features cited as supporting_evidence highlighted.

Appendix C. Open-Source Artifacts and Reproducibility

To facilitate full transparency and encourage further research, we are committed to making the core components of the VA-COT reasoning engine publicly available. The knowledge base, which consists of 92 rules, was co-developed with our industrial partner (Huafeng Thermoelectric) based on their proprietary operational data and maintenance expertise. As such, its full public release is subject to their final approval, a process which is currently underway.
Upon receiving formal clearance, the complete knowledge base (rules.json) and the trained Hypothesis Scorer model (scorer.onnx) will be made available in the project’s GitHub repository. To ensure immediate reproducibility, we have provided a representative, anonymized subset of the rules in the repository. This subset demonstrates the full structure, logic, and syntax of the knowledge base.
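As an illustration of how the released artifacts are intended to be used together, the sketch below loads the anonymized rule subset and queries the Hypothesis Scorer via ONNX Runtime. The file names match the artifacts described above, but the feature dimension and input handling are assumptions rather than the confirmed schema.

```python
import json
import numpy as np
import onnxruntime as ort

# Load the anonymized rule subset released in the GitHub repository.
with open("rules.json", "r", encoding="utf-8") as f:
    rules = json.load(f)
print(f"Loaded {len(rules)} rules")

# Run the trained Hypothesis Scorer on a single feature vector.
session = ort.InferenceSession("scorer.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

FEAT_DIM = 24   # assumed feature-vector length; use the released model's actual input size
feature_vector = np.zeros((1, FEAT_DIM), dtype=np.float32)
scores = session.run(None, {input_name: feature_vector})[0]
print("Hypothesis scores:", scores)
```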

Appendix D. Cross-Model Generalization Experiment

To assess the generalization capability of the VA-COT framework beyond the pump types included in the main training dataset, we performed a zero-shot transfer learning experiment.
Experimental Protocol: The VA-COT model, trained exclusively on the dataset from centrifugal and gear pumps, was directly evaluated on a separate, unseen dataset collected from industrial screw pumps. The model received no fine-tuning or re-training on the target domain data.
Evaluation Metric: We measured the Macro-F1 score on the out-of-domain (screw pump) test set and compared it to the in-domain test set performance.
Results: As shown in Table A2, the Macro-F1 score dropped by 6.6 percentage points (from 92.90% in-domain to 86.30% out-of-domain) when the model was transferred to the new pump type. Even so, VA-COT achieved a high diagnostic F1-score without any training on the target machine type, suggesting that the learned feature representations and reasoning structures generalize well across pump types; a minimal sketch for reproducing the Macro-F1 metric follows Table A2.
Table A2. Zero-Shot, Cross-Model Transfer Performance.
Test Set | Macro-F1 Score
In-Domain (Centrifugal/Gear Pumps) | 92.90%
Out-of-Domain (Screw Pumps) | 86.30%
Performance Drop | −6.60%
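For completeness, the Macro-F1 values in Table A2 can be computed from true and predicted class labels with scikit-learn as sketched below; the label arrays shown are placeholders, not real predictions.

```python
from sklearn.metrics import f1_score

# y_true / y_pred: integer class labels (1-12) for a given test set.
y_true = [1, 2, 2, 7, 10, 10]
y_pred = [1, 2, 3, 7, 10, 9]

# Macro-F1 averages the per-class F1 scores with equal class weight, which
# is the metric reported for the zero-shot transfer experiment.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro-F1: {macro_f1:.4f}")
```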

Appendix E. On-Device Resource Profiling

To provide full transparency on the practical deployment aspects of the VA-COT framework, this section details the on-device memory and power consumption profiles as measured on the target edge hardware (ARM Cortex-A72 based SBC).

Appendix E.1. Memory Footprint Analysis

The 4.4 MB total memory usage reported in Table 2 comprises several components. Table A3 provides a detailed map of the RAM allocation during a typical inference cycle. Most of the memory is consumed by the runtime feature extraction and processing buffers rather than by the model weights themselves.
Table A3. Detailed Memory Footprint of VA-COT on Edge Device.
Component | Memory Usage (MB) | Description
Neural Scorer Weights (FP32) | 0.5 | The memory required to load the lightweight MLP model.
Knowledge Base (JSON Rules) | 0.2 | The memory footprint of the loaded rule set for hypothesis generation.
Feature Extraction and Processing | 2.5 | Memory allocated for raw signal segments, FFT, CWT, and other feature arrays.
VA-COT Engine Inference Cache | 1.2 | Runtime cache for storing the ThoughtNode chain and intermediate states.
Total | 4.4 | Total dynamic memory (RAM) usage during inference.
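For readers wishing to reproduce such measurements, one simple approach is to sample the process's resident set size around an inference call, as in the sketch below using psutil; this is an illustrative method, and the paper does not specify the profiling tool actually used.

```python
import os
import psutil

process = psutil.Process(os.getpid())

def rss_mb() -> float:
    """Resident set size of the current process in MB."""
    return process.memory_info().rss / (1024 * 1024)

baseline = rss_mb()
# run_va_cot_inference(segment)   # placeholder for the actual inference call
after = rss_mb()
print(f"RAM delta around inference: {after - baseline:.1f} MB (baseline {baseline:.1f} MB)")
```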

Appendix E.2. Power Consumption Analysis

Power consumption is a critical factor for continuously operating industrial sensors. We profiled the current draw of the edge device under a standard 3.3 V supply during different operational states. The results, shown in Table A4, indicate modest power requirements, confirming the feasibility of long-term, cost-effective deployment in factory environments.
Table A4. Power Consumption Profile of the Edge Device at 3.3 V.
Operational State | Average Current (mA) | Peak Current (mA)
Idle (System on, no inference) | 150 | -
VA-COT Inference (Average) | 450 | 650
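Combining Table A4 with the ~1.1 s average inference latency from Table 2 yields a rough per-diagnosis energy budget. The short calculation below is an estimate under the stated 3.3 V supply, not a measured value.

```python
SUPPLY_V = 3.3       # supply voltage (V), Table A4
IDLE_MA = 150.0      # idle current (mA), Table A4
INFER_MA = 450.0     # average current during inference (mA), Table A4
LATENCY_S = 1.08     # average VA-COT latency (s), Table 2

# Incremental energy attributable to one diagnosis (above the idle draw).
energy_j = SUPPLY_V * (INFER_MA - IDLE_MA) / 1000.0 * LATENCY_S
print(f"~{energy_j:.2f} J per inference above idle")   # ≈ 1.07 J
```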

Figure 1. Integrated workflow of the Vibration-Aware Chain-of-Thought (VA-COT) diagnostic system. Raw triaxial signals undergo noise filtering and multi-domain feature extraction, with dynamic fusion (α, β, γ weights) feeding adaptive reasoning chains. The edge-cloud coordinated architecture enables real-time fault classification (12 types), severity scoring (0.0–1.0), and maintenance actions (Monitor/Alert/Shutdown) under industrial constraints.
Figure 2. Flowchart of the Adaptive Reasoning Depth Control Mechanism.
Figure 3. Accuracy–Latency Pareto curve comparing VA-COT with baseline models. The optimal trade-off is located in the top-left corner. VA-COT provides the best balance of high F1-Score and low inference latency.
Table 1. Component-Specific Fault Diagnosis Models.
Component Model | Fault Diagnosis Model | Description of Covered Faults
Motor Model | Bearing Diagnosis | Damage to inner/outer race, rolling elements, cage.
 | Imbalance Diagnosis | Unbalance caused by fan dust accumulation, blade wear, rotor casting voids.
 | Misalignment Diagnosis | Misalignment between the motor and the driven machine.
 | Looseness Diagnosis | Looseness of bearing seats, anchor bolts, end-shield distortion.
Pump Model | Bearing Diagnosis | Pump-side bearing inner/outer race spalling, roller micro-pitting.
 | Imbalance Diagnosis | Impeller coking deposits, non-uniform corrosion of impeller vanes.
 | Rubbing Diagnosis | Balance sleeve-drum rubbing, impeller-wear-ring contact.
 | Volute/Casing Corrosion | Volute erosion-corrosion of cast-iron volute wall, casing seal-ring fretting.
Coupling Model | Misalignment Diagnosis | Shaft angular/parallel misalignment at coupling, flexible-element wear.
Data from RONDS, https://www.ronds.com/industry/scenes/general (accessed on 8 May 2025).
Table 2. Performance Comparison with Baseline Models.
Model | Accuracy (%) * | F1-Score (%) * | Avg. Latency (s) * | Memory Usage (MB) *
VA-COT | 93.2 (±1.1) | 92.9 (±1.0) | 1.08 (±0.05) | 4.4 (±0.1)
Standard COT | 89.3 (±1.4) | 88.5 (±1.5) | 1.32 (±0.06) | 4.8 (±0.1)
PatchTST | 93.5 (±1.0) | 93.1 (±1.1) | 4.52 (±0.21) | 15.0 (±0.4)
CNN–LSTM | 91.2 (±1.3) | 90.6 (±1.2) | 3.81 (±0.15) | 12.5 (±0.3)
TinyVGG-T | 90.5 (±1.5) | 89.9 (±1.6) | 0.61 (±0.04) | 2.8 (±0.1)
SVM | 88.1 (±1.7) | 87.7 (±1.8) | 0.45 (±0.03) | 3.5 (±0.1)
* Values in parentheses indicate the 95% Confidence Interval (CI) calculated via a 1000-repetition paired bootstrap. Performance metrics were measured on the target edge device.
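As a reference for the confidence intervals marked with *, the sketch below shows a standard 1000-repetition bootstrap over test-set indices for accuracy; reusing the same resampled indices for every model makes the comparison paired. This is a generic formulation, assumed rather than taken from the authors' code.

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_reps=1000, alpha=0.05, seed=0):
    """95% bootstrap CI for accuracy by resampling test indices with replacement.

    For a paired comparison between models, reuse the same resampled
    indices for every model's predictions.
    """
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_reps):
        idx = rng.integers(0, n, size=n)            # resample the test set
        stats.append(np.mean(y_true[idx] == y_pred[idx]))
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)
```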
Table 3. Ablation Study of the VA-COT Framework.
Model Configuration | Accuracy (%) | F1-Score (%) | Latency (s)
VA-COT (Full Model) | 93.2 | 92.9 | 1.05
VA-COT w/o Adaptive Reasoning | 92.5 | 92.1 | 2.85
VA-COT w/o Multi-Domain Fusion | 90.7 | 90.1 | 1.09
VA-COT w/o Expert Prior | 91.5 | 90.9 | 1.12
Table 4. Quantitative Explainability Evaluation Results.
Metric | Score/Value
Automated Metric
Avg. Faithfulness Score | 0.89 (±0.06)
Human Evaluation (n = 30, 3 raters)
Avg. Correctness (1–5 scale) | 4.6 (±0.4)
Avg. Completeness (1–5 scale) | 4.4 (±0.5)
Avg. Clarity (1–5 scale) | 4.7 (±0.3)
Avg. Actionability (1–5 scale) | 4.5 (±0.4)
Inter-rater Reliability (ICC) | 0.86
Correlation Significance (t-test p-value) | <0.01
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
