1. Introduction
With the increasing integration of the Industrial Internet of Things (IIoT) and intelligent maintenance technologies, the fusion of high-precision perception and real-time decision-making has become crucial for ensuring the safe operation of complex electromechanical systems [1,2,3]. Elevators are indispensable vertical transportation systems in high-rise buildings, and their operational reliability directly affects public safety [4]. However, traditional manual inspection and periodic maintenance are inefficient, struggle to capture subtle hardware degradation trends during operation, and respond slowly to sudden failures. With the advancement of sensor technology, data-driven intelligent fault diagnosis has become a research hotspot [5]. Nevertheless, establishing a complete closed loop from physical signals to maintenance decisions remains a significant challenge.
Micro-Electro-Mechanical System (MEMS) sensing technology integrates mechanical elements, sensors, and electronic circuits onto a single chip through micron-scale fabrication processes, offering advantages such as small size, low power consumption, fast response, and ease of mass production [6]. It has become a core sensing means in the field of industrial monitoring [7,8,9]. In elevator fault monitoring, MEMS-based sensors like accelerometers and microphones can capture microscopic physical changes difficult for traditional macro-sensors to detect, such as specific frequency vibrations from early bearing wear or subtle jitter caused by micron-level guide rail irregularities [10,11]. Common elevator fault types can be categorized into mechanical component wear (such as bearing wear and guide shoe wear), drive system anomalies (e.g., traction machine faults and motor current imbalance), guide rail degradation (e.g., uneven rail joints and deformation), door system sticking, and faults induced by environmental factors (e.g., pit water accumulation and temperature and humidity exceeding limits). In MEMS sensor monitoring, these faults manifest as specific physical indicators: for example, early-stage bearing wear can be identified by a vibration acceleration envelope peak > 0.12 gE; traction machine misalignment or rotor imbalance can be diagnosed by a vibration velocity RMS > 3.5 mm/s; uneven guide rails can cause the vertical vibration peak inside the car to exceed 0.6 m/s²; guide shoe wear is reflected by a horizontal lateral vibration A95 value exceeding 0.25 m/s²; door system sticking is often accompanied by an abnormal sound pressure level increase exceeding 15 dB; electrical faults, such as motor inter-turn short circuits, manifest as a three-phase current unbalance exceeding 10%; inverter faults can be detected by a total voltage harmonic distortion rate exceeding 5%; the risk of pit water accumulation is monitored by a water level sensor with a threshold set at 50 mm; and environmental factors like pit humidity exceeding 85% trigger an environmental alert.
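The indicator thresholds listed above lend themselves to a simple rule table. The following Python fragment is a minimal sketch of such a table, not the system's actual implementation: the threshold values are those quoted in the text, whereas the dictionary keys, the `Indicator` class, and the `check_indicators` function are hypothetical names introduced for illustration, and a strict "greater than" comparison is assumed for every indicator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    fault: str        # fault associated with the indicator in the text
    unit: str         # physical unit of the reading
    threshold: float  # alert threshold quoted in the text


# Keys are hypothetical channel names; thresholds come from the paragraph above.
INDICATORS = {
    "envelope_peak": Indicator("early-stage bearing wear", "gE", 0.12),
    "velocity_rms": Indicator("traction machine misalignment or rotor imbalance", "mm/s", 3.5),
    "car_vertical_peak": Indicator("uneven guide rails", "m/s^2", 0.6),
    "lateral_a95": Indicator("guide shoe wear", "m/s^2", 0.25),
    "spl_increase": Indicator("door system sticking", "dB", 15.0),
    "current_unbalance": Indicator("motor inter-turn short circuit", "%", 10.0),
    "voltage_thd": Indicator("inverter fault", "%", 5.0),
    "pit_water_level": Indicator("pit water accumulation", "mm", 50.0),
    "pit_humidity": Indicator("pit humidity alert", "%", 85.0),
}


def check_indicators(readings):
    """Return the faults whose reading strictly exceeds its threshold."""
    alerts = []
    for name, value in readings.items():
        indicator = INDICATORS.get(name)
        if indicator is not None and value > indicator.threshold:
            alerts.append(indicator.fault)
    return alerts


if __name__ == "__main__":
    print(check_indicators({"envelope_peak": 0.15, "velocity_rms": 2.0}))
```

Fixed thresholds of this kind only flag individual indicators; they do not correlate evidence across sensors.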
However, MEMS perception systems face a series of inherent limitations in practical industrial deployment, hindering their transition from laboratory accuracy to field usability. Through in-depth analysis of the elevator monitoring scenario, we identify three core limitations of MEMS perception systems in industrial applications. (1) The conflict between data throughput and computational power: The high sampling frequency of MEMS sensors generates massive data streams, imposing stringent requirements on the real-time processing capabilities of edge devices. Traditional cloud-based solutions fail to meet real-time needs due to transmission latency [12]. (2) Signal non-uniformity and noise interference: In industrial environments, robustness to noise, sensor drift, and varying operating conditions is essential for the practical deployment of fault diagnosis systems [13]. However, models trained under laboratory conditions often experience significant performance degradation when applied to real-world scenarios [14]. Environmental noise and electromagnetic interference in industrial settings, along with the inherent bias drift and thermal noise of MEMS devices, can easily submerge weak fault features [15]. Furthermore, critical fault signals are often non-uniformly distributed over time [16], contradicting the uniformity assumption of standard time-series models. (3) The semantic gap between numerical output and business decisions: MEMS sensors only provide physical quantity values such as voltage and frequency, lacking semantic explanations for root causes or the ability to generate specific repair suggestions. This makes it difficult for maintenance personnel to translate monitoring data into executable decisions.
Existing research methods have obvious shortcomings in addressing the above limitations. Traditional machine learning methods, such as Support Vector Machines (SVMs) [17] and multi-attribute decision making [18], struggle to handle the long-sequence characteristics and multi-modal correlations of MEMS data. Deep time-series models such as LSTM [19] and TCN [20] can capture certain temporal dependencies but suffer from gradient vanishing or limited receptive fields when processing ultra-long sequences, as well as from insufficient robustness [21], and their computational efficiency often fails to meet edge deployment requirements. The standard Transformer architecture [22], while possessing powerful long-sequence modeling capabilities, has high computational complexity and a uniform positional encoding mechanism. This makes it unsuitable for resource-constrained edge environments and ineffective at capturing the event-driven nature of industrial data. Additionally, most existing diagnostic systems stop at fault classification, lacking the ability to combine physical evidence with domain knowledge for root cause analysis and repair suggestion generation. In industrial intelligent diagnosis, interpretability is critical to credible and applicable diagnostic conclusions. Zhao [23] argues that model parameters must carry clear physical or statistical significance to realize the mapping from data features to system states.
To systematically address these challenges, this study proposes an industrial diagnostic framework that deeply integrates edge computing with MEMS perception systems. At the edge side, the framework incorporates a lightweight time-series Transformer model named ELiTe-Transformer. It employs an industrial positional encoding mechanism to adapt to the event-driven characteristics of MEMS data and utilizes linear attention and INT8 quantization techniques for efficient inference, directly processing high-frequency data streams from multiple MEMS sensors. In the cloud, retrieval-augmented generation (RAG) technology is introduced to construct a professional knowledge base that integrates industry standards, technical manuals, and historical cases. It converts low-level MEMS sensor signals into physically grounded, traceable, and actionable maintenance decisions, thus bridging the semantic gap. Through edge–cloud collaboration, the framework ensures millisecond-level real-time response capability while achieving expert-level diagnostic depth, comprehensively addressing the various limitations of MEMS perception systems in industrial applications.
The main contributions of this paper are as follows:
This work identifies three critical gaps in industrial MEMS perception deployments: data throughput versus computing power, noise versus feature extraction, and value versus decision-making. Accordingly, we constructed an end-to-end edge–cloud collaborative diagnostic framework rather than simply stacking algorithms.
We propose ELiTe-Transformer, a novel lightweight temporal model tailored for the event-driven nature of MEMS data. Diverging from the uniform position encoding of standard Transformers, we designed industrial positional encoding, which physically enhances the model’s sensitivity to weak fault signals. This enables millisecond-level real-time inference on the Jetson edge platform.
We innovatively applied RAG technology to bridge the data-to-decision gap. Extending beyond using LLMs solely for text-based Q&A, we constructed an elevator fault feature-augmented RAG agent. This agent can deeply integrate multi-dimensional MEMS feature chains extracted at the edge with structured domain knowledge to automatically generate reports containing root cause analysis and maintenance recommendations.
This work yielded a deployable intelligent maintenance agent. The research was not confined to simulation; it was validated on real-world elevator datasets and Jetson hardware, providing a complete intelligent operational maintenance solution for industrial MEMS perception.
The remainder of this paper is structured as follows:
Section 2 reviews related work;
Section 3 details the system design and methodology;
Section 4 presents experimental results and discussion;
Section 5 concludes the paper and outlines future research directions.
4. Results
4.1. Experimental Environment and Dataset
To validate the effectiveness of the proposed edge–cloud collaborative intelligent operation and maintenance system for elevators based on MEMS perception, we established a real experimental environment and constructed an evaluation dataset from multi-MEMS sensor data collected at elevator sites. The edge side of the system was deployed on an NVIDIA Jetson Xavier NX (NVIDIA Corporation, Santa Clara, CA, USA) platform, configured with a 6-core ARM v8.2 64-bit CPU, a 384-core Volta GPU, and 8 GB LPDDR4x memory. The inference framework used was TensorRT 8.4, with model precision set to INT8 quantization. The cloud-side agent was deployed on a server equipped with an NVIDIA A100 GPU used to run the R1-Distill-Qwen-32B large language model and the vector retrieval service for the RAG knowledge base. The software environment included Python 3.8, PyTorch 1.12, Transformers 4.20, and ChromaDB.
For model training and performance evaluation, we collected continuous operational data over 6 months via deployed MEMS-based accelerometers, microphones, IMUs, and environmental sensors. Approximately 50,000 valid multi-scale sliding window samples were extracted. These were split using a time-based approach into training, validation, and test sets in a 7:2:1 ratio. The final test set contained 2000 samples covering five major categories and a total of 12 specific faults, including bearing wear, abnormal motor current, door mechanism jamming, guide shoe aging, and hoistway water ingress. We selected models widely used in industrial time-series analysis, namely LSTM [19], 1D-CNN [19], TCN [20], and Transformer [22], as baselines. All models were trained on the same training set using the AdamW optimizer and cross-entropy loss function. The optimal number of training epochs was determined via an early stopping strategy on the validation set to ensure a fair comparison of edge-side model performance. To specifically test the collaborative diagnostic capability of the cloud-side agent, we further delineated a complex fault subset containing 200 scenarios from the test set. The construction of the complex fault subset is based on two criteria: (i) faults involving multiple subsystems that require cross-sensor correlation for accurate diagnosis and (ii) faults with weak features in a single sensor channel, identified by domain experts based on historical maintenance logs. In practice, we first extracted all samples labeled as known multi-source or weak fault types from the test set, such as early-stage bearing faults and intermittent door jamming. Subsequently, three senior elevator maintenance engineers reviewed these samples to confirm that their correct diagnosis required integrating information from at least two different MEMS sensor modalities. We designed three sets of comparative experiments. Group A (No Collaboration) relied solely on the independent detection results of each edge lightweight model, simulating traditional decentralized MEMS monitoring solutions. In Group B (Pure LLM), raw multi-sensor data were fed directly into R1-Distill-Qwen-32B for feature extraction and fault diagnosis without leveraging edge model preprocessing. Specifically, we converted each sliding window sample into a structured textual prompt. For a given time window, the data from each MEMS sensor were listed in chronological order in a comma-separated format, along with the corresponding timestamps.
Group C (proposed system) employed the edge–cloud collaborative architecture proposed in this paper, where the small edge model output preliminary detection results and multimodal features, and the large cloud model agent performed comprehensive diagnosis and knowledge-enhanced reasoning. Finally, we compared R1-Distill-Qwen-32B with and without the professional elevator knowledge RAG enhancement to verify the contribution of the RAG knowledge base to the accuracy, professionalism, and reliability of the cloud-side agent’s diagnostic reports. For this comparison, we constructed a test set containing 200 fault scenarios, of which 100 had diagnostic conclusions that relied heavily on external professional knowledge, while the remaining 100 were relatively general problems. An example of fault analysis by our system is illustrated in Figure 10.
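The Group B prompt construction described above (per-sensor readings in chronological order, comma-separated, with timestamps) can be sketched as follows. This is only an illustration under stated assumptions: the function name `window_to_prompt`, the instruction sentence, the bracketed sensor headers, and the fixed sampling step are hypothetical, since the exact prompt template is not reproduced here.

```python
from datetime import datetime, timedelta


def window_to_prompt(start, step_s, channels):
    """Serialize one sliding window as comma-separated rows, one block per sensor.

    start    -- datetime of the first sample in the window
    step_s   -- sampling interval in seconds (assumed uniform within the window)
    channels -- dict mapping a sensor name to its list of readings
    """
    # The instruction sentence below is a placeholder, not the original template.
    lines = ["Diagnose the elevator fault from the following sensor readings."]
    for sensor, values in channels.items():
        lines.append(f"[{sensor}]")
        lines.append("timestamp,value")
        for i, value in enumerate(values):
            ts = start + timedelta(seconds=i * step_s)
            lines.append(f"{ts.isoformat()},{value:.4f}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = window_to_prompt(
        datetime(2024, 1, 1, 0, 0, 0),
        1.0,
        {"accelerometer_z": [0.1, 0.2], "microphone": [61.5, 62.0]},
    )
    print(demo)
```

Because every reading becomes a text row, the prompt length grows linearly with the window length and the number of channels.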
4.2. Implementation of the Context-Based Evaluation Framework
In practical applications of industrial inspection equipment, the complex field environments and variable sensor operating conditions often render single global accuracy metrics inadequate for objectively assessing a system’s true efficacy. Particularly when expanding the research perspective to personal inspection or wearable sensing domains, a system’s adaptability to complex semantic contexts becomes critical. To address this, a structured context-based evaluation framework was employed in this study, as detailed in the experimental validation section. Context is explicitly defined as a comprehensive semantic background comprising process stages, spatial environment, signal quality, and network status.
At the operational level, this framework ensures rigorous evaluation through three core dimensions. First, a context-aware preprocessing and calibration mechanism dynamically switches normalization strategies based on perceived environmental contexts. This eliminates distribution shifts caused by sensor adhesion variations or environmental fluctuations, ensuring physical comparability of signals across different inspection sites. Second, the system exhibits high contextual sensitivity. Edge models dynamically adjust the frequency of edge–cloud collaboration based on classification confidence and current network link status, prioritizing decision continuity under extreme operating conditions. Finally, for validation, we employed a strict time-based split approach to simulate context drift during actual inspection cycles, thereby testing the model’s generalization robustness against sensor aging or operational state transitions.
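A strict time-based split of the kind mentioned above can be sketched in a few lines of Python. The 7:2:1 ratio is the one reported in Section 4.1; the `(timestamp, payload)` sample layout and the function name `time_based_split` are assumptions made for illustration.

```python
def time_based_split(samples, parts=(7, 2, 1)):
    """Split (timestamp, payload) samples chronologically into train/val/test.

    Samples are sorted by timestamp first, so every training sample precedes
    every validation sample, which in turn precedes every test sample.
    Integer arithmetic is used to avoid floating-point rounding of the sizes.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    total = sum(parts)
    n = len(ordered)
    n_train = n * parts[0] // total
    n_val = n * parts[1] // total
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    data = [(t, f"window-{t}") for t in (5, 3, 9, 1, 7, 0, 2, 8, 6, 4)]
    train, val, test = time_based_split(data)
    print(len(train), len(val), len(test))
```

Unlike a random split, this ordering keeps later operating conditions out of the training data, which is what allows the test set to expose drift.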
Based on this framework, we established a multidimensional metric system to evaluate the system’s application value from various perspectives. We used edge-side inference latency and memory consumption metrics to measure the algorithm’s real-time responsiveness on portable inspection devices. Comprehensive classification performance on the time-series split test set was used to evaluate the system’s reliability when handling operational condition shifts. Additionally, evaluation tools tailored for large language models were employed to assess the practical guidance value of the system’s generated diagnostic reports for inspection personnel, focusing on factual consistency and context recall rates. Through this multidimensional mapping, system evaluation transitions from purely verifying algorithmic accuracy to comprehensively considering decision-making usability within complex industrial scenarios.
4.3. Efficient Processing of MEMS Data by the ELiTe-Transformer Edge Model
To verify the capability of the ELiTe-Transformer model to efficiently process high-frequency, high-dimensional MEMS data streams on resource-constrained edge devices, this section presents a comprehensive comparison of its performance and efficiency against various classical time-series models.
The experimental results, as detailed in Table 2, show that in terms of diagnostic accuracy, the proposed ELiTe-Transformer model achieves performance comparable to the strongest baseline, the TCN, and is significantly better than LSTM and the 1D-CNN. This demonstrates its powerful ability to model complex temporal patterns in MEMS data while being lightweight. Regarding efficiency metrics critical for edge deployment feasibility, through architectural compression and quantization, the model size is compressed to 9.8 MB, approximately 20.1% of the standard Transformer model’s size. An inference latency of 21.4 ms corresponds to roughly 46.7 single-window inferences per second (1000/21.4), which meets the real-time monitoring requirements for high-frequency MEMS sensors. For further clarification, inference latency refers to the average time taken by the model to perform one forward inference on a single sliding window sample on the Jetson Xavier NX GPU. This metric only reflects the latency of the model computation stage, excluding system-level overheads such as sensor data acquisition, sliding window segmentation, and edge–cloud communication. Although the 1D-CNN achieves lower latency due to the parallel nature of local convolutions, its accuracy lags behind. ELiTe-Transformer achieves a balance between accuracy and efficiency for high-frequency MEMS data streams, specifically addressing the conflict between the high data throughput of MEMS sensors and the limited computing power of edge devices.
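The latency definition above (average time of one forward pass on a single window, excluding acquisition and communication) can be illustrated with a generic timing harness. This is a sketch rather than the benchmarking code used in the experiments: `mean_latency_ms` and `dummy_infer` are hypothetical names, and the stand-in inference function is plain Python rather than a TensorRT engine.

```python
import time


def mean_latency_ms(infer, sample, warmup=10, runs=100):
    """Average wall-clock time of infer(sample) in milliseconds.

    Warm-up calls are excluded so that one-off costs (caches, lazy
    initialization) do not distort the average. When timing GPU inference,
    the device must also be synchronized before each clock reading
    (for example with torch.cuda.synchronize() in PyTorch), because GPU
    kernels are launched asynchronously.
    """
    for _ in range(warmup):
        infer(sample)
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


def dummy_infer(window):
    """Stand-in for a model forward pass on one sliding window."""
    return sum(x * x for x in window)


if __name__ == "__main__":
    latency = mean_latency_ms(dummy_infer, [0.1] * 256)
    print(f"mean latency: {latency:.4f} ms")
```

Only the model call sits inside the timed loop, mirroring the exclusion of system-level overheads in the reported metric.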
4.4. Ablation Study
To validate the key designs in ELiTe-Transformer that address the characteristics of MEMS sensor industrial applications, we conducted an ablation study. The results are shown in Table 3.
Compared to the standard positional encoding, the industrial positional encoding improved the F1 score by 1.2%, demonstrating that this mechanism can effectively address the challenge of key signals in MEMS data being submerged by background noise. By emphasizing weights around key time points, industrial positional encoding enhanced the model’s sensitivity for fault detection in noisy industrial environments. The linear attention mechanism significantly reduced inference latency from 185.6 ms to 21.4 ms, achieving an 8.7× speedup, which is key to enabling the model to process long MEMS sequences in real-time at the edge. The performance of models using a single window length declined, while the multi-scale window strategy achieved an F1 score 1.5% higher than the best single-scale window, verifying its ability to simultaneously capture both transient anomalies and long-term degradation trends in MEMS data. INT8 quantization reduced the model size by 75% and decreased inference latency by 68.8%, with only a 0.1% loss in accuracy, greatly enhancing the model’s deployment feasibility on resource-constrained edge hardware.
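To make the source of the linear-attention speedup concrete, the NumPy sketch below shows one common kernelized formulation, in which the keys and values are summarized once so that the cost grows linearly with sequence length instead of quadratically. It is an assumption that ELiTe-Transformer uses this particular feature map (`elu(x) + 1`); the exact variant is not specified in this section, and all function names are illustrative.

```python
import numpy as np


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed choice of kernel)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention(q, k, v):
    """Kernelized attention in O(N * d * d_v) time for sequence length N."""
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v           # (d, d_v) summary of keys and values, built once
    z = kf.sum(axis=0)      # (d,) normalizer summary
    return (qf @ kv) / (qf @ z)[:, None]


def quadratic_reference(q, k, v):
    """Same kernelized attention computed through the full N x N score matrix."""
    qf, kf = feature_map(q), feature_map(k)
    scores = qf @ kf.T      # (N, N), the term linear attention avoids
    return (scores / scores.sum(axis=1, keepdims=True)) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=(3, 16, 4))
    print(np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v)))
```

Both functions produce the same output; the linear version simply reorders the matrix products so that no N × N matrix is ever formed, which is where the savings on long MEMS sequences come from.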
4.5. The Cloud-Side Large Model Agent Bridges the Semantic Gap
To validate the value of the edge–cloud collaborative architecture in bridging the semantic gap between MEMS data and operation and maintenance decisions, we conducted tests on a complex fault subset. Based on 300 event samples of faults partially identifiable by single sensors (such as bearing wear, motor current surge, and door mechanism jamming), we designed 200 complex, multi-source fault scenarios encompassing cross-sensor correlations, subtle precursors, and multi-system coupling. All samples were manually annotated by domain experts, including primary fault type, component-level root cause, and fault onset time.
As shown in Table 4, in terms of diagnostic accuracy and root cause identification rate, Group C significantly outperforms both Group A and Group B. This demonstrates that the systematic solution (real-time preprocessing and feature extraction of raw MEMS data by the edge ELiTe-Transformer, followed by deep reasoning performed by the cloud-side agent through fusion of multimodal evidence chains and domain knowledge) can effectively overcome the limitations of single-sensor or single-model capabilities. In complex fault scenarios characterized by weak features and requiring cross-sensor correlation, the edge–cloud collaborative framework demonstrates substantial value.
As shown in Table 5, for rare faults, the proposed system (Group C) achieves significantly higher F1 scores than the other two groups. On complex faults with more subtle features and relatively scarce samples, our method is able to correlate multi-source MEMS information and conduct deep root cause analysis, successfully transforming low-semantic sensor readings into high-value diagnostic insights.
4.6. Contribution Analysis of RAG Knowledge Enhancement
To further verify how the RAG mechanism transforms MEMS physical features into understandable maintenance decisions, we evaluated its impact on the output quality of the cloud-side agent. The results are shown in Table 6.
In this study, a statement in the generated report was deemed hallucinated if it lacked clear supporting evidence in the corresponding RAG retrieval results, equipment technical manuals, or historical fault cases, or if it contradicted the conclusions annotated by field experts. The hallucination rate was calculated using a semi-automated evaluation process. First, the diagnostic report generated by the model was split into atomic statements. Each statement was then compared against the knowledge segments retrieved by RAG. If the key information in a statement could not be supported by any retrieved segment, or if it contradicted the expert-annotated conclusions, it was marked as a hallucination. The hallucination rate was defined as the proportion of statements marked as hallucinations among the total number of statements in the report.
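The hallucination criterion stated at the start of this paragraph (a statement is hallucinated when it lacks supporting evidence or contradicts the expert annotation) reduces to a simple ratio once each atomic statement has been judged. The sketch below assumes the judgments are already available as two boolean flags per statement; the field names and the function name are hypothetical, and the semi-automated judging step itself is not modeled.

```python
def hallucination_rate(statements):
    """Fraction of atomic statements flagged as hallucinated.

    Each statement is a dict with two boolean fields (hypothetical names):
      'supported'          -- some retrieved segment supports the statement
      'contradicts_expert' -- the statement contradicts the expert annotation
    A statement is flagged when it is unsupported or contradicts the expert.
    """
    if not statements:
        return 0.0
    flagged = sum(
        1 for s in statements
        if (not s["supported"]) or s["contradicts_expert"]
    )
    return flagged / len(statements)


if __name__ == "__main__":
    report = [
        {"supported": True, "contradicts_expert": False},
        {"supported": False, "contradicts_expert": False},
        {"supported": True, "contradicts_expert": True},
        {"supported": True, "contradicts_expert": False},
    ]
    print(hallucination_rate(report))
```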
We used the Ragas framework [46] to automatically evaluate the professionalism of diagnostic reports. The RAG mechanism provided the agent with precise and reliable domain knowledge anchors, improving the diagnostic professionalism score by 30.9%. Examining the component metrics, Faithfulness increased by 43.1%, and the hallucination rate was reduced by 71.4%. This indicates that by correlating the microscopic physical features detected at the edge with industry standards, technical manuals, and historical cases in the knowledge base, RAG effectively endows MEMS data with accurate physical meaning and maintenance context, bridging the semantic gap between the MEMS-based perception system and final business decisions.
5. Discussion
5.1. In-Depth Analysis of Ablation Experiments
Ablation experiments reveal the contribution of each key module to system performance. To further explore the working mechanism of each module, we conducted a hierarchical analysis of model performance across different fault types. Industrial positional encoding improves the F1 score by 1.2% over standard positional encoding. A fault-type breakdown shows this mechanism delivers the most significant gains for weak-feature faults (2.3% for early bearing wear and 1.9% for minor guide shoe wear) while improving strong-feature faults such as current anomalies by only 0.4%. This confirms that assigning higher weights to critical time points via Gaussian kernel functions enhances the model’s sensitivity to weak event-driven features submerged in background noise, demonstrating robustness to low signal-to-noise ratio conditions.
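The idea of assigning higher weights to critical time points via Gaussian kernels can be illustrated as follows. This is a hedged sketch of one possible realization, not the formulation used in ELiTe-Transformer: it assumes a standard sinusoidal encoding scaled by `1 + alpha * max_e exp(-(t - t_e)^2 / (2 * sigma^2))`, where the event steps `t_e`, the width `sigma`, the gain `alpha`, and all function names are hypothetical.

```python
import numpy as np


def sinusoidal_pe(length, d_model):
    """Standard sinusoidal positional encoding (d_model must be even)."""
    pos = np.arange(length)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


def event_weights(length, event_steps, sigma=2.0, alpha=1.0):
    """Per-step weight: 1 far from events, up to 1 + alpha at an event step."""
    centers = np.asarray(event_steps, dtype=float)
    if centers.size == 0:
        return np.ones(length)
    t = np.arange(length)[:, None]
    bumps = np.exp(-((t - centers[None, :]) ** 2) / (2.0 * sigma ** 2))
    return 1.0 + alpha * bumps.max(axis=1)


def event_weighted_pe(length, d_model, event_steps, sigma=2.0, alpha=1.0):
    """Sinusoidal encoding emphasized around the given event steps."""
    weights = event_weights(length, event_steps, sigma, alpha)
    return sinusoidal_pe(length, d_model) * weights[:, None]


if __name__ == "__main__":
    w = event_weights(10, [5])
    print(np.round(w, 3))
```

With a single event at step 5, the weight peaks there and decays smoothly on both sides, so positions near the event contribute more strongly than positions in the quiet background.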
Linear attention reduces inference latency to 1/8.7 with only a 0.2% drop in the overall F1 score. Further analysis indicates that performance loss mainly occurs in transient faults lasting less than 1 s, likely because linear attention is inferior to standard attention in modeling local fine-grained temporal relationships. Nevertheless, it maintains accuracy comparable to standard attention for most elevator faults, verifying its suitability for edge deployment scenarios.
The multi-scale window strategy increases the F1 score by 1.5% over the single 24 h window. Analysis shows the 24 h window achieves higher detection accuracy for sudden faults, while the 168 h window provides stronger early warning for progressive degradation faults. Their fusion enables the system to capture both transient anomalies and long-term trends, achieving full coverage of fault types.
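The fusion of the 24 h and 168 h windows can be sketched as below. The sketch assumes an hourly aggregated feature series and fuses the two scales by concatenating simple summary statistics; both choices, along with the function names, are assumptions for illustration, as the actual fusion operates on learned representations.

```python
import numpy as np


def trailing_window(series, end, hours):
    """Return the samples in the `hours` steps ending just before index `end`."""
    start = max(0, end - hours)
    return series[start:end]


def multi_scale_features(series, end, scales=(24, 168)):
    """Concatenate mean, standard deviation, and maximum over each window scale."""
    feats = []
    for hours in scales:
        window = trailing_window(series, end, hours)
        feats.extend([window.mean(), window.std(), window.max()])
    return np.array(feats)


if __name__ == "__main__":
    hourly = np.arange(200, dtype=float)  # synthetic hourly series
    print(multi_scale_features(hourly, end=200))
```

The short window reacts quickly to abrupt changes, while the long window smooths them out and exposes slow drift, which is why the two are complementary.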
INT8 quantization shrinks the model size by 75% with only a 0.1% accuracy loss. Error analysis reveals quantization errors concentrate on low-SNR samples, pointing to future optimization directions: retaining FP16 precision for noise-sensitive layers or adopting quantization-aware training.
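The 75% size reduction is consistent with storing each weight in 8 bits instead of 32, assuming an FP32 baseline. The NumPy sketch below illustrates symmetric per-tensor quantization under that assumption; it is not the TensorRT calibration procedure used in the experiments, and the function names are illustrative.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to int8."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=1000).astype(np.float32)
    q, scale = quantize_int8(w)
    error = np.abs(dequantize(q, scale) - w).max()
    print(f"bytes: {w.nbytes} -> {q.nbytes}, max abs error: {error:.5f}")
```

The rounding error per weight is bounded by about half the scale, so values that are small relative to the tensor's largest weight lose proportionally more precision.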
5.2. System Generalization Capability Analysis
To evaluate the system’s generalization under varying conditions, we further analyzed model performance across different fault types, noise environments, and elevator models. For common faults such as bearing wear and current anomalies, the system F1 score exceeds 92%; for complex coupled faults with limited samples, it still maintains an 89.8% root cause identification rate, suggesting that the RAG mechanism compensates for training data sparsity by retrieving similar historical cases. For noise robustness, industrial positional encoding was designed to address the signal-to-noise ratio challenges of MEMS sensors in industrial environments, thus enhancing fault detection under strong background noise. The 1.2% F1 gain in Table 3 validates the effectiveness of this mechanism on real industrial data. In addition, the proposed method is model-agnostic: the lightweight ELiTe-Transformer architecture and the RAG knowledge base can be transferred to other elevator models or rotating machinery (e.g., fans and pumps), requiring only fine-tuning of standardization parameters and expansion of the knowledge base. Preliminary cross-domain tests show that the model achieves 82.3% zero-shot diagnosis accuracy on another brand of elevators without fine-tuning, which rises to 94.1% after fine-tuning with a small number of samples.
5.3. Interpretability Analysis
The industrial positional encoding mechanism aligns the model’s attention on time series with key physical events in elevator operation through a Gaussian weight function, giving the model’s enhanced sensitivity to fault signals clear physical meaning. The multi-task output head directly maps complex sensor data to three engineering-semantic dimensions: fault confidence, fault type and severity, completing the structured translation from data to preliminary diagnostic information. The retrieval-augmented generation mechanism takes low-level features extracted at the edge as retrieval cues to match each feature with corresponding domain knowledge annotations from the knowledge base consisting of industry standards and technical manuals.
The final comprehensive diagnostic report integrates all the above interpretive information, which explicitly includes evidence (edge feature chain), basis (knowledge retrieved via RAG) and recommendations (maintenance solutions).
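The three-part report structure described above (evidence, basis, recommendations), together with the three outputs of the multi-task head, can be represented with plain data classes. The sketch below is illustrative only: the class names, field names, and the example strings are hypothetical and do not reproduce the system's report template or any real knowledge-base content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EdgeOutput:
    fault_confidence: float  # confidence from the multi-task head
    fault_type: str          # predicted fault type
    severity: str            # predicted severity level


@dataclass
class DiagnosticReport:
    edge: EdgeOutput
    evidence: List[str] = field(default_factory=list)         # edge feature chain
    basis: List[str] = field(default_factory=list)            # knowledge retrieved via RAG
    recommendations: List[str] = field(default_factory=list)  # maintenance solutions

    def render(self):
        """Render the report as plain text with one section per component."""
        lines = [
            f"Fault: {self.edge.fault_type} "
            f"(confidence {self.edge.fault_confidence:.2f}, severity {self.edge.severity})"
        ]
        for title, items in (
            ("Evidence", self.evidence),
            ("Basis", self.basis),
            ("Recommendations", self.recommendations),
        ):
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


if __name__ == "__main__":
    report = DiagnosticReport(
        EdgeOutput(0.93, "bearing wear", "moderate"),
        evidence=["envelope peak 0.15 gE (threshold 0.12 gE)"],
        basis=["placeholder excerpt from a retrieved manual section"],
        recommendations=["placeholder maintenance action"],
    )
    print(report.render())
```

Keeping evidence, basis, and recommendations as separate fields makes each conclusion traceable to the feature or document that motivated it.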
5.4. Limitation Analysis
Despite the satisfactory performance, the system still has several limitations. First, the validation of model generalization is limited. Elevators from different manufacturers, production years, and load capacities differ considerably in vibration transmission characteristics and fault feature distributions, requiring further large-scale cross-site and cross-model validation. Second, the current system relies on offline training and cannot adapt online to sensor aging, environmental drift, or the emergence of new fault modes. Although the RAG knowledge base supports dynamic updates, the edge model itself lacks continual learning capability; online or incremental learning mechanisms will be introduced in future work. Third, the retrieval performance of the RAG module depends heavily on the quality and coverage of the knowledge base. The knowledge base constructed in this study is mainly based on public standards, technical manuals, and historical work orders, and its support for rare or new faults needs verification. Fourth, the stability of the edge–cloud collaboration mechanism in weak-network environments has not been fully tested. When the network is interrupted, cloud analysis functions degrade, and the system can only rely on preliminary edge diagnosis results, which may affect the accurate identification of complex faults.