3.1. Bibliometric Trends and Temporal Evolution
The systematic screening process yielded a final corpus of 89 studies (64 peer-reviewed from WoS/Scopus [19–82] and 25 arXiv preprints [83–107]) spanning 2007 to early 2026. The temporal distribution reveals a marked acceleration in publication activity, particularly pronounced after 2019. The early phase (2007–2018) is characterized by sparse publication activity (n = 10, 15.6% of the peer-reviewed corpus) and reliance on classical machine learning methods such as Support Vector Machines (SVMs) and Random Forests. Liu et al.’s foundational 2007 study [19] exemplifies this era, proposing coincidence matrices for performance evaluation, a method that is effective for linear degradation but limited in high-dimensional feature spaces.
A paradigm shift becomes evident beginning in 2020. Over 42% of the peer-reviewed studies (n = 27) were published in 2020–2021 alone, coinciding with the maturation of Industry 4.0 concepts and the widespread adoption of open-source deep learning frameworks. This surge correlates with three enabling factors: (1) availability of large-scale public benchmark datasets (C-MAPSS, IMS Bearing, PRONOSTIA); (2) democratization of GPU computing and cloud-based training infrastructure; and (3) proliferation of IoT sensor networks providing rich time-series data [20, 113, 114].
The most recent literature (2023–2026, n = 17, 26.6%) reflects the current frontier: integration of signal processing with end-to-end learning, multi-modal sensor fusion, and early exploration of Edge-AI deployment. A 2026 study on rotating machinery exemplifies this trend, combining stochastic resonance feedback with Principal Component Analysis (PCA) and enhanced Gini coefficients for early fault detection [20]. This evolution suggests the engineering community has transitioned from viewing AI as an auxiliary tool to recognizing it as central to infrastructure reliability analysis in complex systems.
Figure 2 presents the temporal distribution of included publications, clearly illustrating acceleration in research activity. Concentration of publications in recent years indicates both growing industrial interest and academic recognition of predictive maintenance as a critical AI application domain.
Table 3 provides a comprehensive overview of the temporal and structural characteristics of the analyzed corpus. The 64 peer-reviewed studies span nearly two decades, from 2007 to 2026, reflecting the sustained and evolving interest in predictive maintenance research. The presence of 45 distinct publication venues indicates a high degree of dispersion, suggesting that the field is interdisciplinary and not confined to a limited set of journals. Notably, all studies include a DOI, ensuring traceability and reproducibility of the review process. The distribution of publications peaks in 2020 and 2021, with 14 and 13 studies respectively, highlighting a period of intensified research activity, likely driven by the consolidation of Industry 4.0 and AI-based maintenance approaches.
To complement the peer-reviewed corpus profile (Table 3), Table 3b presents a descriptive overview of the 25 arXiv preprints incorporated as supplementary gray literature [83–107]. In contrast to the peer-reviewed corpus, which spans nearly two decades (2007–2026), the preprint cohort is concentrated within a 26-month window (January 2024 to March 2026), providing a focused cross-section of the current research frontier. Thematic coverage reflects emerging priorities that are largely absent from the peer-reviewed corpus: Explainable AI constitutes the largest single theme (n = 5, 20.0%), followed by RUL forecasting (n = 4, 16.0%) and three co-equal themes at n = 3 each: foundation models, sensor fusion, and Edge-AI/TinyML. This distribution contrasts notably with the peer-reviewed corpus, in which RUL forecasting and fault detection/diagnosis dominate, and dedicated XAI or foundation model papers are sparsely represented.
The most structurally significant divergence between the two corpora lies in deployment orientation. The arXiv cohort achieves a mean Deployment Readiness Score (DRS) of 1.72—compared with an estimated mean of approximately 0.63 for the peer-reviewed corpus—a nearly three-fold difference. Furthermore, only 4.0% of preprints score DRS = 0, compared with 60.9% of peer-reviewed studies. Conversely, 16.0% of preprints simultaneously address all three deployment dimensions (Edge-AI implementation, real-time inference, and XAI integration), achieving DRS = 3—a proportion that is more than triple the 4.7% reported for the peer-reviewed corpus and achieved within a dramatically shorter two-year timeframe. These patterns indicate that, within the 2024–2026 publication window, the research community has begun to address the deployment gap identified in the earlier peer-reviewed literature, with preprints serving as an early signal of a structural shift in research priorities that has not yet propagated into the indexed database record.
Table 4 identifies the most representative publication outlets within the corpus, revealing both concentration and diversity in dissemination channels. The International Journal of Advanced Manufacturing Technology leads with six studies, followed by the Journal of Intelligent Manufacturing with four contributions, positioning these journals as central platforms for predictive maintenance research. Several venues, including Journal of Manufacturing Systems, Engineering Applications of Artificial Intelligence, and IEEE Transactions on Industrial Informatics, contribute two studies each, reflecting their relevance in bridging manufacturing and AI domains. The presence of journals such as Scientific Reports and EURASIP Journal on Audio, Speech, and Music Processing illustrates the methodological breadth of the field, encompassing applications of signal processing and interdisciplinary analytical approaches.
The accuracy statistics reported individually by each study (e.g., RMSE, F1-score, precision) are not quantitatively synthesized in this review, as the heterogeneity of performance metrics, reference datasets, and assessment conditions precludes meaningful direct comparisons. Instead, the distribution patterns by thematic cluster and maintenance task are presented in the subsequent tables of this section.
3.2. Taxonomy of Predictive Maintenance Tasks (RQ1)
Through unsupervised thematic clustering using TF-IDF features and k-means (k = 5), we identified five distinct research frontiers within the corpus. Each cluster represents a coherent body of work addressing specific predictive maintenance challenges:
Cluster 0: General PdM and industrial AI (n = 18, 28.1%) encompasses broad machine learning applications across diverse industrial contexts. This cluster exhibits the highest heterogeneity, including studies on multiple equipment types and mixed methodological approaches. Representative terms include “data,” “learning,” “industrial,” “predictive maintenance,” and “Industry 4.0.” This cluster serves as a bridge between specialized domains, often proposing generalizable frameworks applicable across manufacturing sectors [21–38].
Cluster 1: RUL and degradation forecasting (n = 12, 18.8%) forms the most cohesive thematic group, exclusively focused on remaining useful life estimation. The 12 studies in this cluster address RUL prediction for components such as lithium-ion batteries, aero-engines, and bearings. The distinctive challenge is modeling nonlinear degradation trajectories where capacity or performance degrades gradually until critical failure. High-weight terms (“RUL,” “prediction,” “remaining,” “useful life,” “uncertainty,” and “degradation”) reflect emphasis on probabilistic forecasting and uncertainty quantification [10, 39–49, 115, 116]. The prevalence of LSTM and GRU architectures in this cluster is consistent with the need for long-term temporal memory in degradation modeling.
Cluster 2: Tool wear and machining tool condition monitoring (n = 16, 25.0%) concentrates on subtractive manufacturing processes, particularly CNC machining, milling, and drilling. Tool condition monitoring (TCM) in this context addresses rapid stochastic wear where cutting-edge degradation directly impacts surface quality and dimensional accuracy. Research prioritizes low-latency detection to prevent catastrophic tool breakage during operations. Vibration signatures, cutting-force signals, and acoustic emissions serve as primary data modalities [50–65]. Industrial stakes are high: premature tool replacement increases costs, while delayed replacement results in scrapped parts and potential machine damage.
Cluster 3: Sensors/measurements and niche applications (n = 4, 6.3%) represents specialized applications including acoustic virtual sensors, building energy management, and dimension-based monitoring. Despite the small size, this cluster demonstrates methodological diversity, applying techniques such as Non-negative Matrix Factorization (NMF) for acoustic pattern separation [21]. This cluster’s presence highlights emerging opportunities for AI-driven predictive maintenance beyond traditional rotating machinery.
Cluster 4: Fault detection/diagnosis and time-series (n = 14, 21.9%) represents a core application area, smaller than Clusters 0 and 2 but focused on discrete fault classification tasks. Studies distinguish between specific failure modes (e.g., inner-race vs. outer-race bearing failure) and fault severity levels. Recent approaches pursue fine-grained diagnostics, identifying not merely fault presence but characterizing fault progression under variable operating speeds. CNN architectures dominate this cluster, leveraging convolutional operators to extract spatial features from time–frequency representations (spectrograms, wavelet transforms) [20, 66–78].
Table 5 summarizes cluster characteristics, demonstrating clear specialization within the predictive maintenance research landscape. This taxonomy reveals that while foundational tasks (fault detection, RUL prediction) attract sustained attention, emerging areas such as multi-modal fusion and edge deployment remain underexplored.
Figure 3 presents the two-dimensional PCA projection of the TF-IDF feature space, offering a visual validation of the k-means clustering structure (k = 5) and the thematic differentiation described previously. The spatial distribution of points reveals a clear separation between several clusters, particularly Cluster 1 (RUL and degradation forecasting), which appears as a compact and well-defined group in the upper-left region, confirming its high thematic cohesion. Cluster 2 (machining tool wear and TCM) is distinctly located on the right-hand side of the plot, indicating a specialized vocabulary and strong semantic consistency associated with manufacturing processes. In contrast, Cluster 0 (general PdM and industrial AI) is more dispersed around the central region, reflecting its heterogeneous nature and its role as a bridging domain across multiple applications. Clusters 3 and 4 occupy intermediate and partially overlapping positions near the center, suggesting some shared terminology related to sensors, measurements, and fault diagnostics, although still maintaining identifiable groupings. This overall configuration supports the robustness of the clustering approach, while also highlighting varying degrees of thematic cohesion and overlap across predictive maintenance research streams.
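The TF-IDF and k-means steps described in this section can be illustrated with a minimal, standard-library sketch. It is not the pipeline used in the review: the four tokenized "abstracts" are hypothetical, the smoothed IDF variant and the farthest-first centroid initialization are assumptions made for determinism, and k = 2 is used only because the toy corpus is tiny (the review uses k = 5). The PCA projection is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF row vectors) for tokenised docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # Next centroid: the row farthest from all centroids chosen so far.
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every row.
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical tokenised abstracts: two on RUL, two on tool wear.
docs = [
    ["rul", "degradation", "battery"],
    ["rul", "degradation", "engine"],
    ["tool", "wear", "milling"],
    ["tool", "wear", "drilling"],
]
vocab, rows = tfidf_vectors(docs)
labels = kmeans(rows, k=2)
```

On this toy input the two RUL documents share one label and the two tool-wear documents share the other, which mirrors how shared vocabulary drives the cluster cohesion discussed above.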
3.4. Algorithmic Dominance in Non-Stationary Environments (RQ1)
Cross-tabulation of maintenance tasks against AI method families reveals decisive dominance of deep learning, displacing traditional machine learning in complex applications. Three architectural families emerge as dominant:
Convolutional Neural Networks (CNNs) constitute the standard for fault diagnosis tasks. By transforming raw 1D sensor signals into 2D time–frequency images (via Short-Time Fourier Transform, Continuous Wavelet Transform, or Mel-spectrograms), CNNs extract spatial features that are more robust to speed fluctuations and load variations. This architectural choice proves particularly effective in Cluster 4 applications, where identifying visual patterns in spectrograms yields superior accuracy compared to hand-crafted statistical features (kurtosis, RMS, spectral kurtosis) [20, 66–78].
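The 1D-to-2D transformation mentioned above can be sketched with a naive Short-Time Fourier Transform written against the standard library only. The frame length, hop size, Hann window, and the synthetic sine "vibration" signal are illustrative assumptions; a production pipeline would normally use an FFT routine from a numerical library rather than this O(N²) loop.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes (bins 0..frame_len//2)."""
    # Hann window reduces spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of the windowed frame at frequency bin k.
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Synthetic tone with 8 cycles per 64-sample frame (hypothetical test signal).
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
image = stft_magnitude(tone)   # rows = time frames, columns = frequency bins
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in image]
```

The resulting `image` is the kind of time–frequency array a 2D CNN would consume; for this stationary tone every frame peaks at the same frequency bin, whereas a developing fault would show up as energy shifting across bins over time.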
Recurrent Neural Networks (RNNs/LSTMs) dominate RUL forecasting (Cluster 1). Their ability to retain long-term memory enables modeling of degradation trajectories exhibiting path-dependent aging. For lithium-ion batteries, capacity fade depends on historical charge–discharge patterns, temperature exposure, and discharge depth, complex interactions that LSTM hidden states capture effectively. The 2025 adaptive dual-distillation framework exemplifies current sophistication, transferring knowledge from large LSTM teacher models to lightweight GRU student models for edge deployment [10, 39–49, 115, 116].
Hybrid architectures (CNN-LSTM) represent the emergent frontier in the 2025–2026 literature. These architectures resolve the “feature-temporal” dilemma: CNNs excel at extracting spatial features from multi-channel sensor data, while LSTMs model temporal evolution of these features. Hybrid models apply CNN layers for automatic feature engineering, followed by LSTM layers for sequence modeling. This end-to-end learning paradigm eliminates manual feature engineering while maintaining interpretability of intermediate CNN activations [20–28].
The 2024–2026 literature extends this taxonomy with three architectural categories absent from earlier reviews. Foundation models pre-trained on large heterogeneous corpora are entering the field: UniFault [83] achieves few-shot fault diagnosis across unseen datasets after pre-training on over nine billion vibration samples, and BearLLM [84] applies a multi-modal language model backbone to nine bearing health benchmarks within a single unified framework. Selective State Space Models (SSMs), specifically the Mamba variant, process sequences in linear rather than quadratic time, a property that directly benefits resource-constrained edge deployment; MambaLithium [85] reports superior battery RUL, SOH, and SOC estimation relative to LSTM and transformer baselines at lower computational cost. Graph Neural Networks (GNNs) model inter-sensor spatial dependencies that sequential architectures ignore; a recent survey [86] provides a reproducible benchmark confirming consistent accuracy gains on multi-component RUL tasks. Cross-domain adaptation with fewer than 1% of target-domain labels has been demonstrated through parameter-efficient fine-tuning strategies [87], partially addressing the data-scarcity barrier identified throughout this review.
Traditional machine learning (SVM, Random Forest, k-NN) persists primarily in studies addressing computational constraints or interpretability requirements. These methods offer faster training, lower inference latency, and inherent explainability, advantages that remain relevant for edge deployment scenarios with limited resources [108, 109, 113, 114].
Figure 4 presents the evidence matrix (Task × Method Family), with cell intensity indicating study frequency. The heatmap confirms DL saturation in RUL and fault diagnosis, while exposing underexplored combinations (e.g., generative models for synthetic fault data augmentation, Reinforcement Learning for adaptive maintenance scheduling).
Table 6 synthesizes the distribution of predictive maintenance tasks across the corpus, highlighting clear imbalances in research focus, methodological preferences, and validation rigor. RUL forecasting dominates the landscape with 42.2% of studies; these consistently rely on temporal signals such as vibration and electrical data and use sequential deep learning models such as LSTM and hybrid architectures, typically validated through temporal splits or cross-validation. Tool condition monitoring also represents a substantial share (25.0%), characterized by real-time evaluation settings and multi-sensor inputs. In contrast, tasks such as anomaly detection and failure prediction remain underrepresented, despite their practical relevance. A critical pattern emerges in validation strategies, where the predominance of k-fold cross-validation and limited use of realistic deployment scenarios suggests a potential gap between experimental performance and real-world applicability, motivating the need for more rigorous and standardized evaluation frameworks.
3.5. The Validation Crisis: Rigor Analysis (RQ2)
Addressing RQ2, Figure 5 presents validation scheme evaluation, exposing a critical methodological gap. Ideally, models destined for non-stationary environments should employ Tier 3 protocols: temporal splits respecting chronological order, or cross-domain validation on completely external datasets reflecting distribution shift. However, our analysis reveals that 34.4% of studies (n = 22) fall into Tier 0 (unclear), meaning that the validation methodology is not explicitly reported or is ambiguously described in the manuscript text.
Among studies that do specify a validation scheme, Tier 1 (simple random split) accounts for 31.2% of the peer-reviewed corpus (n = 20), Tier 2 (k-fold cross-validation) for 23.4% (n = 15), and only 10.9% (n = 7) achieve Tier 3 rigor. This distribution indicates systematic underreporting and methodological weakness. Random data splitting, with or without k-fold repetition, introduces temporal leakage in time-series contexts. Models trained on randomly sampled points from the same operational cycle inevitably learn cycle-specific background noise, sensor biases, and equipment signatures rather than generalizable fault patterns [14, 113, 114, 117].
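The leakage mechanism can be made concrete with a small standard-library sketch. It assumes a hypothetical series of 500 time steps turned into lag-window samples (20 inputs plus one target each) and simply counts how many test samples share raw time steps with the training set under a random split versus a chronological split with a gap at the boundary. The sizes, the 80/20 ratio, and the seed are illustrative choices, not values taken from the reviewed studies.

```python
import random


def make_windows(n_points, lag):
    """Each sample i covers time indices i..i+lag (lag inputs plus one target)."""
    return [set(range(i, i + lag + 1)) for i in range(n_points - lag)]


def leaked_fraction(train_idx, test_idx, windows):
    """Share of test samples whose time span overlaps at least one training sample."""
    train_times = set().union(*(windows[i] for i in train_idx))
    leaked = sum(1 for i in test_idx if windows[i] & train_times)
    return leaked / len(test_idx)


n_points, lag = 500, 20
windows = make_windows(n_points, lag)
n = len(windows)
ids = list(range(n))
cut = int(0.8 * n)

# Random 80/20 split that ignores time order (Tier 1 style).
rng = random.Random(42)
shuffled = ids[:]
rng.shuffle(shuffled)
rand_train, rand_test = shuffled[:cut], shuffled[cut:]

# Chronological split with a gap of `lag` samples at the boundary (Tier 3 style).
chron_train, chron_test = ids[:cut], ids[cut + lag:]

random_leak = leaked_fraction(rand_train, rand_test, windows)
chrono_leak = leaked_fraction(chron_train, chron_test, windows)
```

Under the random split nearly every test window overlaps training data, while the gapped chronological split yields no overlap at all; repeating the random split k times does not change this.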
The arXiv preprint cohort reinforces the validation concerns documented above. Of the 25 preprints, eight (32.0%) were classified as exhibiting high relevance to RQ2 (temporal validation rigor), with an additional 13 (52.0%) demonstrating medium relevance, so that a combined 84.0% of the gray literature corpus engages substantively with validation methodology. This concentration stands in marked contrast to the peer-reviewed corpus, where only 10.9% (n = 7) achieve Tier 3 rigor and 34.4% remain in the Tier 0 (unclear) category. Four thematic categories within the preprint cohort are directly responsive to the validation weaknesses identified in this review: studies quantifying the magnitude of data leakage in temporal prediction tasks [88]; comparative evaluation of walk-forward and sliding-window temporal cross-validation schemes [89]; model-agnostic concept drift-detection approaches requiring substantially fewer labels than prior methods [90]; and online continual learning frameworks for non-stationary time-series [91]. Taken together, these preprints signal that the research community has recognized the validation crisis and is actively developing targeted methodological solutions, responses that have not yet permeated the peer-reviewed corpus as of the January 2026 search cutoff.
Quantitative evidence published in 2024–2025 directly corroborates the validation weaknesses identified in this corpus. Albelali and Ahmed [88] measure how data leakage inflates LSTM performance across partitioning strategies, finding RMSE degradation of up to 20.5% in 10-fold CV when lag windows span the split boundary, while two-way and three-way chronological splits hold bias below 5%. Hespeler et al. [89] at Oak Ridge National Laboratory compare walk-forward and sliding-window temporal CV on multivariate anomaly detection tasks, observing that sliding-window schemes produce higher median AUC-PR and lower inter-fold variance across deep learning architectures. Both studies provide the empirical grounding that the Tier 0/Tier 1 distribution (65.6% of this corpus) has so far lacked. For deployment in non-stationary environments, static train/test splits are insufficient by design; CDSeer [90] addresses this by detecting when a model’s operating distribution has shifted enough to require retraining, doing so with 99% fewer labels than prior drift-detection methods. NatSR [91] takes a complementary approach, framing time-series forecasting as an online continual learning problem where model parameters update as new operational data arrive.
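The two temporal cross-validation schemes named above can be sketched as index generators. The definitions used here are one common reading (walk-forward as an expanding training window, sliding-window as a fixed-length training window) and are an assumption; the cited comparison may define the schemes differently. Function names and fold sizes are hypothetical.

```python
def walk_forward_splits(n_samples, n_folds, test_size):
    """Expanding training window: each fold trains on all data before its test block."""
    first_test = n_samples - n_folds * test_size
    if first_test <= 0:
        raise ValueError("not enough samples for the requested folds")
    for f in range(n_folds):
        start = first_test + f * test_size
        yield list(range(0, start)), list(range(start, start + test_size))


def sliding_window_splits(n_samples, n_folds, train_size, test_size):
    """Fixed-length training window that slides forward with each fold."""
    first_test = n_samples - n_folds * test_size
    if first_test < train_size:
        raise ValueError("training window does not fit before the first test block")
    for f in range(n_folds):
        start = first_test + f * test_size
        yield list(range(start - train_size, start)), list(range(start, start + test_size))


# Hypothetical series of 100 samples, three folds of 10 test samples each.
wf = list(walk_forward_splits(100, n_folds=3, test_size=10))
sw = list(sliding_window_splits(100, n_folds=3, train_size=50, test_size=10))
```

In both schemes every training index precedes every test index within a fold, which is the property that random or k-fold partitioning fails to guarantee. The sliding variant additionally discards the oldest data, which can matter when the operating regime drifts.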
Consequences manifest as optimistic performance bias. Many DL models report accuracy above 99% on test sets drawn from the same equipment instance and operational period as training data. When deployed on different equipment or under altered conditions, these models exhibit catastrophic performance degradation, the so-called “lab-to-factory gap” [108, 109, 113, 114]. For example, a bearing fault classifier trained and tested on the IMS bearing dataset (distributed through NASA’s prognostics data repository) may achieve 98% accuracy but fail completely on industrial bearings operating under different speeds, loads, or lubrication regimes.
Heavy reliance on synthetic or laboratory benchmark datasets (C-MAPSS, IMS Bearing, PRONOSTIA) compounds validation weaknesses. While valuable for algorithmic comparison and baseline establishment, these benchmarks lack: (1) stochastic environmental noise (electromagnetic interference, temperature fluctuations); (2) sensor failures and missing values; (3) operational regime changes (speed variations, load transients); (4) simultaneous multiple faults; (5) long-term sensor calibration drift [14, 108, 109, 113, 114, 117].
Only a minority of studies explicitly employ Leave-One-Group-Out (LOGO) cross-validation (training on N-1 equipment instances and testing on the withheld instance) or validate on completely external industrial datasets. These approaches, while methodologically rigorous, demand larger data collection efforts and longer experimental campaigns, creating practical barriers to academic publication [108, 109, 113, 114].
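A minimal sketch of Leave-One-Group-Out splitting, assuming each recording carries an equipment identifier, is shown below. The bearing names and the six-recording layout are hypothetical; libraries such as scikit-learn provide an equivalent splitter, but the logic is simple enough to state directly.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), holding out one equipment instance per fold."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical example: six recordings from three bearings.
groups = ["bearing_A", "bearing_A", "bearing_B", "bearing_B", "bearing_C", "bearing_C"]
folds = list(leave_one_group_out(groups))
```

Because no recording from the held-out machine ever appears in training, the test score reflects generalization to unseen equipment rather than memorized equipment signatures.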
Figure 5 (heatmap of Task × Validation Tier) highlights polarization between high-rigor evidence and Tier 0 reporting opacity across all task categories. This finding motivates our call for standardized validation-reporting requirements and tier-based evidence synthesis in future reviews.
Formal sensitivity analyses were not performed, as this was a narrative synthesis without underlying meta-analysis. The robustness of the Validation-Level distribution was qualitatively verified by independent reclassification of a random sample of 10% of the studies by D.A.G.A., obtaining complete agreement with the original assignments.
A clear imbalance emerges in the use of benchmark datasets across predictive maintenance tasks, reflecting differing levels of standardization and methodological maturity. RUL forecasting and fault diagnosis rely heavily on a limited set of widely adopted datasets, such as C-MAPSS, IMS Bearing, and CWRU, with usage rates exceeding 70%, which facilitates comparability but may restrict generalizability. In contrast, tool monitoring and condition monitoring exhibit greater diversity by combining public benchmarks with proprietary industrial datasets, indicating a closer alignment with real-world applications. Failure prediction remains the least standardized, as it depends entirely on custom and synthetic datasets, limiting reproducibility. These patterns are systematically summarized in Table 7, highlighting a structural trade-off between consistency and applicability in the field.
3.6. Deployment Readiness: Edge-AI, Real-Time Inference, and Explainability (RQ3)
Deployment readiness assessment reveals substantial maturity gaps between algorithmic development and industrial operationalization. Applying our three-indicator scoring framework (Edge-AI implementation, real-time inference reporting, XAI integration), 60.9% of studies (n = 39) score 0—providing no deployment consideration evidence. Only 4.7% (n = 3) achieve the maximum score of three, demonstrating simultaneous attention to edge constraints, latency requirements, and explainability.
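The scoring logic can be sketched as follows, assuming the Deployment Readiness Score is an unweighted count of the three binary indicators (an assumption consistent with the 0 to 3 range reported here; the formal definition is given in the methods section). The `Study` class and the four example records are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Study:
    edge_ai: bool      # reports edge deployment or resource-constrained optimisation
    real_time: bool    # reports inference latency, throughput, or real-time validation
    xai: bool          # integrates an explainability mechanism

    @property
    def drs(self) -> int:
        """Deployment Readiness Score: number of indicators satisfied (0 to 3)."""
        return int(self.edge_ai) + int(self.real_time) + int(self.xai)


def summarise(studies):
    """Return the mean DRS and the share of studies at each score level."""
    scores = [s.drs for s in studies]
    dist = Counter(scores)
    return {
        "mean": sum(scores) / len(scores),
        "share_by_score": {k: dist.get(k, 0) / len(scores) for k in range(4)},
    }


# Hypothetical mini-corpus with one study at each score level.
corpus = [
    Study(False, False, False),
    Study(True, False, False),
    Study(True, True, False),
    Study(True, True, True),
]
summary = summarise(corpus)
```

Applied to the coded corpus, the same summary would yield the share of studies at DRS = 0 and DRS = 3 quoted in this section.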
Edge-AI Adoption: Merely 18.8% of studies (n = 12) explicitly report edge device deployment or discuss computational optimization for resource-constrained environments. These studies employ techniques such as model compression (pruning, quantization), knowledge distillation, or lightweight architecture design (MobileNet variants, SqueezeNet) [10, 20, 21, 108, 109, 113–116]. The 2025 adaptive dual-distillation framework exemplifies best practices, achieving 5.34× compression (83% parameter reduction) while maintaining predictive accuracy [10]. Edge deployment enables local processing, reduces cloud dependency, minimizes bandwidth consumption, and achieves sub-100 ms latency critical for real-time interventions [108, 109, 113, 114, 118–120].
However, most of the literature proposes architectures incompatible with edge hardware constraints. Deep models with millions of parameters requiring GPU acceleration and substantial memory cannot run on typical industrial edge devices (ARM Cortex microcontrollers, FPGAs, entry-level AI accelerators such as NVIDIA Jetson or Google Coral). This disconnect reflects academic focus on maximizing accuracy rather than optimizing the precision–efficiency Pareto frontier [17, 108, 111, 112, 119, 120].
Aggregate analysis of the arXiv preprint cohort reveals a markedly accelerated deployment orientation relative to the peer-reviewed corpus. The 25 preprints achieve a mean Deployment Readiness Score of 1.72, compared with an estimated mean of approximately 0.63 for the 64 peer-reviewed studies, a difference of about 1.09 DRS points, or roughly 36% of the 0–3 scale. The proportion of studies with DRS = 0 (no deployment-relevant content) collapses from 60.9% in the peer-reviewed corpus to 4.0% among preprints; conversely, the proportion achieving DRS = 3 increases from 4.7% to 16.0%. Three thematic clusters concentrate this deployment maturity: (1) Edge-AI/TinyML preprints (n = 3, all DRS = 3) reporting end-to-end hardware validation on microcontrollers and FPGAs [92–94]; (2) XAI preprints (n = 5), of which four achieve DRS ≥ 2, including neuro-symbolic deployments validated on live transit infrastructure [97–99]; and (3) sensor fusion preprints (n = 3) addressing multi-modal integration under real-world industrial conditions [106]. These findings indicate that the deployment gap documented in the peer-reviewed corpus is actively narrowing in current research output, with the gray literature providing an early, and systematically more deployment-mature, view of the field’s current trajectory.
Concrete hardware deployments published in 2025 bound the feasible operating region for edge-compatible PdM models. Langer et al. [92] report end-to-end validation of an 8-bit quantized CNN on an ARM Cortex-M4F microcontroller: 100% diagnostic accuracy on a milling dataset, 15.4 ms per inference, and 1.462 mJ per decision, with a total parameter footprint of 12.59 kiB. BearingPGA-Net [93] demonstrates FPGA deployment of a knowledge-distilled bearing fault classifier, reporting more than 200× throughput improvement over CPU execution with less than 0.4% accuracy loss relative to the full teacher model. A systematic survey of quantization methods for microcontrollers [94] covers ARM Cortex-M, RISC-V, and dedicated neural accelerator platforms, cataloging the trade-offs between bit-width reduction and task accuracy across manufacturing-relevant benchmarks. Taken together, these results define reference thresholds (sub-16 ms latency, sub-2 mJ per inference, sub-13 kiB storage) that can be used as minimum acceptance criteria within the Deployment Readiness Score proposed in Section 2.4.
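The reference thresholds above can be expressed as a simple acceptance check. This is only a sketch: the `EdgeBudget` class and `meets_budget` function are hypothetical names, and treating the three limits as strict upper bounds that must all hold is an assumption about how such criteria might be applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeBudget:
    max_latency_ms: float = 16.0    # "sub-16 ms latency"
    max_energy_mj: float = 2.0      # "sub-2 mJ per inference"
    max_storage_kib: float = 13.0   # "sub-13 kiB storage"


def meets_budget(latency_ms, energy_mj, storage_kib, budget=EdgeBudget()):
    """Return per-criterion pass/fail flags plus an overall verdict."""
    checks = {
        "latency": latency_ms < budget.max_latency_ms,
        "energy": energy_mj < budget.max_energy_mj,
        "storage": storage_kib < budget.max_storage_kib,
    }
    checks["all"] = all(checks.values())
    return checks


# Figures reported for the quantized CNN on the Cortex-M4F (15.4 ms, 1.462 mJ, 12.59 kiB).
reported = meets_budget(15.4, 1.462, 12.59)
# Hypothetical heavier model that misses the latency limit only.
too_slow = meets_budget(40.0, 1.0, 10.0)
```

Reporting the individual flags rather than a single verdict makes it clear which resource dimension blocks deployment for a given model.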
Real-Time Inference Capability: Only 26.6% of studies (n = 17) report inference latency or throughput, or demonstrate explicit real-time operational validation. Manufacturing process control operates on millisecond timescales. CNC tool wear progression occurs in seconds; bearing failures develop over minutes to hours; intervention windows may span mere seconds between anomaly detection and catastrophic failure [17, 108, 111, 112, 119, 120]. Models whose inference latency exceeds these windows provide no actionable value, regardless of accuracy. Yet computational performance remains uncharacterized in the remaining 73.4% of studies.
Real-time systems demand worst-case latency predictability, not merely average performance. Runtime variability (caused by OS scheduling, garbage collection, or thermal throttling) may render models unusable even if average latency meets requirements [17, 111, 112, 119, 120]. Edge deployment mitigates some latency-variability sources (eliminating network communication delays, cloud service queues) while introducing others (concurrent-process resource contention, processor thermal throttling) [108, 109, 113, 114, 118–120].
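The difference between average and worst-case latency can be illustrated with a small profiling sketch. The nearest-rank percentile, the `latency_profile` helper, and the synthetic timing sample (99 calls at 10 ms and one stall at 250 ms) are illustrative assumptions; wall-clock timing on a desktop OS is itself only a rough proxy for behavior on target hardware.

```python
import math
import statistics
import time


def percentile(samples, q):
    """Nearest-rank percentile (integer q in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_profile(predict, inputs, deadline_ms):
    """Time each call to `predict` and summarise mean, tail, and deadline misses."""
    latencies = []
    for x in inputs:
        t0 = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    return {
        "mean_ms": statistics.fmean(latencies),
        "p99_ms": percentile(latencies, 99),
        "max_ms": max(latencies),
        "deadline_misses": sum(lat > deadline_ms for lat in latencies),
    }


# Synthetic timings: a single stall barely moves the mean but dominates the maximum.
synthetic = [10.0] * 99 + [250.0]
mean_ms = statistics.fmean(synthetic)
worst_ms = percentile(synthetic, 100)
```

With a 100 ms deadline this synthetic model looks comfortable on average yet misses the deadline once per hundred calls, which is why reporting tail or maximum latency matters for real-time claims.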
Explainable AI (XAI) Integration: Most concerning, only 15.6% of studies (n = 10) integrate explainability mechanisms. The remaining 84.4% treat models as black boxes, providing predictions without interpretable justification. This opacity presents insurmountable barriers in regulated industries (aerospace AS9100, automotive IATF 16949, pharmaceutical cGMP) where certification authorities demand decision-making transparency [16, 109–112, 118].
XAI techniques applicable to predictive maintenance include: (1) attention visualization revealing which temporal windows or sensor channels drive predictions; (2) SHAP (SHapley Additive exPlanations) attributing prediction contributions to individual features; (3) LIME (Local Interpretable Model-agnostic Explanations) approximating local decision boundaries; (4) concept activation vectors identifying human-comprehensible concepts learned by networks; (5) rule extraction from trained models generating IF-THEN logic comprehensible to operators [16, 109–112, 114, 118].
Siemens’ technical report on industrial XAI emphasizes that explainability is essential across the AI lifecycle, from business-case development to model monitoring and maintenance [110]. Explainability facilitates: (1) confidence calibration enabling operators to develop appropriate reliance on AI recommendations; (2) fault diagnosis enabling identification of model weaknesses or data-quality issues; (3) regulatory compliance meeting transparency and human oversight mandates; (4) continuous improvement through collaborative human–AI refinement; (5) knowledge transfer from AI systems back to human domain experts [16, 109–112, 118].
The XAI literature for predictive maintenance has diversified substantially since 2024, moving beyond SHAP and LIME toward approaches that generate operator-actionable output. A PRISMA review of XAI methods in PdM [
95] documents that attribution-based techniques currently dominate but highlights the absence of any consensus metric for explanation quality—a gap that limits objective comparison of XAI methods in the same way that inconsistent validation schemes limit comparison of predictive models. Counterfactual methods [
96] reframe the explanation task from attribution to intervention: rather than identifying which features drove a prediction, they identify the minimum operational change that would have altered the outcome, a formulation directly useful for maintenance scheduling. Gama et al. [
97] demonstrate a neuro-symbolic architecture on the Metro do Porto transit system in which an autoencoder detects anomalies while a companion rule-learner generates IF-THEN logic that operators can inspect and audit. A 2026 survey of neuro-symbolic approaches to PdM [
98] and independent work from ETH Zurich and EPFL on unsupervised XAI-guided diagnosis [
99] show that symbolic reasoning components are being integrated into deep models at an increasing rate. The 15.6% XAI adoption figure reported for the 2007–2024 corpus therefore represents a historical baseline, not the current trajectory.
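The counterfactual formulation discussed above can be illustrated with a deliberately simple sketch. It is not the method of the cited work; it is a brute-force search over a hypothetical set of candidate operating values, using an assumed threshold rule (`alarm`) in place of a trained model. Feature scaling and feasibility constraints, which a realistic counterfactual method would need, are omitted.

```python
def smallest_single_feature_change(predict_alarm, x, candidates):
    """Find the smallest single-feature edit that clears the alarm.

    candidates maps a feature index to the values an operator could set.
    Returns (feature_index, new_value, change) or None if no edit works.
    """
    best = None
    for i, values in candidates.items():
        for v in values:
            trial = list(x)
            trial[i] = v
            if not predict_alarm(trial):
                change = abs(v - x[i])
                if best is None or change < best[2]:
                    best = (i, v, change)
    return best


def alarm(features):
    """Hypothetical alarm rule standing in for a trained model."""
    load, temperature = features
    return 2 * load + temperature > 10.5


x = [4.0, 5.0]  # assumed current operating point (alarm is raised)
candidates = {
    0: [3.5, 3.0, 2.5, 2.0],   # assumed admissible load settings
    1: [4.0, 3.0, 2.0, 1.0],   # assumed admissible temperature settings
}
result = smallest_single_feature_change(alarm, x, candidates)
print(result)
```

Here the search reports that reducing the first feature (load) from 4.0 to 2.5 is the smallest single change that clears the alarm, which is the kind of intervention-oriented output that distinguishes counterfactual explanations from attribution scores.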
Table 8 summarizes the deployment readiness assessment and reveals substantial gaps between laboratory demonstrations and plant-floor applicability. This finding underscores the urgent need for deployment-oriented research that goes beyond algorithmic novelty.
A strong pattern of metric homogenization is evident across predictive maintenance tasks, indicating limited diversity in evaluation practices. Accuracy is the dominant primary performance metric in every task category, with shares ranging from 72.7% in condition monitoring to 90.9% in failure prediction, which indicates near-universal reliance on a single indicator. Additionally, each task reports only one unique metric, reinforcing the lack of methodological variation in performance assessment. While this uniformity simplifies comparison across studies, it also raises concerns about whether accuracy adequately captures task-specific complexities, particularly in imbalanced or time-dependent scenarios. These findings, detailed in
Table 9, point to a critical need for more nuanced and task-appropriate evaluation frameworks.
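The concern about accuracy in imbalanced scenarios can be shown with a small worked example. The sketch below uses synthetic labels (5 failures in 100 samples, an assumed ratio chosen only for illustration) and a trivial classifier that never predicts a failure; none of the numbers come from the reviewed studies.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = failure)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Ratios are defined as 0.0 when their denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Synthetic, imbalanced test set: 5 failures, 95 healthy samples
y_true = [1] * 5 + [0] * 95
# Trivial model that always predicts "healthy"
y_pred = [0] * 100

scores = classification_metrics(y_true, y_pred)
print(scores)
```

The trivial model reaches 95% accuracy while detecting none of the failures (recall and F1 of zero), which is why accuracy alone can overstate performance on failure prediction tasks.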
Sensitivity analysis by literature type (RQ3 robustness). Because the conclusions of this review integrate a peer-reviewed corpus (n = 64) with an arXiv preprint cohort (n = 25) whose quality has not been externally certified, a sensitivity analysis by literature type was conducted to determine whether the principal findings depend on the inclusion of gray literature evidence.
Scenario A—Peer-reviewed corpus only (n = 64). Re-computing the validation rigor and deployment readiness distributions using only the peer-reviewed studies yields the same concentrations reported in
Section 3.5 and
Section 3.6: 65.6% at Tier 0–1, 10.9% at Tier 3, 60.9% at DRS = 0, 4.7% at DRS = 3, and 15.6% XAI adoption. The validation crisis and deployment gap findings are therefore independent of the preprint cohort and rest entirely on certified peer-reviewed evidence.
Scenario B—arXiv preprint cohort only (n = 25). The 25 preprints, treated as an independent non-peer-reviewed cohort, yield a mean DRS of 1.72, a DRS = 0 share of 4.0%, and a DRS = 3 share of 16.0%, together with 100% XAI coverage in the preprints selected through the semantic search. These figures describe the gray literature cohort on its own terms and are not aggregated with the peer-reviewed numbers.
Scenario C—Claim stability under preprint exclusion. The central comparative claim of this review—that the preprint cohort exhibits substantially higher deployment readiness than the peer-reviewed corpus—is tested by re-expressing it under a stricter rule. Excluding the five preprints with the highest methodological variability (those reporting bespoke non-standardized hardware benchmarks without third-party replication, n = 5), the mean DRS of the remaining 20 preprints drops from 1.72 to approximately 1.40, which is still about 2.2× the peer-reviewed mean (0.63) and preserves the direction and order of magnitude of the original claim. Excluding every preprint reporting DRS = 3 (n = 4), the conservative residual mean DRS is approximately 1.48, still more than 2.3× the peer-reviewed mean. The direction of the finding is therefore robust to preprint-quality adjustment; what a stricter reading modifies is the precise magnitude, not the sign, of the difference.
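The arithmetic behind Scenario C can be re-derived from the figures reported in this section. The sketch below assumes that DRS is an integer score per study, so that the cohort mean of 1.72 over 25 preprints corresponds to a summed score of 43; the variable names are illustrative, and the approximately 1.40 mean for the first exclusion rule is taken as reported because the individual scores of the five excluded preprints are not given here.

```python
# Figures reported in the text
n_preprints = 25
mean_drs_preprints = 1.72
share_drs3 = 0.16          # share of preprints at DRS = 3
mean_drs_peer = 0.63       # peer-reviewed mean DRS
mean_drs_trimmed = 1.40    # reported mean after excluding 5 high-variability preprints

# Assumption: DRS is an integer score, so the cohort total is a whole number
total_drs = round(mean_drs_preprints * n_preprints)
n_drs3 = round(share_drs3 * n_preprints)

# Exclude every preprint with DRS = 3 and recompute the mean
residual_mean = (total_drs - 3 * n_drs3) / (n_preprints - n_drs3)

# Ratios to the peer-reviewed mean under each scenario
ratio_full = mean_drs_preprints / mean_drs_peer
ratio_trimmed = mean_drs_trimmed / mean_drs_peer
ratio_residual = residual_mean / mean_drs_peer

print(f"total DRS = {total_drs}, DRS=3 preprints = {n_drs3}")
print(f"residual mean = {residual_mean:.2f}")
print(f"ratios: full = {ratio_full:.2f}, trimmed = {ratio_trimmed:.2f}, "
      f"residual = {ratio_residual:.2f}")
```

Under these assumptions the residual mean is 31/21, or about 1.48, and all three ratios to the peer-reviewed mean stay above 2, consistent with the statement that the exclusions change the magnitude but not the direction of the difference.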
Interpretation. The preprints are interpreted in this review as signals of the current research frontier rather than as validated evidence of deployed practice, and the strength of the arXiv-vs-peer-reviewed difference reported in
Section 3.6 is conditional on the preprints being subsequently ratified through peer review. Until that ratification occurs, the precise magnitudes attached to the preprint cohort should be read as upper bounds on the publishable state of the art, not as settled population parameters. The central claims of this review are restated here explicitly under this constraint: (i) the validation crisis and deployment gap in the peer-reviewed corpus are documented with peer-reviewed evidence only and do not depend on the preprint cohort; (ii) the preprint cohort provides independent, though non-peer-reviewed, corroboration that the research community is actively addressing the gap; and (iii) the roughly three-fold difference in mean DRS between cohorts (1.72 vs. 0.63, a ratio of about 2.7), though stable in direction under sensitivity analysis, should be monitored as preprints migrate into indexed peer-reviewed publications over the next 12–24 months.