1. Introduction
Failures in complex industrial systems can lead to substantial financial losses, extended operational downtime, and critical safety risks. To mitigate these risks, predictive maintenance (PdM) has emerged as a proactive strategy that anticipates potential faults and schedules interventions before failures occur [
1,
2]. While traditional PdM approaches typically rely on machine learning models trained with large volumes of labeled data, they assume the availability of sufficient and balanced samples from both healthy and faulty system states [
3,
4]. However, in real-world industrial environments, failure events are inherently rare, expensive to replicate, and difficult to label, resulting in highly imbalanced datasets. This imbalance significantly undermines the generalization performance of standard classifiers, often leading to high false-negative rates, particularly when detecting rare but critical failures [
5,
6].
Moreover, acquiring a substantial amount of labeled failure data is especially challenging in safety-critical or high-reliability systems, where faults occur rarely and only over long operational lifespans [
7,
8]. Even when massive volumes of sensor data are collected, the vast majority of observations represent normal operating conditions, offering limited insight into the actual failure dynamics. In addition, conventional diagnostic pipelines often rely on manual feature engineering and extensive model tuning, which are not only time-consuming but also heavily reliant on domain expertise [
9,
10,
11].
Machine maintenance can be effectively planned and potential failures can be predicted by leveraging data collected from internal and/or external sensors, which provide continuous insight into the system’s operational health [
7,
12]. In a typical predictive maintenance pipeline, raw sensor data is first recorded, then preprocessed through noise filtering and, when necessary, segmented into windows. Subsequently, feature extraction is performed to derive informative representations from the preprocessed signals. Finally, an appropriate classification or regression model is trained to learn the mapping between these features and their corresponding labels [
9,
13].
Recent progress in deep learning (DL) has transformed many aspects of predictive maintenance (PdM). Convolutional neural networks (CNNs), in particular, have become prominent because they can automatically extract multi-level feature representations from raw sensor inputs [
14]. Despite their power, CNN-based models typically require large, well-balanced labeled datasets to achieve robust performance. When trained on limited or imbalanced data, they are highly susceptible to overfitting, which undermines their generalization ability and, consequently, limits their practical deployment in real-world industrial settings [
15,
16].
To address the limitations of conventional DL in data-scarce settings, transfer learning has emerged as a powerful and practical alternative. Within this approach, deep networks pre-trained on extensive datasets—such as ResNet [
17,
18] models originally trained on large-scale image collections like ImageNet [
19]—are adapted to new tasks with relatively small sets of labeled samples. These models retain generalizable low-level features (e.g., edges, textures) from their source domain, which can be effectively reused in new tasks. By leveraging this knowledge, transfer learning not only reduces the need for extensive labeled data but also significantly enhances model performance in low-resource and highly imbalanced environments [
20,
21].
Many recent studies have used supervised or semi-supervised models for failure detection, but these usually require large and balanced datasets. In real industrial systems, failure samples are rare, so such methods often perform poorly. Most works depend on conventionally extracted features or complex ensembles, and only a few explore pre-trained convolutional networks for learning from limited data. In addition, many models are evaluated without cost-aware metrics, even though missing a failure is far more critical than a false alarm. There is thus a clear need for a scalable and data-efficient solution. To address this need, this study proposes an image-based transfer learning framework designed for highly imbalanced failure prediction and suitable for edge deployment.
In this study, we propose an efficient failure detection framework that leverages transfer learning with a ResNet-18 architecture. To enable compatibility with image-based CNN models, one-dimensional sensor signals are first transformed into two-dimensional representations using a signal-to-image conversion process. The resulting images are then utilized to fine-tune a pre-trained ResNet-18 for fault classification.
We evaluate our method on the Scania Trucks Air Pressure System (APS) dataset [
22,
23], a publicly available benchmark that presents an extreme class imbalance scenario. The training set comprises 60,000 samples, with only 1000 (1.66%) belonging to the positive class (Failure Type 2, FT2) and the remaining 59,000 to the negative class (Failure Type 1, FT1), reflecting a highly imbalanced distribution.
This extreme imbalance provides a realistic and challenging testbed for assessing the effectiveness of transfer learning in failure prediction tasks.
To further align with emerging trends [
24,
25] in edge intelligence and edge–cloud cooperation, this work contributes to the broader discourse on scalable AI for industrial automation. The proposed transfer learning-based framework offers a lightweight yet powerful alternative to conventional deep learning, enabling reliable inference under computational and data constraints typical of edge environments. By minimizing training requirements while maintaining high predictive accuracy, our approach demonstrates how edge-deployable AI models can support real-time predictive maintenance in smart manufacturing and industrial IoT systems. While we used a standard laptop for testing, the low parameter count and small memory footprint of the model indicate that it can be implemented on typical edge devices such as industrial micro-computers or ARM-based platforms.
The main contributions of this paper are three-fold.
We present a novel signal-to-image transformation pipeline that enables the use of one-dimensional sensor signals as RGB image inputs for convolutional neural networks (CNNs).
We leverage a pre-trained ResNet-18 architecture to transfer low-level features (e.g., edges, color gradients, textures), improving failure classification performance in the presence of highly imbalanced data.
We demonstrate that even when fine-tuned on a restricted subset of the training data, the proposed model achieves strong classification performance, highlighting its effectiveness in low-resource industrial settings.
The remainder of the paper is structured as follows: Section 2 reviews the related literature. Section 3 describes the dataset, the evaluation metrics, the background methods (CNN and ResNet-18), and the proposed method. Section 4 presents the experimental results. Section 5 concludes the paper.
2. Literature Review
In recent years, the application of machine learning (ML) and DL methods for failure detection has gained significant momentum, particularly in the context of industrial applications. Accurate and timely failure identification is essential for minimizing unplanned downtime, reducing maintenance costs, and maintaining overall system reliability. However, one of the most persistent challenges in this domain is the imbalanced nature of failure datasets, where positive (faulty) instances are significantly outnumbered by normal ones. This section reviews key contributions to the field and highlights research gaps that motivate our proposed approach.
Freitas et al. (2023) proposed a three-stage data-driven framework for PdM in commercial vehicle turbochargers, emphasizing the importance of data preparation and domain understanding [
26]. Although their work laid a solid foundation for model development, it stopped short of implementing ML algorithms for actual failure detection, leaving a gap in predictive performance evaluation using real-world datasets. In addition, the framework does not provide guidance on how the extracted features should be integrated into scalable, cost-efficient models, which limits its practical use in industrial deployment scenarios.
To address the issues of high dimensionality and imbalance, Mao and Cheng (2023) introduced a Modified Mahalanobis–Taguchi System (MMTS) [
27]. Their approach combined ReliefF feature ranking and particle swarm optimization (PSO) to construct an effective classification model tailored to detect air pressure system (APS) faults in trucks. The MMTS demonstrated superior accuracy over conventional models, particularly in imbalanced scenarios, validating the efficacy of hybrid optimization for robust classification. However, the method still relies heavily on expert-defined features and assumes that the feature selection process can consistently capture rare failure patterns, which may limit its generalization under unexpected industrial conditions.
From a model interpretability perspective, Farea et al. (2025) developed an Explainable Boosting Machine (EBM) framework to detect APS failures using operational driving data from heavy-duty vehicles [
28]. Their model obtained an accuracy of 91.4% and an F1-score of 0.80 while offering interpretable outputs that align with domain expert knowledge. This work underscores the growing importance of explainable artificial intelligence (XAI) in high-risk applications such as predictive maintenance. However, the reported performance still depends on the availability of representative failure examples, and the model may struggle to generalize when only a very limited number of faulty samples are available in real-world settings.
Complementary to this, Mumcuoglu et al. (2024) proposed a semi-supervised anomaly detection architecture that leverages Long Short-Term Memory Autoencoders (LSTM-AE) and Transformer-based models (TranAD) [
29]. By integrating Human Expert Analysis (HEA) with deep learning, their hybrid model achieved a 92.8% accuracy and an F1-score of 0.82, showing clear benefits in reducing false alarms and enhancing trustworthiness in imbalanced settings. At the same time, the reliance on expert feedback limits scalability and may not be feasible for continuous deployment.
Beyond model design, Shyalika et al. (2024) investigated the role of data enrichment techniques—such as augmentation, sampling, and imputation—in enhancing the performance of rare event prediction in manufacturing [
30]. By applying these strategies across five industrial datasets and evaluating 15 learning models, their framework demonstrated that data enrichment can boost F1-scores by up to 48%, emphasizing the need for robust preprocessing in highly imbalanced environments. However, the benefit of these methods largely depends on how realistically the synthetic data represents actual failures, which is not always easy to ensure.
While these studies represent significant progress in failure detection, several gaps remain. Most approaches rely on either supervised or semi-supervised architectures and lack generalization capabilities across different domains or failure types. Few studies explore the potential of transfer learning for cross-domain feature adaptation in imbalanced settings. Moreover, although explainability has gained traction, many high-performing models remain opaque or overly complex for real-time industrial deployment.
In response to these challenges, our proposed method integrates transfer learning with uncertainty-aware classification and a re-routing mechanism for low-confidence predictions. This design not only improves detection performance under class imbalance but also mimics human-like decision reassessment, contributing to more interpretable and reliable fault diagnosis in critical systems.
3. Materials and Methods
In this section, we present the methodological framework developed to address the challenges posed by highly imbalanced and data-constrained fault detection scenarios. The proposed approach combines transfer learning with a cost-sensitive classification strategy, enabling reliable failure prediction even with severely limited failure samples. To this end, one-dimensional sensor signals are transformed into two-dimensional color-encoded representations, making them compatible with convolutional neural networks. A pre-trained ResNet-18 architecture is subsequently fine-tuned on the converted data to capture relevant failure patterns while minimizing the risk of overfitting. The methodology is organized into five main components: (1) dataset description and preprocessing, (2) evaluation metrics tailored for imbalanced classification, (3) baseline CNN model, (4) ResNet-based transfer learning architecture, and (5) the proposed classification framework. Each component is detailed in the following subsections.
3.1. Dataset
The dataset employed in this study is the APS Failure at Scania Trucks dataset, which is openly accessible through the UCI Machine Learning Repository [
22,
23]. It contains sensor measurements obtained from heavy-duty Scania trucks operating under real-world conditions.
The focus of the dataset is the Air Pressure System (APS), a subsystem responsible for supplying compressed air to essential vehicle functions such as braking and gear shifting.
The dataset provides labeled examples for two classes based on expert evaluation. The positive class corresponds to failures occurring within a specific APS component, whereas the negative class refers to issues arising in other truck components unrelated to the APS. This dataset is a refined subset of a larger data collection, selected to ensure high relevance and reliability for predictive maintenance research.
The training set comprises 60,000 samples, with only 1000 (1.66%) belonging to the positive class (Failure Type 2, FT2), and the remaining 59,000 to the negative class (Failure Type 1, FT1), reflecting a highly imbalanced class distribution. The testing set contains 16,000 samples, similarly distributed. Each sample is described by 171 numerical features obtained from various sensors. However, the dataset contains a substantial amount of missing data. Six attributes with a large number of missing values were discarded, and the remaining missing entries were imputed with the mean of the corresponding feature.
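A minimal sketch of this preprocessing step, assuming the data is held in a pandas DataFrame; the exact missing-value cutoff used to discard the six attributes is not stated in the paper, so the `max_missing_ratio` threshold below is a hypothetical parameter:

```python
import numpy as np
import pandas as pd

def preprocess(df, max_missing_ratio=0.5):
    """Drop features with too many missing values, mean-impute the rest."""
    keep = df.columns[df.isna().mean() <= max_missing_ratio]
    df = df[keep]
    return df.fillna(df.mean())

# toy example: column "b" is mostly missing and gets dropped,
# the single NaN in "a" is replaced by the column mean
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 5.0]})
clean = preprocess(df, max_missing_ratio=0.5)
print(clean.columns.tolist(), clean["a"].tolist())  # ['a'] [1.0, 2.0, 3.0]
```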
3.2. Metrics
Evaluating classifier performance on imbalanced datasets requires more than simple accuracy, which can be misleading when the majority class dominates. Instead, a range of performance metrics tailored for class imbalance is adopted in this study. Let the confusion matrix be defined as the following table.
The following metrics have been computed from
Table 1.
3.2.1. Precision (P)
Measures the proportion of correctly predicted positive samples among all predicted positives:

P = TP / (TP + FP)
High precision indicates a low false alarm rate, which is essential in cost-sensitive industrial systems.
3.2.2. Recall (R)
Also termed sensitivity or the true positive rate, this measure reflects the model’s capability to correctly recognize genuine failures:

R = TP / (TP + FN)
In predictive maintenance, recall is critical because missing a failure (false negative) could result in significant downtime or safety risk.
3.2.3. Specificity (S)
Also referred to as the true negative rate, specificity evaluates how well the model avoids false alarms:

S = TN / (TN + FP)
This is especially important for avoiding unnecessary maintenance in healthy vehicles.
3.2.4. F1-Score (F1)
F1-score is computed as the harmonic mean between precision and recall:

F1 = 2 × P × R / (P + R)
This metric balances false positives and false negatives, making it suitable for imbalanced classification problems.
3.2.5. Accuracy (Ac)
In classification studies, accuracy is a standard metric that reflects how many instances were correctly labeled out of all predictions made. Mathematically, it is defined as

Ac = (TP + TN) / (TP + TN + FP + FN)
3.2.6. Negative Predictive Value (NPV)
Reflects the model’s ability to assign the negative label accurately among all predicted negatives:

NPV = TN / (TN + FN)
A high NPV suggests reliability in identifying healthy (non-failing) vehicles.
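All six metrics defined above can be computed directly from the four confusion-matrix counts. The counts in the example below are arbitrary and chosen only for illustration:

```python
def metrics(tp, fp, tn, fn):
    """Compute the imbalance-aware metrics from confusion-matrix counts."""
    p = tp / (tp + fp)                       # precision
    r = tp / (tp + fn)                       # recall / sensitivity
    s = tn / (tn + fp)                       # specificity
    f1 = 2 * p * r / (p + r)                 # harmonic mean of P and R
    ac = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    npv = tn / (tn + fn)                     # negative predictive value
    return dict(P=p, R=r, S=s, F1=f1, Ac=ac, NPV=npv)

m = metrics(tp=90, fp=10, tn=890, fn=10)
print(m)
```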
3.2.7. Cost-Sensitive Metric
In real-world predictive maintenance applications, the costs associated with misclassification are inherently asymmetric. In particular, misclassifying a failed truck (FT2) as healthy (FT1) can result in severe safety risks, unexpected breakdowns, and significant financial losses. In contrast, misclassifying a healthy truck as failed leads primarily to unnecessary maintenance costs and minor operational inefficiencies. To better reflect this asymmetry in evaluation, we define a cost-sensitive loss metric as follows:
Cost = C_FP × FP + C_FN × FN

Here, the values of the constant parameters advised by the literature [31] have been utilized: C_FP = 10 represents the cost of a false positive (i.e., misclassifying a healthy truck as faulty), and C_FN = 500 denotes the cost of a false negative (i.e., failing to detect a true fault). This sharp cost contrast (C_FN ≫ C_FP) emphasizes the critical importance of minimizing false negatives in safety-critical systems such as industrial vehicle fleets.
This asymmetric weighting scheme places greater emphasis on false negatives, aligning the evaluation metric with the real-world consequences of undetected failures. By assigning a significantly higher penalty to missed fault predictions, the cost metric offers a more realistic and deployment-oriented assessment of the model’s practical utility, especially in safety-critical and cost-sensitive industrial environments.
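The cost metric reduces to a weighted sum of the two error counts. A minimal sketch with the advised constants (10 units per false positive, 500 units per false negative):

```python
C_FP, C_FN = 10, 500  # values advised in the APS challenge literature

def aps_cost(fp, fn):
    """Total misclassification cost: a missed fault is 50x more expensive."""
    return C_FP * fp + C_FN * fn

# a model with few false alarms but many missed faults is penalized heavily
print(aps_cost(fp=100, fn=5))   # 10*100 + 500*5  = 3500
print(aps_cost(fp=5, fn=100))   # 10*5 + 500*100 = 50050
```

This makes explicit why a classifier with excellent accuracy can still score poorly: a handful of false negatives dominates the total cost.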
3.3. Convolutional Neural Network
CNNs have emerged as a cornerstone in image-based learning tasks due to their ability to automatically extract spatial hierarchies of features through localized receptive fields and shared parameters [
32]. These architectures are particularly advantageous when working with image data, as they reduce the number of trainable weights relative to fully connected networks, thereby mitigating overfitting and enhancing generalization performance [
33]. The structure of a typical CNN is illustrated in
Figure 1.
In the context of this study, a standard CNN model is employed as a baseline for performance comparison. Given that the original dataset comprises one-dimensional sensor signals, these signals are transformed into two-dimensional color images to enable spatial pattern learning within the CNN framework. This transformation allows the model to utilize convolutional operations to capture representative fault features.
A typical CNN architecture is composed of the following sequential components:
Convolutional Layer:
The convolutional layer applies a set of learnable filters across the spatial dimensions of the input image. For a given input X, filter W, and bias b, the convolution operation is defined as

Y = f(W ∗ X + b),

where ∗ denotes the convolution operator and f is the activation function, typically ReLU. These filters are responsible for detecting local patterns such as edges, corners, and textures [34].
Activation Layer (ReLU):
ReLU is used as the activation function to provide the model with non-linear transformation capability:

f(x) = max(0, x)
ReLU facilitates faster convergence during training and alleviates the vanishing gradient problem commonly encountered in deep networks [
34].
Pooling Layer:
Pooling operations are employed to reduce the spatial resolution of feature maps, effectively lowering computational complexity and providing a form of translation invariance. Max pooling, the most common form, is defined as

y_{i,j} = max_{(p,q) ∈ R_{i,j}} x_{p,q},

where R_{i,j} denotes the receptive region associated with position (i, j) [34].
Fully Connected Layer:
Following the convolutional and pooling stages, the high-level feature maps are flattened and passed through one or more fully connected layers. These layers integrate the spatially distributed features and facilitate the final classification via a softmax or sigmoid activation function, depending on the nature of the task [
34].
3.4. Residual Network (ResNet)
Conventional CNNs have demonstrated remarkable success in visual recognition tasks; however, their effectiveness tends to diminish in scenarios involving small-scale or imbalanced datasets. Moreover, as network depth increases, issues such as vanishing gradients and optimization instability emerge, often resulting in performance degradation. To address these limitations, Residual Networks (ResNets) [
17,
18] introduce a novel architectural paradigm that facilitates the training of very deep models through the use of identity shortcut connections, which bypass one or more layers. A residual block is formally expressed as

y = F(x) + x,

where x is the input to the block, F(x) denotes the residual mapping to be learned (typically a series of convolution, batch normalization, and activation operations), and y is the output. This structure enables the training of networks with significantly increased depth while maintaining stable convergence.
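A basic residual block of this form can be sketched in PyTorch as follows. The conv-BN-ReLU composition of F(x) shown here is the standard basic-block choice, assumed rather than taken from the paper:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: y = relu(F(x) + x), with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(                       # residual mapping F(x)
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)              # add identity shortcut

x = torch.randn(2, 64, 8, 8)
y = ResidualBlock(64)(x)
print(y.shape)  # same shape as the input
```

Because the shortcut carries the input unchanged, gradients can flow directly through deep stacks of such blocks, which is what mitigates the vanishing-gradient issue described above.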
In this study, we adopt ResNet-18, a relatively lightweight architecture consisting of 18 layers, to strike a balance between computational efficiency and classification performance. The rationale for this choice is twofold:
Computational Feasibility: Compared to deeper alternatives such as ResNet-50 or ResNet-101, ResNet-18 requires significantly fewer parameters and operations, making it more practical for real-time deployment on standard hardware platforms [
35].
Robust Feature Extraction: ResNet-18 is pre-trained on the large-scale ImageNet dataset, which contains over 1 million annotated images across 1000 categories. The early convolutional layers of the network extract domain-agnostic features—such as edges, color gradients, and textures—that are broadly applicable to diverse data modalities once appropriately formatted [
35].
To adapt the network for failure classification, the one-dimensional sensor signals are first transformed into two-dimensional color-encoded images. These images are then processed using the pre-trained ResNet-18 model. During this adaptation phase, the core convolutional layers are retained to preserve general visual feature representations, while the fully connected classification layers are fine-tuned on the target dataset to specialize in distinguishing between different failure types.
Experimental results indicate that ResNet-18 performs robustly even with a limited number of training samples and under severe class imbalance. This capability makes it particularly suitable for industrial scenarios where failure events are rare and data collection is constrained. In the subsequent section, the integration of ResNet-18 into the proposed classification framework is described in detail.
3.5. Proposed Method
In this study, we propose a transfer learning-based classification framework designed to address the challenges associated with failure detection in industrial systems, particularly those characterized by extreme class imbalance and limited labeled fault data. The proposed method integrates signal-to-image transformation, pre-trained deep feature extraction, and cost-aware training to address the primary constraints that hinder the performance of conventional machine learning and deep learning models in predictive maintenance scenarios. The general block diagram of the proposed method has been presented in
Figure 2.
The overall pipeline is illustrated in
Figure 3 and comprises the following stages:
1. Signal-to-Image Conversion

Given that the original dataset consists of one-dimensional sensor readings, a critical pre-processing step transforms these time-series signals into two-dimensional color-encoded image representations. This conversion makes the data compatible with convolutional architectures, which are inherently optimized for spatial pattern recognition, and allows the model to exploit structural regularities in signal morphology through two-dimensional convolutions. The full procedure is as follows: each 1D feature vector is first normalized to the range [0, 1], then reshaped into a compact 2D grid that preserves the original feature ordering. This grid is converted to a three-channel representation through a fixed colormap, producing an initial pseudo-image. Finally, the pseudo-image is resized to 224 × 224 pixels to match the input requirements of ResNet while keeping the relative spatial arrangement of features intact. This transformation enables the network to learn local co-activation patterns across neighboring features using 2D convolutions.
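A possible implementation of this conversion is sketched below. The grid side length, zero-padding of unused cells, and the viridis colormap are illustrative choices not specified in the paper (165 retained features fit a 13 × 13 grid); matplotlib >= 3.5 is assumed for the colormap registry:

```python
import numpy as np
from matplotlib import colormaps
from PIL import Image

def signal_to_image(vec, side=13, colormap="viridis", out_size=224):
    """Normalize a 1-D feature vector to [0, 1], reshape it to a square grid
    (zero-padded), color-encode it with a fixed colormap, and resize."""
    v = np.asarray(vec, dtype=float)
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)
    grid = np.zeros(side * side)
    grid[:len(v)] = v                       # preserve the feature ordering
    grid = grid.reshape(side, side)
    rgba = colormaps[colormap](grid)        # (side, side, 4) in [0, 1]
    rgb = (rgba[..., :3] * 255).astype(np.uint8)
    resized = Image.fromarray(rgb).resize((out_size, out_size), Image.NEAREST)
    return np.array(resized)

img = signal_to_image(np.random.rand(165))  # 165 features after preprocessing
print(img.shape)  # (224, 224, 3)
```

Nearest-neighbor resizing keeps each feature cell as a solid colored patch, so the relative spatial arrangement of features survives the upscaling.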
2. Feature Representation using ResNet-18
Following the image conversion, the resulting representations are processed through the ResNet-18. The initial convolutional layers, responsible for learning generic visual primitives such as edges, contours, and textures, are preserved, whereas the final fully connected classification layers are reinitialized and fine-tuned on the target dataset. This strategy not only reduces the number of parameters that need to be trained from scratch but also significantly enhances generalization in data-constrained environments.
3. Class-Rebalanced Training Approach
To counteract the adverse effects of class imbalance, the training process employs a rebalanced data sampling strategy in which an equal number of failure (FT2) and non-failure (FT1) samples are randomly selected. Experiments are conducted under varying sample sizes (e.g., 2000, 40, 20, 10 samples per class) to evaluate the stability and scalability of the proposed method in few-shot learning settings.
4. Cost-Sensitive Optimization
Recognizing the asymmetric risk associated with different types of misclassification, particularly the high operational cost of false negatives in safety-critical systems, a domain-specific cost metric is employed.
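The class-rebalanced sampling of stage 3 might be sketched as follows; the toy data below merely mimics the imbalance ratio and is not the APS dataset:

```python
import numpy as np

def balanced_subset(X, y, n_per_class, rng=None):
    """Randomly draw an equal number of samples from each class."""
    rng = np.random.default_rng(rng)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), n_per_class, replace=False)
        for c in np.unique(y)
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

# toy imbalanced data: 59 negatives for every positive, as in APS (scaled down)
y = np.array([0] * 590 + [1] * 10)
X = np.arange(600).reshape(-1, 1)
Xb, yb = balanced_subset(X, y, n_per_class=5, rng=0)
print(len(yb), int(yb.sum()))  # 10 samples, 5 of them positive
```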
4. Results and Discussion
To validate the performance of the proposed failure detection framework, we conducted a series of experiments using the Air Pressure System (APS) dataset provided by Scania. The dataset consists of 60,000 training samples and 16,000 testing samples, each with 171 sensor-based features. Six features with substantial missing values were removed. The remaining missing entries were imputed using feature-wise mean values to maintain data integrity without introducing synthetic noise.
All signals were preprocessed and transformed into RGB image representations suitable for CNN input. The training datasets were constructed in multiple configurations to evaluate performance under varying sample constraints. Specifically, balanced training subsets with sizes of 2000, 40, 20, and 10 samples (equal number of FT1 and FT2) were randomly selected from the full dataset for few-shot learning scenarios.
The deep learning models were implemented in Python 3.10 using PyTorch 2.1.0 and trained on a standard laptop equipped with an Intel Core i7 processor (2.2 GHz) and 8 GB RAM. No GPU acceleration was utilized to emphasize the framework’s suitability for low-resource environments. The ResNet-18 architecture was selected due to its optimal trade-off between representational capacity and computational efficiency. A three-layer conventional CNN was used as a baseline for comparison.
In this work, no explicit cost-sensitive loss function is applied during training. All models, including ResNet, are optimized using standard cross-entropy loss. Cost sensitivity is introduced exclusively at the evaluation stage through the APS cost metric, where false negatives and false positives incur penalties of 500 and 10 units, respectively. This design aligns with the original APS challenge, in which the cost metric is used to assess operational risk rather than to modify the training objective.
All models were trained using the Adam optimizer with a learning rate of 0.001, a batch size of 16, and standard cross-entropy loss. Training was performed for up to 100 epochs, with early stopping based on validation loss to prevent overfitting. For ResNet-18 shown in
Figure 3, the weights were initialized from ImageNet pretraining, and the final fully connected layer was replaced with a 2-unit classification head. A mild weight decay (0.0001) was applied to stabilize optimization, and the learning rate was kept constant throughout training for simplicity. These settings represent a standard and lightweight configuration commonly used in transfer-learning scenarios and were sufficient to achieve stable convergence on the APS dataset.
Performance was evaluated using both standard metrics (Precision, Recall, Specificity, F1-score, Ac, NPV) and a cost-sensitive metric designed to penalize FT2 misclassifications more heavily, reflecting the practical impact of rare failure events in real-world predictive maintenance settings.
The proposed framework’s performance is investigated in two stages. In the first stage, we compared the proposed ResNet-based framework with a conventional three-layer CNN. The obtained results are presented in
Table 2. In addition, the comparisons of the costs are illustrated in
Figure 4.
Table 2 demonstrates that the proposed ResNet-based framework consistently outperforms the conventional CNN in terms of cost-efficiency and generalization across varying training sample sizes. Although CNN achieves slightly higher P, F1, and S in scenarios with extremely limited data (e.g., TSN = 10), it suffers from a notable decline in NPV and incurs significantly higher cost values. For instance, when only 10 training samples are used, the cost associated with the CNN reaches 44,194, whereas the ResNet model maintains a much lower cost of 22,728, indicating its superior robustness in cost-sensitive classification. This suggests that the CNN tends to overfit the minority class at the expense of false negatives in the majority class, which are heavily penalized by the domain-specific cost function. Furthermore, the ResNet model exhibits greater stability across different sample sizes, preserving a high level of recall (above 99.8%) and balanced NPV, even with only a handful of training examples. These findings validate the effectiveness of transfer learning in handling highly imbalanced datasets with limited failure data and underline the importance of using cost-aware metrics in evaluating classifier performance for predictive maintenance applications.
A close examination of the costs in
Figure 4 across varying TSN reveals a critical insight into the comparative robustness of the evaluated models under data-scarce conditions. As expected, reducing the TSN leads to a substantial increase in classification cost, driven primarily by the heightened likelihood of misclassifying minority class instances (FT2), which are associated with a significantly higher penalty in the defined cost function. However, the rate and magnitude of cost escalation differ markedly between the two models. The CNN model exhibits a sharp and disproportionate increase in cost as TSN decreases, from 15,640 at 2000 samples to 44,194 at 10 samples, indicating a strong susceptibility to overfitting and reduced generalization capacity in low-data regimes. Conversely, the proposed ResNet-based approach demonstrates a more controlled and gradual increase in cost, ranging from 14,750 to 22,728 across the same TSN spectrum. This performance stability under severe data constraints underscores the effectiveness of transfer learning in retaining discriminatory power while minimizing high-cost misclassifications.
These results validate the proposed model’s suitability for real-world predictive maintenance applications, where failure data is inherently rare and the cost of incorrect predictions, particularly false negatives, can be prohibitively high. The consistent cost advantage observed for the ResNet model further highlights its potential as a resource-efficient and risk-aware solution in imbalanced and cost-sensitive classification environments.
In the second stage, we measured the performance of several well-known classifiers, including the k-nearest neighbors algorithm (KNN), support vector machine (SVM), decision tree (DT), logistic regression (LR), RUSBoost tree ensemble (RUS), and AdaBoost (AB), to compare them with ResNet.
Table 3 presents a comparative analysis between the proposed ResNet-based model and a suite of conventional machine learning algorithms. All conventional methods were trained on the full dataset comprising 60,000 samples, whereas the ResNet and CNN models were trained using only 2000 balanced samples.
Although traditional methods such as DT, KNN, SVM, and AB achieve high values in conventional metrics such as Ac, P, and recall (R), their associated cost values remain significantly higher than those of the ResNet model trained with only one-thirtieth of the data. For instance, SVM reaches a high Ac of 98.51% and an F1 of 99.24%, yet it incurs a cost of 114,110 due to a substantial number of high-penalty misclassifications involving the minority class. A similar pattern is observed in AB and KNN, with costs exceeding 57,000 and 77,000, respectively, despite their favorable metric profiles. This disconnect highlights the limitations of standard performance indicators in imbalanced, cost-sensitive domains.
LR performs the worst overall, exhibiting both the lowest S (1.16%) and Ac (32.78%), which translates into the highest observed cost (231,030). In contrast, RUSBoost, an ensemble model explicitly designed to address class imbalance, achieves the lowest cost (14,330) among traditional classifiers, closely followed by the proposed ResNet (14,750) and CNN (15,640), both trained on much smaller datasets.
In imbalanced classification problems—such as failure detection in predictive maintenance scenarios—the NPV represents a pivotal performance indicator, as it quantifies the proportion of true negatives among all negative predictions. High NPV is particularly desirable in contexts where failing to identify a faulty instance (i.e., a false negative) may lead to severe operational, financial, or safety-related consequences. Therefore, the ability of a classifier to confidently and accurately identify negative cases is as important as detecting the minority class.
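The pitfall described above can be made concrete with a minimal sketch of the NPV computation; the confusion counts below are illustrative assumptions, not values from the paper's tables.

```python
def npv(tn: int, fn: int) -> float:
    # NPV = TN / (TN + FN): the share of "no-failure" verdicts
    # that are actually correct.
    return tn / (tn + fn) if (tn + fn) else 0.0

# A model that clears 5,900 healthy units but lets 100 faulty ones
# slip through as "healthy" still looks strong on aggregate metrics:
print(round(npv(5_900, 100), 4))  # 0.9833
```

Because NPV isolates the trustworthiness of negative predictions, it degrades quickly as faulty instances are mislabeled healthy, even when overall accuracy barely moves.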
As shown in
Table 3, considerable discrepancies are observed in NPV across the evaluated models. While several conventional classifiers achieve near-optimal values in terms of precision, recall, and accuracy, such as SVM (Ac: 98.51%, F1: 99.24%) and AdaBoost (Ac: 99.09%, F1: 99.54%), their NPV values are markedly lower, recorded at 39.20% and 69.33%, respectively. This performance inconsistency reveals a latent vulnerability: these models are prone to misclassifying faulty instances as healthy, which could be detrimental in real-world deployments.
By contrast, the proposed ResNet-based framework achieves an NPV of 96.53%, approaching the highest score attained by the CNN (97.60%), despite being trained on significantly fewer samples (2000 versus 60,000). Notably, ResNet achieves this high NPV without incurring a substantial cost, demonstrating a more favorable balance between reliability and economic efficiency. In comparison, the CNN, although slightly outperforming ResNet in NPV, yields a higher cost, suggesting a trade-off between overly aggressive recall and increased misclassification penalties. Moreover, while RUSBoost, the best-performing classical method in terms of cost, yields a comparable NPV of 95.20%, it does so using the entire dataset. ResNet, in contrast, maintains competitive performance under limited data availability, underscoring the effectiveness of transfer learning in enhancing generalization from small-scale training sets.
These findings establish NPV as a critical yet often underemphasized metric in the evaluation of classifiers for high-risk, cost-sensitive applications. The ability of the proposed framework to simultaneously sustain high NPV and minimize misclassification cost positions it as a promising and practically viable approach for deployment in industrial fault diagnosis systems. Hence, these findings underline the cost-efficiency and data-efficiency of the ResNet-based framework. Despite operating under limited data constraints, ResNet manages to outperform or closely match models trained on full datasets, especially in cost-critical failure detection tasks. The results validate the integration of transfer learning as a viable strategy for enhancing generalization while minimizing costly misclassifications in real-world industrial applications.
The confusion matrices shown in
Figure 5 visualize the performance of all algorithms. RUS performs slightly better than the proposed framework, largely because it utilizes all available data samples. We also show that the P, R, S, and F1 metrics are not well suited to measuring classifier performance for this problem. For example, KNN misses 42% of FT2 instances, yet its performance appears very good in terms of P, R, S, and F1. It should be noted that the classification performance of the ResNet trained with only ten samples is better than that of all algorithms except RUS.
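A worked example makes this mismatch explicit. The confusion counts below are hypothetical (chosen to mimic a 59:1 imbalance with 42% of the minority class missed, not the paper's actual KNN matrix), and the 10/500 penalties follow the assumed APS challenge convention:

```python
# Majority (healthy) class treated as "positive", as is common in APS-style reporting.
tp, fn = 59_000, 0        # every healthy sample is retained
tn, fp = 580, 420         # 42% of the 1,000 faulty samples are missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
cost = 10 * fn + 500 * fp  # missed faults dominate the penalty

print(round(precision, 4), recall, round(f1, 4), cost)
# precision and F1 exceed 0.99, yet the cost reaches 210,000
```

Conventional metrics stay above 99% because the majority class swamps them, while the cost metric exposes the 420 missed failures directly.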
A comparison of the total execution (TE) times shows a clear separation between classical tabular models and deep convolutional architectures. Methods such as DT, KNN, LR, and RUSBoost complete full-dataset training within 6–16 s, demonstrating their low computational overhead on structured data. SVM is noticeably slower but still capable of processing all 60,000 samples in under 90 s. In contrast, CNN-based models require significantly higher computational effort due to the image-based input representation and convolutional operations. Even when trained on a reduced subset of 2000 samples, the CNN and ResNet models exhibit total execution times of 221 and 521 s, respectively. This comparison highlights that while CNNs provide richer feature extraction capabilities, classical models remain substantially more efficient for large-scale tabular data, whereas deep models incur higher runtime costs that must be considered in resource-constrained or edge-oriented deployments.
Table 4 presents a comparative analysis of the proposed ResNet-based method against a diverse collection of previously published fault detection models, encompassing both traditional machine learning and modern deep learning approaches. While a majority of these studies report high performance in conventional metrics—such as Ac and F1—they frequently overlook cost-sensitive metrics or NPV, which are particularly vital in high-stakes predictive maintenance applications. Several recent studies—including those by Chen [
36], Tanhandiki [
37], and Hussain [
38]—demonstrate strong classification performance with reported accuracies exceeding 98% and F1-scores above 97%. However, these works lack cost or NPV reporting, which limits their interpretability in contexts where the consequences of false negatives outweigh other classification errors. For instance, Hussain’s GB and AB models achieve accuracies of 98.44% and 97.56%, respectively, but provide no insights into their performance in minimizing undetected failures.
In contrast, the proposed ResNet-based model demonstrates a more comprehensive performance profile. With an F1-score of 97.25%, accuracy of 94.76%, NPV of 96.53%, and a comparatively low cost of 14,750, it offers a well-balanced solution that addresses not only predictive capability but also operational risk and economic efficiency. This contrasts with alternative models such as the AB variant of the proposed method, which, while achieving higher accuracy (99.09%), results in a significantly higher cost (57,800) and substantially lower NPV (69.33%), highlighting the trade-off between aggressive classification and risk-sensitive performance. Furthermore, among the few published works that report cost explicitly—such as Kafunah’s DLWP-based methods [
42]—the proposed ResNet model achieves substantially lower misclassification cost, despite being trained under more constrained data conditions. This reinforces its effectiveness in handling imbalanced, high-risk scenarios without relying on extensive computational or data resources.
To better understand how the proposed representation compares with conventional approaches, a direct 1D CNN baseline was also evaluated using the raw feature vectors. Although this model reached a relatively high accuracy of 97.66% and an NPV of 97.66%, it ultimately failed to identify any positive cases due to the severe 59:1 class imbalance in the APS dataset. As a result, the model produced an F1-score of 0 and a very high misclassification cost of 187,500. This outcome reflects a known issue in the APS dataset, where cost, not F1, is the most reliable evaluation metric, and it illustrates the limitations of relying solely on accuracy-based measures in highly skewed scenarios.
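One way to read the 187,500 figure is as a consistency check on the degenerate all-"healthy" predictor; assuming the APS convention of a 500-unit penalty per missed failure (an assumption here, since the paper does not restate the split), the reported cost corresponds to missing every positive case in a test partition containing 375 failures:

```python
# Sanity check: an all-negative predictor incurs only missed-failure penalties.
C_FN = 500                 # assumed per-missed-failure penalty (APS convention)
reported_cost = 187_500
missed_failures = reported_cost // C_FN
print(missed_failures)     # 375 failures, i.e., every positive case missed
```

This back-of-the-envelope check confirms that the high-accuracy, zero-F1 outcome is exactly the failure mode the cost metric is designed to expose.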
To further interpret the classification outcomes, we briefly examined the misclassified samples. We observed that most errors occur in regions where sample measurements lie close to the decision boundary, making them inherently ambiguous. These instances typically exhibit incomplete or weak patterns rather than clear fault signatures. Such borderline cases explain why a small number of FP and FN predictions remain even after model optimization. Their presence is expected in high-dimensional sensor datasets, and the APS cost metric provides a practical way to account for their impact on real-world decision-making.
The comparative findings underscore the practical advantages of the ResNet-based framework in delivering robust, cost-aware, and risk-sensitive classification performance. By incorporating underutilized but critical metrics such as NPV and misclassification cost, the proposed approach addresses important gaps in the current literature and advances the applicability of deep learning in real-world predictive maintenance systems.