Article

CPU Deployment-Oriented Evaluation of Compact Neural Networks for Remaining Useful Life Prediction

by Ali Naderi Bakhtiyari 1,*, Vahid Hassani 2 and Mohammad Omidi 3
1 Center for Advanced Laser Manufacturing (CALM), School of Mechanical Engineering, Shandong University of Technology, Zibo 255049, China
2 Department of Engineering, School of Science and Technology, City St George’s, University of London, London EC1B 0HB, UK
3 School of Electrical Engineering, Hebei University, Baoding 071002, China
* Author to whom correspondence should be addressed.
Machines 2026, 14(4), 375; https://doi.org/10.3390/machines14040375
Submission received: 19 January 2026 / Revised: 16 March 2026 / Accepted: 26 March 2026 / Published: 28 March 2026

Abstract

Remaining Useful Life (RUL) prediction is a key component of prognostics and health management for modern industrial systems. While deep learning methods have significantly improved prediction accuracy, many existing approaches rely on large neural networks that are difficult to deploy on resource-constrained edge devices. This study presents a deployment-oriented evaluation of compact neural networks for RUL prediction using the NASA C-MAPSS turbofan engine benchmark. Two lightweight hybrid architectures, CNN–GRU and CNN–TCN, were developed with approximately 28k–32k parameters to represent realistic models for CPU-based edge inference. A systematic experimental analysis was conducted across all four C-MAPSS subsets (FD001–FD004), which represent increasing levels of operational and fault complexity. In addition to baseline performance, two post-training compression techniques (i.e., global unstructured magnitude pruning and dynamic INT8 quantization) were evaluated. To assess real deployment behavior, inference latency was measured on both a high-performance Intel x86 workstation and a resource-constrained ARM platform. Results show that CNN–GRU generally achieves higher predictive accuracy, whereas CNN–TCN provides more consistent and lower inference latency due to its convolution-only temporal modeling. Unstructured pruning can yield modest improvements in prediction accuracy, suggesting a regularization effect, but it does not reliably reduce model size or latency on standard CPUs due to the overhead associated with pruning masks. Dynamic quantization substantially reduces model size (particularly for CNN–GRU) while preserving predictive accuracy; however, it increases runtime latency because of additional quantization and dequantization operations. These findings demonstrate that compression techniques commonly used for large models do not necessarily translate into deployment benefits for already compact RUL architectures and highlight the importance of hardware-aware evaluation when designing edge prognostics systems.

1. Introduction

Remaining Useful Life (RUL) prediction is a key aspect of Prognostics and Health Management (PHM). RUL denotes the operating time or number of cycles remaining before a component or system fails, estimated from its current state and historical usage [1,2,3]. Accurate RUL predictions enable condition-based and predictive maintenance strategies, which minimize unplanned downtime, increase asset availability, and improve safety across industries such as aerospace, energy, transportation, and manufacturing [4,5]. At the same time, the rapid growth of Industry 4.0 technologies and the Industrial Internet of Things (IIoT) is generating larger volumes and higher frequencies of sensor data from complex industrial assets [6,7]. This evolution brings both opportunities and challenges: richer data streams support more accurate RUL models [8,9], yet deploying those models on edge devices imposes strict limits on available memory, computational power, and energy consumption [10].
Over the last decade, deep learning has become the dominant data-driven paradigm for RUL prediction [1,11]. A wide variety of architectures have been investigated, including multilayer perceptrons, convolutional neural networks (CNNs), recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, temporal convolutional networks (TCNs), and, most recently, attention-based and transformer-style models [12,13].
Recent advances in condition monitoring have further expanded the landscape of predictive maintenance beyond direct end-to-end RUL regression. For example, uncertainty-aware sensorless frameworks have been proposed that rely solely on motor driver signals to construct interpretable degradation indicators without requiring additional physical sensors, thereby reducing hardware cost and deployment complexity in industrial systems [14]. Such approaches often employ lightweight neural architectures combined with signal decomposition techniques to extract smooth degradation trends and enable robust early anomaly detection under multiple operating conditions. In parallel, vibration-based diagnostic methods for rotating machinery (such as planetary gearboxes operating under speed-varying conditions) have emphasized advanced cyclo-non-stationary signal analysis, including order-frequency cyclic spectral coherence and adaptive band selection strategies to enhance fault detectability under non-steady regimes [15].
The NASA C-MAPSS dataset was released to the public in 2008 and has since evolved into a quasi-standard benchmark for neural-network-based RUL prediction on turbofan engines [16]. The vast majority of papers in the RUL literature use some version of C-MAPSS to examine prediction methods, input encodings, and degradation scenarios [17]. While RUL models have improved dramatically in predictive accuracy, few have been optimized for deployment on edge devices with limited real-time inference capabilities; instead, most state-of-the-art RUL models were developed under the assumption of effectively unlimited computational resources at prediction time [18,19]. Although these architectures improve RUL accuracy, they typically carry large numbers of learnable parameters and demand far more floating-point operations per prediction than low-power edge devices can comfortably supply. Accordingly, many recent studies have noted that their models remain too computationally complex to be deployed in real time on low-power edge devices [20,21]. To this end, researchers have developed an extensive toolbox of model compression techniques to reduce the memory footprint, latency, and energy consumption of model inference while maintaining predictive performance [22]. Examples include unstructured and structured pruning, quantization of weights and activations, low-rank factorization, knowledge distillation, and combinations of the above [23].
However, there is still a noticeable gap between deep-learning-based PHM and edge-oriented model compression. A few recent works have explored pruning- or attention-based lightweight RUL models, often focusing on advanced architectures such as transformers or graph neural networks [24]. Yet systematic studies that examine how classical compression techniques behave on already compact RUL architectures, especially across multiple C-MAPSS subsets with varying operating conditions and fault modes, are still limited. Most edge-AI compression evaluations focus on large-scale image or language models, where networks have millions of parameters and obvious redundancy. It is not clear whether the same methods provide meaningful benefits for small, task-specific models with tens of thousands of parameters that are more representative of realistic embedded PHM deployments [25].
From a practical perspective, industrial users and control engineers often prefer simple and stable architectures that can be tuned, validated, and certified more easily than very deep networks. For prognostics on IIoT-enabled machinery, a modest CNN–GRU network with only tens of thousands of parameters can already offer good accuracy while remaining interpretable enough to integrate with existing monitoring pipelines [26]. The key question of this study is how to systematically characterize the trade-offs between accuracy, memory, and latency when applying standard compression techniques to such compact RUL models.
Despite rapid progress in deep-learning-based RUL prediction, several practical deployment challenges remain insufficiently addressed. In particular:
  • Most state-of-the-art RUL models prioritize predictive accuracy while overlooking deployment constraints such as CPU latency, memory footprint, and real-time behavior.
  • Model compression techniques (e.g., pruning and quantization) are often evaluated on large-scale vision or language models, but their effectiveness on already compact PHM architectures remains unclear.
  • There is limited systematic evidence on how compression behavior varies across degradation complexity (single vs. multiple operating conditions and fault modes).
  • Improvements in model size are frequently assumed to imply improvements in runtime efficiency, which may not hold in CPU-only edge deployment scenarios.
  • The interaction between model architecture (recurrent vs. convolutional temporal modeling) and post-training compression remains underexplored for compact RUL predictors.
To address these gaps, this study makes the following contributions:
  • Development and validation of two compact hybrid architectures (CNN–GRU and CNN–TCN) with approximately 28k–32k parameters for realistic CPU-based deployment scenarios.
  • A systematic, multi-dataset evaluation (FD001–FD004) analyzing how degradation complexity influences compression behavior.
  • Controlled assessment of unstructured magnitude pruning using a matched fine-tuning baseline (0.0 sparsity control) to isolate sparsification effects from additional training.
  • Deployment-oriented benchmarking, including predictive accuracy, serialized model size, and rigorously measured single-sample CPU inference latency.

2. Methodology

2.1. Problem Formulation

The objective of RUL prediction is to estimate the number of operating cycles for which a system can continue to function before reaching a predefined failure threshold. In this work, the system under consideration is a turbofan engine operating under time-varying conditions, as represented in the NASA C-MAPSS benchmark datasets. The benchmark is organized into four subsets: FD001 (single operating condition, one fault mode), FD002 (multiple operating conditions, one fault mode), FD003 (single condition, two fault modes), and FD004 (multiple conditions, two fault modes) [16]. Let $x_i \in \mathbb{R}^C$ denote the vector of $C$ sensor measurements and operational settings observed at operating cycle $i$ for a given engine. At a prediction time step $t$, the RUL model receives as input a fixed-length temporal window consisting of the most recent $L$ cycles, $X_t = [x_{t-L+1}, x_{t-L+2}, \ldots, x_t]$, and produces a scalar output $\hat{y}_t \in \mathbb{R}$, which represents the predicted remaining useful life at cycle $t$. The ground-truth RUL, denoted by $y_t$, is defined as the number of remaining cycles from the current time step $t$ until the end of life of the engine (i.e., the final observed operating cycle prior to failure). Following standard practice in the C-MAPSS benchmark, RUL values may be capped during training to mitigate the dominance of early-life samples.
A supervised learning approach was used to train the models by minimizing the mean squared error (MSE) between predicted and actual RUL values over all training samples. The training objective is thus $\mathrm{MSE} = \frac{1}{N}\sum_{t=1}^{N}(\hat{y}_t - y_t)^2$, where $N$ is the number of training samples. This formulation allows different temporal modeling architectures to be compared within a common framework under identical data preprocessing, training, and evaluation conditions.

2.2. Baseline Model Architecture

In this study, two compact temporal models were designed to examine the accuracy–efficiency trade-off under realistic CPU deployment constraints: CNN–GRU and CNN–TCN (Figure 1). The two architectures were intentionally kept small, with approximately 28k and 32k parameters, respectively, so that their compression behavior could be evaluated in a setting more representative of embedded and resource-constrained PHM applications than large-scale deep networks.
Both models share the same lightweight front-end feature extractor and operate on the fixed-length multivariate input window defined in Section 2.1. Local temporal patterns are first extracted using two 1D convolutional layers applied along the temporal dimension. The first Conv1D layer maps the $C$ input channels to 32 feature maps using kernel size 3, stride 1, and padding 1, followed by a ReLU activation. A second Conv1D layer then maps 32 channels to 32 channels using the same kernel configuration, again followed by ReLU. No temporal pooling is applied in this shared feature extractor, thereby preserving the full sequence resolution before temporal modeling.
The CNN–GRU model uses a single-layer unidirectional GRU after the shared CNN front-end. The GRU receives 32-dimensional features at each time step and uses a hidden size of 64. Its final hidden state is used as a compact sequence representation and is passed to a lightweight regression head consisting of two fully connected layers, $64 \to 64 \to 1$, with a ReLU activation between them. The output is a scalar RUL estimate. This architecture is intended to capture cumulative degradation behavior and longer-range temporal dependencies while remaining computationally lightweight.
The CNN–TCN model replaces the recurrent layer with a residual temporal convolutional network. After the shared CNN front-end, the model applies four residual TCN blocks with constant channel width 32 and dilation factors {1, 2, 4, 8}. Each block contains two dilated Conv1D layers with a kernel size of 3, ReLU activations, dropout, and a residual connection. This design progressively enlarges the temporal receptive field while preserving a compact parameter budget. After the final TCN block, the output sequence is aggregated by selecting the representation at the last time step, which is then passed to a lightweight regression head of the form $32 \to 64 \to 1$. In this way, the CNN–TCN provides a convolution-only alternative to recurrent temporal modeling while keeping the feature extraction and regression stages comparable.
Together, these two architectures enable a controlled comparison between recurrent sequence modeling and convolution-only temporal modeling under nearly identical front-end and regression settings. This design isolates the effect of the choice of temporal model on predictive accuracy, model size, and CPU deployment behavior.
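To make the two designs concrete, below is a minimal PyTorch sketch consistent with the description above. Layer dimensions are as specified in this section; the symmetric padding in the TCN blocks, the dropout placement, and the 0.1 dropout rate (taken from Section 3.3) are assumptions. Under these assumptions, the sketch reproduces the reported parameter counts of 28,481 (CNN–GRU) and 32,449 (CNN–TCN).

```python
import torch
import torch.nn as nn

def cnn_frontend(c_in: int = 24) -> nn.Sequential:
    # Shared front-end: two Conv1D layers (kernel 3, stride 1, padding 1), ReLU, no pooling.
    return nn.Sequential(
        nn.Conv1d(c_in, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        nn.Conv1d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    )

class CNNGRU(nn.Module):
    def __init__(self, c_in: int = 24):
        super().__init__()
        self.cnn = cnn_frontend(c_in)
        self.gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)  # single layer, unidirectional
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))  # 64 -> 64 -> 1

    def forward(self, x):                      # x: (B, L, C)
        z = self.cnn(x.transpose(1, 2))        # Conv1d expects (B, C, L)
        _, h = self.gru(z.transpose(1, 2))     # final hidden state: (1, B, 64)
        return self.head(h[-1]).squeeze(-1)    # scalar RUL per sample

class TCNBlock(nn.Module):
    def __init__(self, dilation: int, dropout: float = 0.1):
        super().__init__()
        pad = dilation  # preserves sequence length for kernel size 3
        self.net = nn.Sequential(
            nn.Conv1d(32, 32, 3, padding=pad, dilation=dilation), nn.ReLU(), nn.Dropout(dropout),
            nn.Conv1d(32, 32, 3, padding=pad, dilation=dilation), nn.ReLU(), nn.Dropout(dropout),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection

class CNNTCN(nn.Module):
    def __init__(self, c_in: int = 24):
        super().__init__()
        self.cnn = cnn_frontend(c_in)
        self.tcn = nn.Sequential(*[TCNBlock(d) for d in (1, 2, 4, 8)])  # dilations {1, 2, 4, 8}
        self.head = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))  # 32 -> 64 -> 1

    def forward(self, x):                          # x: (B, L, C)
        z = self.tcn(self.cnn(x.transpose(1, 2)))  # (B, 32, L)
        return self.head(z[:, :, -1]).squeeze(-1)  # representation at the last time step

for m in (CNNGRU(), CNNTCN()):
    print(sum(p.numel() for p in m.parameters()))  # 28481 and 32449, matching Section 2.2
```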
It is important to note that this study deliberately focuses on post-training compression techniques that do not alter the underlying network architecture. Structured pruning, low-rank factorization, and Quantization-Aware Training (QAT) are intentionally excluded to isolate deployment-scale effects and assess what can realistically be achieved when models are compressed after training, as is common in industrial and legacy PHM pipelines. While such architecture-aware techniques are promising, they introduce additional design complexity and retraining requirements that fall outside the scope of this work and warrant future investigation.

2.3. Compression Methods

To assess deployment efficiency beyond architectural design alone, we investigate a post-training compression pipeline comprising unstructured magnitude pruning and dynamic quantization (DQ), as illustrated in Figure 1. Both techniques are commonly used in practical edge-AI workflows and can be applied after baseline model training without altering the original optimization objective. We apply global unstructured magnitude pruning to selected weight tensors in the Conv1D layers and fully connected regression heads of both architectures at multiple target sparsity levels. Parameters with small absolute magnitudes are progressively set to zero under the assumption that they contribute marginally to the final prediction. Following pruning, each model undergoes a brief fine-tuning stage of 5 additional epochs using the same optimizer and learning-rate schedule as in baseline training. This limited fine-tuning is intended solely to recover potential accuracy degradation induced by sparsification. To explicitly account for the potential confounding effect of additional training, we introduce a matched control condition corresponding to a pruning target of 0.0, in which the model undergoes the same five epochs of fine-tuning but no parameters are pruned. With this control in place, we can determine whether performance gains are attributable to the sparsification process itself rather than to the additional optimization iterations. All pruning results are therefore interpreted relative to this matched control rather than to the original 50-epoch baseline alone. It is important to note that unstructured pruning does not alter tensor shapes or layer connectivity. Consequently, inference still executes dense kernels on conventional CPUs, and reductions in inference latency or floating-point operations are not guaranteed without sparse-aware kernels or specialized hardware support. In this study, unstructured pruning is therefore evaluated primarily as a tool for probing model redundancy and potential regularization effects, rather than as a direct mechanism for deployment-time acceleration or memory reduction.
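As a minimal sketch of this pruning step, assuming the model classes sketched in Section 2.2, the helper below applies PyTorch's global L1-magnitude pruning to the Conv1D and fully connected weights; calling it with amount=0.0 reproduces the matched fine-tuning control:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def apply_global_pruning(model: nn.Module, amount: float) -> nn.Module:
    # Select the weight tensors of Conv1D layers and fully connected heads.
    targets = [(m, "weight") for m in model.modules()
               if isinstance(m, (nn.Conv1d, nn.Linear))]
    # Global magnitude criterion: zero the smallest |w| across all selected tensors.
    # amount=0.0 reproduces the matched control (same 5-epoch fine-tuning, no sparsity).
    prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=amount)
    return model  # pruned modules now carry weight_orig and weight_mask buffers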
In parallel, we employ post-training dynamic INT8 quantization using PyTorch’s standard eager-mode quantization workflow. For the CNN–GRU architecture, DQ is applied to the GRU and fully connected layers, while convolutional layers remain in FP32. For the CNN–TCN architecture, DQ affects only the fully connected regression head, since Conv1D layers are not dynamically quantized in the standard PyTorch pipeline. Importantly, DQ is applied without retraining, which aligns with industrial scenarios where access to training data or retraining resources may be limited.
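The corresponding quantization step is a sketch using PyTorch's eager-mode API, with the model variable names assumed for illustration:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# CNN–GRU: GRU and Linear layers are converted to INT8; Conv1D layers stay FP32.
q_cnn_gru = quantize_dynamic(cnn_gru_model, {nn.GRU, nn.Linear}, dtype=torch.qint8)

# CNN–TCN: only the fully connected regression head is dynamically quantizable here.
q_cnn_tcn = quantize_dynamic(cnn_tcn_model, {nn.Linear}, dtype=torch.qint8)
```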

2.4. Evaluation Metrics

Edge prognostics performance is evaluated from two perspectives, predictive performance and deployment behavior, which together characterize the accuracy–efficiency trade-off. The accuracy metrics are Root Mean Square Error (RMSE), which is sensitive to large deviations, and Mean Absolute Error (MAE), which averages absolute deviations. The entire measurement procedure was repeated five times, and the reported latency values correspond to the mean ± standard deviation of the median latency across these repetitions. Reporting variability is important because runtime performance can fluctuate due to operating system scheduling, CPU cache behavior, and background processes. In contrast, the accuracy metrics (RMSE and MAE) for the baseline and dynamically quantized models are deterministic because they are computed from a single fixed trained model evaluated on the complete test set; repeated evaluations therefore yield identical values. For the pruning experiments, however, the additional sparsification and fine-tuning stage introduces stochastic optimization effects, so RMSE/MAE variability across runs is reported.
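For reference, RMSE and MAE follow their standard definitions over the $N$ test samples:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(\hat{y}_t - y_t\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|\hat{y}_t - y_t\right|.
```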

3. Dataset and Experimental Setup

3.1. Dataset Description: NASA C-MAPSS

The evaluation of the proposed frameworks is conducted on the NASA C-MAPSS turbofan engine degradation simulation dataset, a popular benchmark for assessing new PHM methods [16]. The dataset contains multivariate time-series measurements of multiple simulated turbofan engines operating under different flight conditions and fault modes. Each record comprises three operational settings and twenty-one sensor variables, together with an engine identifier and a cycle number. After removing the engine identifier and cycle number, the data recorded at each cycle can be represented as a 24-dimensional feature vector. Training trajectories record each engine's complete run-to-failure history, from healthy operation to end of life. Test trajectories, in contrast, are truncated before failure, and the true RUL at the last recorded cycle of each test engine is provided in a separate ground-truth file. To facilitate accurate reproduction and comparison with previous methods, we use the standard NASA training and testing splits.
C-MAPSS is divided into four subsets according to the variability of operating conditions and fault complexity [16]. Subset FD001 involves one operating condition (OC) and one fault mode (FM), the simplest degradation scenario. Subset FD002 adds multiple OCs but retains a single FM. Subset FD003 has one OC and two FMs. Finally, subset FD004 combines multiple OCs and FMs, making it the most difficult subset. These complexity levels enable comparison of an algorithm's robustness across different types of degradation. To support model selection and fine-tuning after pruning, a small portion of the training data is set aside as a trajectory-level validation set. Validation data are used only during training and hyperparameter searches and are not included in performance reporting.

3.2. Preprocessing and Data Preparation

Prior to model training, the raw C-MAPSS time series undergo a consistent preprocessing pipeline. First, all sensor channels and operational settings are standardized using z-score normalization, with the mean and standard deviation computed exclusively from the training set and then applied to the validation and test sets. This prevents information leakage and ensures stable optimization across features with different physical scales. RUL labels are constructed as the number of cycles remaining until the final cycle of each engine trajectory. To reduce the dominance of early-life samples and align with common practice in the C-MAPSS literature, RUL values were capped during training. We also performed a small sensitivity check on the cap value: among the tested settings (100, 125, and 150 cycles), a cap of 125 cycles provided the best overall balance between predictive accuracy and training stability for the compact models considered here. A smaller cap compressed the target range too aggressively, whereas a larger cap increased target variability and reduced the benefit of compact modeling. Accordingly, the RUL target at cycle $t$ was defined as $\mathrm{RUL}_t = \min(\text{cycles to EOL}, 125)$.
For sequence-based learning, each engine's degradation history is divided into fixed-length segments using a sliding window of length 50. Each input sample comprises 50 consecutive cycles, and the target is the RUL at the last cycle of the window. This converts variable-length trajectories into a large number of equally sized supervised samples suitable for mini-batch training. To maintain information consistency and follow common PHM practice, every feature is retained in each sample (the 21 sensors and the three operational settings). For stochastic optimization, the samples are shuffled and combined into mini-batches during training, whereas during validation and testing they keep their original order to preserve temporal dependencies. After processing, the input tensors have shape $(B, L, C)$, where $B$ is the batch size, $L = 50$ is the window length, and $C = 24$ is the total number of sensor and operational-setting channels.
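A minimal NumPy sketch of the labeling and windowing logic, with helper names chosen for illustration (file parsing and per-engine grouping omitted):

```python
import numpy as np

L, RUL_CAP = 50, 125

def zscore_fit(train_x):
    # Normalization statistics from the training set only (prevents leakage).
    return train_x.mean(axis=0), train_x.std(axis=0) + 1e-8

def make_windows(traj):
    # traj: (T, 24) z-scored trajectory of one run-to-failure engine,
    # i.e. traj = (raw - mu) / sigma with (mu, sigma) from zscore_fit.
    T = traj.shape[0]
    rul = np.minimum(T - 1 - np.arange(T), RUL_CAP)   # capped cycles-to-EOL at each cycle
    xs = np.stack([traj[t - L + 1 : t + 1] for t in range(L - 1, T)])  # (N, L, 24)
    ys = rul[L - 1 :]                                  # RUL at the last cycle of each window
    return xs.astype(np.float32), ys.astype(np.float32)
```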

3.3. Experimental Setup

All models were trained using the Adam optimizer with an initial learning rate of 1 × 10−3. A cosine decay schedule was applied over 50 training epochs, smoothly reducing the learning rate to 1 × 10−4 by the final epoch. The batch size was fixed at 256 for all experiments, and L2 weight decay was set to 1 × 10−5. MSE was used as the optimization loss. To enhance training stability, the maximum allowable gradient norm was set to 1.0, and a dropout rate of 0.1 was used. Models were trained for 50 epochs on each of the four C-MAPSS subsets (FD001–FD004) using the same preprocessing, window length (L = 50), and optimization settings. The random seed was set to 42 for Python, NumPy, and PyTorch to enable reproducibility, and deterministic behavior was requested from the backend where supported; complete determinism is nevertheless not guaranteed for some lower-level CPU operations.
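A sketch of this optimization setup in PyTorch, assuming model and train_loader from the preceding sections and abbreviating the epoch loop:

```python
import torch

torch.manual_seed(42)  # alongside seeding Python's random module and NumPy

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
# Cosine decay from 1e-3 to 1e-4 over 50 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-4)
loss_fn = torch.nn.MSELoss()

for epoch in range(50):
    for xb, yb in train_loader:   # batch size 256
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        # Gradient clipping for training stability (max norm 1.0).
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()
```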
To evaluate deployment behavior across different CPU architectures, inference latency was measured on two platforms. The first was a desktop workstation running Windows 10 with an Intel x86_64/AMD64 processor, 24 physical cores, 32 logical cores, and 31.64 GB RAM, using Python 3.14.0 and PyTorch 2.9.1+cpu. The second was a cloud-based ARM platform running Linux 6.8.0-1044-aws on aarch64 architecture, with 2 physical cores, 2 logical cores, and 1.8 GB RAM, using Python 3.10.12 and PyTorch 2.10.0+cpu. This dual-platform setup allows the latency behavior of the proposed models and compression methods to be examined in both a high-performance desktop CPU and a resource-constrained ARM environment, more representative of edge-oriented deployment conditions.
Latency measurements were performed using single-sample inference (batch size = 1) to emulate real-time online prediction. For each model configuration, 50 warm-up runs were executed to mitigate caching and initialization effects, followed by 200 timed inference runs. The reported latency corresponds to the median runtime with the corresponding variability across repeated runs, providing a robust estimate of real-time inference performance.
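The timing protocol can be summarized in a short sketch (the five-repetition outer loop from Section 2.4 wraps this function):

```python
import time
import statistics
import torch

def median_latency_ms(model, sample, warmup=50, runs=200):
    # sample: a single input window with batch size 1, e.g. shape (1, 50, 24).
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):            # mitigate caching and initialization effects
            model(sample)
        times = []
        for _ in range(runs):              # timed single-sample inference runs
            t0 = time.perf_counter()
            model(sample)
            times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)
```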
All experiments were implemented in Python using the PyTorch deep learning framework. Runtime environment details—including CPU architecture, operating system, and PyTorch threading configuration—were recorded alongside latency measurements to support reproducibility. Model size, parameter counts, and sparsity statistics were extracted using PyTorch utilities and verified using serialized model checkpoints.
To ensure fair comparisons across model architectures and compression methods, the PyTorch CPU backend threading configuration was kept constant throughout all experiments. The intra-op and inter-op thread settings were not individually optimized for each model configuration. Although CPU inference latency can be sensitive to thread scheduling and system load, maintaining a fixed threading environment ensures that the reported latency differences primarily reflect the computational characteristics of the evaluated models rather than variations in runtime configuration.

4. Results and Discussion

Table 1 summarizes the baseline performance of the two compact architectures across all four C-MAPSS subsets, evaluated on two different CPU platforms: a desktop Intel i9 (x86) workstation and an AWS ARM (aarch64) cloud instance. CNN–GRU has 28,481 parameters (0.113 MB), while CNN–TCN is slightly larger at 32,449 parameters (0.133 MB). Since RMSE and MAE reflect the predictive capability of the trained models rather than the execution hardware, the results are largely consistent across platforms, with only minor numerical differences due to variations in runtime environments and floating-point arithmetic. In contrast, inference latency is clearly dependent on the hardware architecture. As expected, the high-performance Intel i9 platform achieves significantly lower runtimes (approximately 0.59–0.69 ms) compared with the resource-constrained ARM instance (approximately 1.75–1.89 ms). Despite this difference in absolute runtime, the relative behavior of the two models remains consistent across platforms: CNN–TCN generally achieves slightly lower latency due to the parallelizable nature of convolutional operations, while CNN–GRU typically provides better predictive accuracy, particularly on FD001–FD003. These baseline results establish a consistent accuracy–efficiency trade-off that serves as a reference point for the pruning and quantization experiments presented in the following sections.
To better contextualize the baseline errors reported in Table 1, we compare our compact CNN–GRU and CNN–TCN models with two widely cited lightweight approaches evaluated on the C-MAPSS benchmark. Zheng et al. [27] reported an LSTM-based architecture achieving RMSE values of 18.45 (FD001), 30.29 (FD002), 19.82 (FD003), and 29.16 (FD004). Zhao et al. [28] proposed a double-channel CNN–BiLSTM hybrid model that prioritizes prediction accuracy and reported substantially lower errors of 12.58/19.34/12.18/20.03 on FD001–FD004, respectively. Compared with these accuracy-oriented designs, the models in this study are intentionally positioned at a different point in the accuracy–efficiency trade-off, prioritizing compact model size and CPU deployment capability. As shown in Table 1, the proposed CNN–GRU baseline achieves RMSE values of 18.21/29.36/16.19/31.33 on the Intel i9 platform and 17.92/30.20/17.05/31.73 on the AWS ARM platform (FD001–FD004). Similarly, CNN–TCN achieves RMSE values of 25.82/30.95/23.23/31.62 on Intel i9 and 23.66/31.10/23.39/31.63 on AWS ARM. Although these errors are slightly higher than those reported by larger accuracy-oriented architectures, the proposed models maintain extremely compact parameter counts (≈28k–32k parameters) and very small serialized model sizes (0.113–0.133 MB), while achieving sub-millisecond inference latency on the Intel i9 CPU and approximately 1.7–1.9 ms latency on the AWS ARM platform.

4.1. Unstructured Magnitude Pruning

We next evaluate global unstructured magnitude pruning as a post-training compression strategy for both baseline architectures: CNN–GRU and CNN–TCN. Pruning was applied to the convolutional and fully connected layers of both models, and to the temporal convolution layers within the TCN blocks. For each dataset, pruning ratios of 0.3, 0.5, 0.7, and 0.9 were examined, followed by five epochs of fine-tuning to recover potential performance loss. The best pruning level for each dataset was selected according to the lowest test RMSE. To examine deployment relevance, we report not only RMSE and MAE, but also single-sample CPU inference latency on both the Intel i9 desktop CPU and the AWS ARM platform.
Table 2 and Table 3 show that unstructured pruning can improve predictive accuracy in several cases, but its impact is strongly dependent on both the dataset and the underlying architecture. For CNN–GRU (Table 2), pruning yields clear gains on FD001, FD002, and FD004 on both platforms, with the best sparsity levels ranging from 0.3 to 0.9. On the AWS ARM platform, the largest improvements are observed for FD001 and FD003, where RMSE decreases from 17.92 to 15.83 and from 17.05 to 15.46, respectively. On the Intel i9 platform, the improvements are more moderate, with FD001 decreasing from 18.21 to 16.70 and FD004 from 31.33 to 30.26. However, the effect is not universally beneficial: on FD003 with the Intel platform, pruning slightly worsens RMSE relative to the baseline. Overall, these results indicate that pruning may act as a form of regularization in compact recurrent architectures, but the magnitude of benefit is dataset- and platform-dependent.
For CNN–TCN (Table 3), the best pruning setting is generally 0.7 sparsity on both platforms, except for FD004 on the Intel i9 CPU, where 0.5 sparsity gives the best result. Compared with CNN–GRU, CNN–TCN exhibits stronger pruning-related gains in several subsets, especially on FD003, where RMSE decreases from 23.39 to 18.51 on AWS ARM and from 23.23 to 20.26 on Intel i9. Likewise, FD001 and FD002 show consistent reductions in RMSE under moderate-to-high sparsity. These results suggest that convolutional temporal modeling may contain a greater degree of removable redundancy than the compact recurrent alternative, particularly in the multi-fault regimes.
Despite these accuracy improvements, the deployment implications of unstructured pruning are less favorable. On both platforms, the serialized model size increases rather than decreases. For CNN–GRU, the checkpoint size grows from approximately 0.113 MB to 0.151 MB, while for CNN–TCN, it increases from 0.133 MB to 0.260 MB. This occurs because PyTorch’s pruning implementation stores pruning masks and reparameterized weights in the saved checkpoint. Thus, although many parameters are set to zero, the serialized artifact grows unless additional pruning is applied or a sparse deployment format is used.
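Note that this mask overhead is a property of the serialization path rather than of sparsity itself; if checkpoint size mattered, the reparameterization could be collapsed before saving, as in the sketch below (assuming a model pruned with the helper from Section 2.3):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# After pruning, each module stores a dense weight_orig plus a same-sized
# weight_mask, so the serialized checkpoint grows. prune.remove() folds the
# zeros into a single dense tensor, restoring the original file size.
for m in model.modules():
    if isinstance(m, (nn.Conv1d, nn.Linear)) and prune.is_pruned(m):
        prune.remove(m, "weight")
torch.save(model.state_dict(), "pruned_dense.pt")
```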
A similar pattern is observed for runtime performance. On the AWS ARM platform, pruning consistently increases inference latency for both models. For example, CNN–GRU on FD001 increases from 1.875 ± 0.022 ms at baseline to 2.425 ± 0.045 ms at 0.9 sparsity, while CNN–TCN on FD001 increases from 1.862 ± 0.013 ms to 2.305 ± 0.019 ms at 0.7 sparsity. For most AWS experiments, the latency increase is modest but systematic, indicating that unstructured sparsity does not translate into actual acceleration on standard dense CPU kernels. On the Intel i9 platform, the behavior is more mixed but generally follows the same conclusion: latency is often unchanged or slightly increased after pruning. In a few cases, latency changes are irregular due to hardware-specific runtime effects, but pruning does not provide consistent speedup on either CPU platform.
Taken together, these results show that unstructured magnitude pruning can improve accuracy for compact RUL models but does not reliably improve deployment efficiency on standard CPUs. The improvements in RMSE indicate that sparsity may serve as a useful regularizer, especially for CNN–TCN and for more complex degradation settings such as FD003 and FD004. However, because the pruning masks increase checkpoint size and dense CPU execution does not exploit sparsity effectively, the resulting models are not necessarily smaller or faster in practice. Therefore, within the present deployment setting, unstructured pruning should be interpreted primarily as an accuracy-oriented regularization mechanism rather than a true runtime compression method. These findings motivate the need for more hardware-aware approaches, such as structured pruning, sparse-kernel execution, or low-precision quantization, when the goal is to improve real deployment efficiency rather than only predictive performance.

4.2. Dynamic Quantization

To further investigate deployment-oriented compression, we evaluated DQ for both architectures. DQ was applied to the linear layers of the trained CNN–GRU and CNN–TCN models using the PyTorch CPU backend. This approach converts floating-point weights to INT8 representations, while activations remain in floating point and are dynamically quantized at runtime. The main objective of this experiment was to examine whether quantization can reduce model size and improve inference efficiency without significantly degrading predictive accuracy. The results are summarized in Table 4.
Across all datasets, DQ produces negligible changes in predictive accuracy for both architectures. For the CNN–GRU model, the RMSE differences between the FP32 baseline and the quantized model remain very small. For example, on the Intel i9 platform, RMSE changes from 18.21 to 18.29 on FD001 and from 16.19 to 16.21 on FD003, while on AWS ARM the corresponding values change from 17.92 to 17.76 and 17.05 to 17.14, respectively. Similar stability is observed on FD002 and FD004. The CNN–TCN model exhibits comparable behavior, with RMSE differences remaining within a narrow range across all datasets. For instance, on Intel i9, the RMSE for FD003 decreases slightly from 23.23 to 23.04, while on AWS ARM it decreases from 23.39 to 23.33. These results confirm that DQ preserves the models’ predictive capability, with only minimal numerical variation introduced by reduced precision.
In terms of model size, DQ produces a substantial reduction for CNN–GRU, while the effect on CNN–TCN is relatively small. The serialized size of the CNN–GRU decreases from approximately 0.113 MB to 0.0506 MB, a reduction of about 55%. This reduction occurs because the GRU architecture contains several large linear weight matrices that can be efficiently quantized to INT8. In contrast, the CNN–TCN model shows only a slight reduction in serialized size, from approximately 0.133 MB to 0.129 MB, since a large portion of the parameters are located in convolutional layers that are not quantized by the DQ scheme used here.
Despite the favorable reduction in model size, runtime latency consistently increases after quantization on both CPU platforms. For CNN–GRU, the median inference latency on the Intel i9 platform increases from approximately 0.70 ms to about 1.30 ms, while on AWS ARM it increases from around 1.87–1.95 ms to roughly 2.86–3.05 ms depending on the dataset. A similar trend is observed for CNN–TCN. On Intel i9, latency increases from approximately 0.61–0.62 ms for the FP32 model to about 0.71–0.73 ms after quantization. On AWS ARM, latency increases from roughly 1.90–1.97 ms to about 2.20–2.26 ms. These results indicate that DQ introduces additional computational overhead from runtime quantization and dequantization, which can outweigh the theoretical advantages of reduced numerical precision in CPU-based inference. PyTorch dynamic quantization primarily accelerates linear layers, while convolution layers remain in FP32; intermediate tensors must be repeatedly converted between numerical formats. For small models with low arithmetic intensity, these conversion and memory-access costs can dominate the overall runtime, causing the quantized model to exhibit higher latency than the FP32 baseline despite the reduced numerical precision.
Overall, the results in Table 4 demonstrate that dynamic quantization is effective for reducing model size, particularly for CNN–GRU, while maintaining nearly identical predictive accuracy. However, under the evaluated CPU deployment conditions, the technique does not accelerate inference but instead increases runtime latency. This behavior highlights an important practical consideration: although quantization can significantly reduce memory footprint, the performance benefits depend strongly on the underlying hardware support and the specific operations present in the model architecture. Consequently, for the compact RUL prediction models studied here, dynamic quantization should primarily be viewed as memory-efficiency optimization rather than a latency optimization strategy when deployed on general-purpose CPUs.

4.3. RMSE and Latency Comparison Across Compression Methods

Figure 2 compares prediction accuracy (RMSE) among the baseline FP32 models, unstructured pruning, and DQ for the CNN–GRU architecture on the four C-MAPSS datasets (FD001–FD004) across the Intel i9 (x86) and AWS ARM platforms. Overall, pruning provides the most consistent improvement in prediction accuracy across datasets. For instance, on FD001, the pruned model achieves RMSE values of 16.70 cycles on Intel i9 and 15.83 cycles on AWS ARM, improving upon the corresponding baseline results of 18.21 and 17.92 cycles, respectively. Similar improvements are observed on FD002 and FD004, where pruning reduces the RMSE by approximately 0.22–1.59 cycles depending on the platform. The largest improvement occurs on FD003 on the AWS ARM platform, where RMSE decreases from 17.05 to 15.46 cycles, indicating that sparsity can sometimes act as a regularizer, improving generalization.
In contrast, DQ has only a minimal effect on prediction accuracy. Across all datasets and platforms, the RMSE difference between the FP32 baseline and the dynamically quantized model remains small, typically within ±0.2 cycles. These results indicate that DQ preserves the model's predictive capability while operating with reduced numerical precision. Overall, the RMSE comparison shows that pruning slightly improves predictive performance, whereas DQ primarily maintains baseline accuracy without noticeable degradation.
Figure 3 compares the inference latency of the same model configurations across both CPU platforms. The results reveal clear differences in runtime behavior between compression techniques. The baseline FP32 models exhibit the lowest latency on the Intel i9 platform, with median inference times around 0.69–0.70 ms, depending on the dataset. On the AWS ARM platform, the corresponding baseline latency is approximately 1.85–1.95 ms, reflecting the smaller ARM instance’s reduced computational capacity.
After pruning, latency generally increases slightly on both platforms. On Intel i9, the pruned model latency remains close to the baseline, typically around 0.71 ms, although certain configurations (such as FD003) show higher runtime due to sparsity overhead and implementation effects. On the AWS ARM platform, pruning increases latency more noticeably, ranging from 2.00 to 2.42 ms. This behavior occurs because the unstructured sparsity masks introduce additional memory and indexing overhead that general-purpose CPU kernels cannot fully exploit.
DQ results in the largest increase in latency among the evaluated compression techniques. On the Intel i9 processor, the dynamically quantized model requires approximately 1.30 ms per inference, nearly doubling the baseline runtime. The effect is even stronger on the AWS ARM platform, where latency increases to approximately 2.86–3.05 ms across the datasets. This behavior arises from the additional quantization and dequantization operations performed during runtime, which introduce computational overhead in CPU-based execution environments.
Taken together, Figure 2 and Figure 3 highlight the trade-off between accuracy, model compression, and runtime efficiency. While pruning can provide modest improvements in predictive accuracy with only limited runtime impact on powerful CPUs, dynamic quantization significantly reduces model size but does not necessarily accelerate inference on general-purpose CPUs. These results emphasize that the effectiveness of compression techniques for edge deployment depends not only on the model architecture but also on the characteristics of the target hardware platform.

4.4. Edge Deployment and Compression Suitability Across Degradation Complexity

The results summarized in Table 5 provide a dataset-level perspective on the suitability of different model architectures and compression strategies across varying degrees of degradation in the C-MAPSS benchmark. The four datasets represent progressively more challenging scenarios, ranging from a single operating condition with one fault mode (FD001) to multiple operating conditions and fault types (FD004). These differences influence both the models’ predictive performance and the effectiveness of compression techniques.
Overall, the results highlight that the effectiveness of compression strategies depends strongly on the dataset’s degradation complexity and the model’s architectural characteristics. Pruning can provide modest improvements in prediction accuracy across several scenarios, but it does not consistently improve inference speed due to the overhead associated with unstructured sparsity on general-purpose CPUs. DQ, on the other hand, primarily reduces model size while preserving prediction accuracy, but it introduces additional runtime overhead in CPU-based inference. These observations reinforce the importance of evaluating compression techniques under realistic hardware constraints when designing RUL prediction models for edge deployment.
For simpler scenarios such as FD001, pruning applied to CNN–GRU can provide measurable accuracy improvements while maintaining acceptable runtime performance, whereas the CNN–TCN baseline remains the most latency-efficient option for strict real-time constraints. For datasets with multiple operating conditions, such as FD002, pruning yields smaller benefits, and the baseline CNN–TCN architecture offers a more stable accuracy–latency trade-off. In more complex degradation settings involving multiple fault modes (FD003), pruning demonstrates the most consistent accuracy improvements, suggesting that sparsification can help remove redundant representations in these richer temporal patterns. For the most challenging scenario (FD004), both architectures perform similarly, and moderate pruning provides only limited gains, making the CNN–TCN baseline a practical choice when predictable runtime behavior is required. These observations highlight that compression strategies for edge prognostics should be selected with consideration of both model architecture and degradation complexity.

5. Conclusions

This study presented a deployment-oriented evaluation of compact neural networks for RUL prediction using the NASA C-MAPSS turbofan benchmark. Two lightweight hybrid architectures, CNN–GRU and CNN–TCN, were designed with approximately 28k–32k parameters to serve as realistic RUL predictors for CPU-based edge deployment scenarios. A systematic experimental analysis was conducted across the four C-MAPSS subsets (FD001–FD004), which represent increasing complexity of degradation in terms of operating conditions and fault modes. The evaluation included both baseline models and post-training compression strategies, specifically global unstructured magnitude pruning and dynamic INT8 quantization. To further assess real deployment behavior, inference latency was measured on two CPU platforms: a high-performance Intel x86 workstation and a resource-constrained ARM-based environment.
The results reveal several important insights. First, the CNN–GRU architecture generally achieved superior predictive accuracy across most datasets, whereas the CNN–TCN architecture consistently demonstrated lower inference latency due to its convolution-only temporal modeling and greater computational parallelism. Second, unstructured magnitude pruning yielded moderate improvements in RMSE across several datasets, suggesting that sparsification can serve as a form of regularization for compact models. However, pruning did not consistently reduce inference latency or serialized model size, largely due to the overhead of pruning masks and the lack of sparse-kernel acceleration on conventional CPUs. Third, dynamic quantization substantially reduced the CNN–GRU model size (by approximately 55%) while maintaining nearly identical predictive accuracy. Nevertheless, the quantized models exhibited higher inference latency on both evaluated CPU platforms due to the additional runtime quantization and dequantization operations.
These findings highlight a critical practical implication for edge prognostics: reductions in model size do not necessarily translate into improved runtime performance on general-purpose CPUs. For the compact RUL prediction models studied here, dynamic quantization primarily provides memory-efficiency benefits, whereas unstructured pruning should be interpreted mainly as an accuracy-oriented regularization mechanism rather than a reliable deployment-acceleration technique. At the same time, the observed increase in latency under dynamic quantization should not be interpreted as a universal limitation of low-precision inference. In practice, this overhead may be mitigated through static quantization or quantization-aware training, which can reduce runtime conversion costs by quantizing a larger portion of the computation graph ahead of inference. Likewise, deployment on hardware with stronger native INT8 support may allow quantized models to achieve both memory savings and actual speedup. Therefore, the effectiveness of compression strongly depends on the interaction among the model architecture, quantization method, software backend, and target hardware platform. Future work should investigate hardware-aware compression strategies, including static quantization, structured pruning, sparse-aware execution, and accelerator-targeted deployment, to better align compact PHM models with real-time edge constraints.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/machines14040375/s1.

Author Contributions

Conceptualization, A.N.B.; Methodology, A.N.B.; Software, A.N.B.; Validation, V.H. and M.O.; Formal analysis, V.H. and M.O.; Data curation, M.O.; Writing—original draft, A.N.B. and M.O.; Writing—review and editing, V.H. and M.O.; Supervision, V.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ferreira, C.; Goncalves, G. Remaining Useful Life prediction and challenges: A literature review on the use of Machine Learning Methods. J. Manuf. Syst. 2022, 63, 550–562. [Google Scholar] [CrossRef]
  2. Wang, Y.; Wu, M.; Li, X.; Xie, L.; Chen, Z. A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends. Mech. Syst. Signal Process. 2025, 229, 112449. [Google Scholar] [CrossRef]
  3. Bitam, T.; Yahiaoui, A.; Boubiche, D.E.; Martinez-Pelaez, R.; Toral-Cruz, H. Artificial Intelligence of Things for Next-Generation Predictive Maintenance. Sensors 2025, 25, 7636. [Google Scholar] [CrossRef]
  4. Baptista, M.L.; Mishra, M.; Henriques, E.; Prendinger, H. Using Explainable Artificial Intelligence to Interpret Remaining Useful Life Estimation with Gated Recurrent Unit. Proc. Annu. Conf. PHM Soc. 2024, 16, 1. [Google Scholar] [CrossRef]
  5. Shang, X.; Li, J.; Lou, T.; Wang, Z.; Pang, X.; Zhang, Z. Adaptive Remaining Useful Life Estimation of Rolling Bearings Using an Incremental Unscented Kalman Filter with Nonlinear Degradation Tracking. Machines 2025, 13, 1058. [Google Scholar] [CrossRef]
  6. Ren, L.; Liu, Y.; Wang, X.; Lu, J.; Jamal Deen, M. Cloud–Edge-Based Lightweight Temporal Convolutional Networks for Remaining Useful Life Prediction in IIoT. IEEE Internet Things J. 2021, 8, 12578–12587. [Google Scholar] [CrossRef]
  7. Hsu, H.Y.; Srivastava, G.; Wu, H.T.; Chen, M.Y. Edge Intelligence: Remaining useful life prediction based on state assessment using edge computing on deep learning. Comput. Commun. 2020, 160, 91–100. [Google Scholar] [CrossRef]
  8. Bakhtiyari, A.N.; Wang, Z.; Wang, L.; Zheng, H. A review on applications of artificial intelligence in modeling and optimization of laser beam machining. Opt. Laser Technol. 2021, 135, 106721. [Google Scholar] [CrossRef]
  9. Bakhtiyari, A.N.; Wu, Y.; Qi, D.; Zheng, H. Modeling temporal and spatial evolutions of laser-induced plasma characteristics by using machine learning algorithms. Optik 2023, 272, 170297. [Google Scholar] [CrossRef]
  10. Ngo, D.; Park, H.C.; Kang, B. Edge Intelligence: A Review of Deep Neural Network Deployment at the Edge. Electronics 2025, 14, 2495. [Google Scholar] [CrossRef]
  11. Iftikhar, M.; Shoaib, M.; Altaf, A.; Iqbal, F.; Villar, S.G.; Lopez, L.A.D.; Ashraf, I. A deep learning approach to optimize remaining useful life prediction for Li-ion batteries. Sci. Rep. 2024, 14, 25838. [Google Scholar] [CrossRef]
  12. Tan, W.M.; Teo, T.H. Remaining Useful Life Prediction Using Temporal Convolution with Attention. AI 2021, 2, 48–70. [Google Scholar] [CrossRef]
  13. Ellefsen, A.L.; Bjørlykhaug, E.; Æsøy, V.; Ushakov, S.; Zhang, H. Remaining Useful Life Prediction for Turbofan Engine Degradation Using Deep Learning. Reliab. Eng. Syst. Saf. 2019, 183, 103–115. [Google Scholar] [CrossRef]
  14. Qi, J.; Karimi, H.R.; Uhlmann, Y.; Chen, Z.; Li, W.; Schullerus, G. Uncertainty-aware sensorless anomaly detection using a reliable indicator from position-guided multi-step deep decomposition network. Reliab. Eng. Syst. Saf. 2026, 271, 112258. [Google Scholar] [CrossRef]
  15. Mauricio, A.; Qi, J.; Smith, W.; Randall, R.; Gryllias, K. Vibration Based Condition Monitoring of Planetary Gearboxes Operating Under Speed Varying Operating Conditions Based on Cyclo-non-stationary Analysis. In Proceedings of the 10th International Conference on Rotor Dynamics–IFToMM; Springer: Cham, Switzerland, 2018; pp. 265–279. [Google Scholar] [CrossRef]
  16. Saxena, A.; Goebel, K.; Simon, D.; Eklund, N. Damage propagation modeling for aircraft engine run-to-failure simulation. In Proceedings of the 2008 International Conference on Prognostics and Health Management, Denver, CO, USA, 6–9 October 2008; pp. 1–9. [Google Scholar] [CrossRef]
  17. Shang, J.; Xu, D.; Qiu, H.; Gao, L.; Jiang, C.; Yi, P. A novel data augmentation framework for remaining useful life estimation with a dense convolutional regression network. J. Manuf. Syst. 2024, 74, 30–40. [Google Scholar] [CrossRef]
  18. Li, Y.; Zhao, R.; Wang, J.; Chen, X. A Deep-Learning Method for Remaining Useful Life Prediction Based on Multisensor Data. Sensors 2025, 25, 497. [Google Scholar] [CrossRef]
  19. Laredo, D.; Chen, Z.; Schutze, O.; Sun, J.Q. A neural network-evolutionary computational framework for remaining useful life estimation of mechanical systems. Neural Netw. 2019, 116, 178–187. [Google Scholar] [CrossRef]
  20. Deng, S.; Zhou, J. Prediction of Remaining Useful Life of Aero-Engines Based on Deep Learning Models. J. Intell. Manuf. Syst. 2024, 5, 1–14. [Google Scholar] [CrossRef]
  21. Adducul, C.J.C.; Macasaet, J.R.I.; Tiglao, N.M.C.; Sun, J.Q. Edge-based Battery Remaining Useful Life Estimation Using Deep Learning. In Proceedings of the 2023 International Conference on Smart Applications, Communications and Networking (SmartNets), Istanbul, Turkiye, 25–27 July 2023; p. 10215733. [Google Scholar] [CrossRef]
  22. Qi, Y.; Wang, J.; Lu, H.; Chu, H.; Chen, D.; Jin, J. Remaining useful life prediction for lithium-ion battery based on hybrid machine learning with Wiener process. J. Energy Storage 2025, 136, 118415. [Google Scholar] [CrossRef]
  23. Chen, L.; Sun, J.; Zhang, K. Research on the Remaining Useful Life Prediction of Aero-Engines Based on Data-Driven Methods. Aerospace 2025, 12, 998. [Google Scholar] [CrossRef]
  24. Crespí-Castañer, L.; Font-Rosselló, J.; Bär, M.; Morán, A.; Frasser, C.F.; Canals, V.; Roca, M.; Rosselló, J.L. Pruning dense neural networks for efficient edge deployment in internet of things applications. Appl. Soft Comput. 2026, 186, 114175. [Google Scholar] [CrossRef]
  25. Zhu, X.; Li, L.; Wang, G.; Shi, N.; Li, Y.; Yang, X. Remaining Useful Life Prediction of Electric Drive Bearings in New Energy Vehicles: Based on Degradation Assessment and Spatiotemporal Feature Fusion. Machines 2025, 13, 914. [Google Scholar] [CrossRef]
  26. Wu, F.; Wu, Q.; Tan, Y.; Xu, X. Remaining Useful Life Prediction Based on Deep Learning: A Survey. Sensors 2024, 24, 3454. [Google Scholar] [CrossRef] [PubMed]
  27. Zheng, S.; Ristovski, K.; Farhat, A.K.; Gupta, C. Long Short-Term Memory Network for Remaining Useful Life estimation. In Proceedings of the IEEE International Conference on Prognostics and Health Management (ICPHM), Dallas, TX, USA, 19–21 June 2017; pp. 88–95. [Google Scholar] [CrossRef]
  28. Zhao, C.; Huang, X.; Li, Y.; Iqbal, M.Y. A Double-Channel Hybrid Deep Neural Network Based on CNN and BiLSTM for Remaining Useful Life Prediction. Sensors 2020, 20, 7109. [Google Scholar] [CrossRef]
Figure 1. Dual-panel architecture of the CNN–GRU and CNN–TCN models with compression overlays. Both models share a lightweight 1D CNN feature extractor.
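Figure 1 can be made concrete with a short PyTorch sketch of the two backbones. The layer widths, kernel sizes, and the 30-cycle, 14-sensor input window below are illustrative assumptions for networks of roughly this scale, not the exact configuration evaluated in this paper.

```python
# Minimal PyTorch sketch of the two compact backbones.
# Channel counts, kernel sizes, and the (14 sensors x 30 cycles) window
# are illustrative assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

class CNNFeatureExtractor(nn.Module):
    """Shared lightweight 1D CNN front-end over (batch, sensors, time)."""
    def __init__(self, in_channels=14, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):              # x: (B, C, T)
        return self.net(x)             # (B, hidden, T)

class CNNGRU(nn.Module):
    """CNN front-end followed by a single GRU layer and a linear RUL head."""
    def __init__(self, in_channels=14, hidden=32):
        super().__init__()
        self.cnn = CNNFeatureExtractor(in_channels, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.cnn(x).transpose(1, 2)    # (B, T, hidden)
        out, _ = self.gru(h)
        return self.head(out[:, -1])       # RUL regressed from last step

class CNNTCN(nn.Module):
    """CNN front-end followed by dilated convolutions (TCN-style head).
    Symmetric padding is used here for brevity; a strict TCN would use
    causal padding with trimming."""
    def __init__(self, in_channels=14, hidden=32):
        super().__init__()
        self.cnn = CNNFeatureExtractor(in_channels, hidden)
        self.tcn = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.tcn(self.cnn(x))          # (B, hidden, T)
        return self.head(h[:, :, -1])      # RUL regressed from last step

if __name__ == "__main__":
    x = torch.randn(8, 14, 30)             # 8 windows, 14 sensors, 30 cycles
    print(CNNGRU()(x).shape, CNNTCN()(x).shape)  # torch.Size([8, 1]) twice
```

The only structural difference between the two sketches is the temporal head, a recurrent GRU versus a stack of dilated convolutions, which is what drives the latency contrast reported in the tables below.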
Figure 2. RMSE comparison across compression methods.
Figure 3. Inference latency comparison across compression methods.
Table 1. Baseline performance of CNN–GRU and CNN–TCN across FD001–FD004.

| Dataset | Model | Intel i9 (x86) RMSE | AWS ARM RMSE | Intel i9 (x86) MAE | AWS ARM MAE | Intel i9 (x86) Latency (ms) | AWS ARM Latency (ms) |
|---|---|---|---|---|---|---|---|
| FD001 | CNN–GRU | 18.2089 | 17.9224 | 13.5509 | 13.7004 | 0.6832 ± 0.0015 | 1.8500 ± 0.0111 |
| FD001 | CNN–TCN | 25.8189 | 23.6640 | 19.8687 | 18.2159 | 0.6107 ± 0.0023 | 1.7536 ± 0.0056 |
| FD002 | CNN–GRU | 29.3623 | 30.1973 | 20.6280 | 21.1033 | 0.6740 ± 0.0016 | 1.8312 ± 0.0220 |
| FD002 | CNN–TCN | 30.9515 | 31.0999 | 20.7091 | 20.9767 | 0.5935 ± 0.0019 | 1.7754 ± 0.0140 |
| FD003 | CNN–GRU | 16.1901 | 17.0541 | 11.8834 | 12.7604 | 0.6867 ± 0.0018 | 1.8640 ± 0.0275 |
| FD003 | CNN–TCN | 23.2321 | 23.3896 | 17.3320 | 17.7252 | 0.6024 ± 0.0013 | 1.8113 ± 0.0062 |
| FD004 | CNN–GRU | 31.3279 | 31.7335 | 23.5949 | 23.5803 | 0.6897 ± 0.0023 | 1.8677 ± 0.0090 |
| FD004 | CNN–TCN | 31.6151 | 31.6296 | 23.7240 | 22.9898 | 0.6020 ± 0.0015 | 1.8880 ± 0.0180 |
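Latency figures such as those in Table 1 are sensitive to warm-up, thread count, and timer discipline. A minimal single-threaded CPU benchmark in this spirit is sketched below; the warm-up length, repeat count, and single-window batch are assumptions, not the exact measurement protocol used in this study.

```python
# Minimal CPU latency benchmark sketch: warm-up followed by repeated timing
# of a single-window forward pass. Counts are illustrative assumptions.
import time
import statistics
import torch

def benchmark_cpu_latency(model, input_shape=(1, 14, 30),
                          warmup=50, repeats=500):
    model.eval()
    torch.set_num_threads(1)               # remove thread-count variance
    x = torch.randn(*input_shape)
    with torch.inference_mode():
        for _ in range(warmup):            # warm caches and allocator
            model(x)
        times_ms = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            model(x)
            times_ms.append((time.perf_counter() - t0) * 1e3)
    return (statistics.mean(times_ms),
            statistics.stdev(times_ms),
            statistics.median(times_ms))

# Example (using the CNNGRU sketch above):
# mean_ms, std_ms, median_ms = benchmark_cpu_latency(CNNGRU())
```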
Table 2. Unstructured pruning summary for CNN–GRU (best setting per dataset on each CPU platform).

| Dataset | Platform | Baseline RMSE/MAE | Best Sparsity | Best Pruned RMSE/MAE | ΔRMSE (abs) | Latency (Median, ms): Baseline → Pruned |
|---|---|---|---|---|---|---|
| FD001 | Intel i9 (x86) | 18.21/13.55 | 0.9 | 16.70 ± 0.18/12.34 ± 0.14 | −1.51 | 0.6959 ± 0.0083 → 0.7154 ± 0.0077 |
| FD001 | AWS ARM | 17.92/13.70 | 0.9 | 15.83 ± 0.19/12.02 ± 0.13 | −2.09 | 1.8753 ± 0.0220 → 2.4249 ± 0.0449 |
| FD002 | Intel i9 (x86) | 29.36/20.63 | 0.3 | 29.13 ± 0.12/20.437 ± 0.10 | −0.22 | 0.7113 ± 0.0098 → 0.7106 ± 0.0031 |
| FD002 | AWS ARM | 30.20/21.10 | 0.3 | 28.61 ± 0.28/20.19 ± 0.19 | −1.59 | 1.8931 ± 0.0244 → 2.0550 ± 0.0394 |
| FD003 | Intel i9 (x86) | 16.19/11.88 | 0.3 | 16.51 ± 0.21/11.95 ± 0.13 | +0.32 | 1.8150 ± 0.0088 → 1.8628 ± 0.0062 |
| FD003 | AWS ARM | 17.05/12.76 | 0.9 | 15.46 ± 0.19/10.89 ± 0.11 | −1.60 | 1.8867 ± 0.0167 → 2.0043 ± 0.0117 |
| FD004 | Intel i9 (x86) | 31.33/23.59 | 0.7 | 30.26 ± 0.42/21.63 ± 0.31 | −1.06 | 0.6807 ± 0.0023 → 0.7085 ± 0.0016 |
| FD004 | AWS ARM | 31.73/23.58 | 0.7 | 30.51 ± 0.38/21.46 ± 0.29 | −1.21 | 1.8751 ± 0.0087 → 2.0335 ± 0.0216 |

The best pruned setting is selected by the lowest test RMSE among {0.0, 0.3, 0.5, 0.7, 0.9} after 5-epoch fine-tuning.
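For reference, this sweep maps directly onto `torch.nn.utils.prune`. The sketch below applies one global magnitude threshold across all convolutional and linear weights and then fine-tunes briefly; the optimizer settings and `train_loader` are illustrative assumptions, and GRU weights are left dense here for simplicity.

```python
# Sketch of global unstructured magnitude pruning followed by short
# fine-tuning, mirroring the sparsity sweep {0.0, 0.3, 0.5, 0.7, 0.9}
# with 5 recovery epochs. Optimizer settings and train_loader are assumed.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_and_finetune(model, train_loader, sparsity, epochs=5, lr=1e-3):
    # Collect prunable weight tensors (conv + linear layers).
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (nn.Conv1d, nn.Linear))]
    # One global threshold: the smallest-magnitude weights anywhere in the
    # network are masked to zero until `sparsity` is reached overall.
    prune.global_unstructured(params,
                              pruning_method=prune.L1Unstructured,
                              amount=sparsity)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):                # short recovery fine-tuning
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x).squeeze(-1), y)
            loss.backward()                # masks keep pruned weights at zero
            opt.step()
    return model
```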
Table 3. Unstructured pruning summary for CNN–TCN (best setting per dataset on each CPU platform).

| Dataset | Platform | Baseline RMSE/MAE | Best Sparsity | Best Pruned RMSE/MAE | ΔRMSE (abs) | Latency (Median, ms): Baseline → Pruned |
|---|---|---|---|---|---|---|
| FD001 | Intel i9 (x86) | 25.8189/19.8687 | 0.7 | 21.85 ± 0.92/16.66 ± 0.59 | −3.97 | 0.6065 ± 0.0107 → 0.6788 ± 0.0120 |
| FD001 | AWS ARM | 23.6640/18.2159 | 0.7 | 22.02 ± 0.41/17.03 ± 0.33 | −1.64 | 1.8624 ± 0.0125 → 2.3053 ± 0.0186 |
| FD002 | Intel i9 (x86) | 30.9515/20.7091 | 0.7 | 28.84 ± 0.38/19.77 ± 0.19 | −2.11 | 0.5943 ± 0.0029 → 1.8472 ± 0.0086 |
| FD002 | AWS ARM | 31.0999/20.9767 | 0.7 | 28.76 ± 0.53/19.19 ± 0.41 | −2.34 | 1.9213 ± 0.0086 → 2.2412 ± 0.0175 |
| FD003 | Intel i9 (x86) | 23.2321/17.3320 | 0.7 | 20.26 ± 0.39/15.36 ± 0.29 | −2.97 | 1.6387 ± 0.0163 → 0.6734 ± 0.0016 |
| FD003 | AWS ARM | 23.3896/17.7252 | 0.7 | 18.51 ± 0.33/14.11 ± 0.27 | −4.88 | 1.8900 ± 0.0183 → 2.2711 ± 0.0296 |
| FD004 | Intel i9 (x86) | 31.6151/23.7240 | 0.5 | 31.09 ± 0.55/22.89 ± 0.45 | −0.52 | 0.5992 ± 0.0055 → 0.6709 ± 0.0032 |
| FD004 | AWS ARM | 31.6296/22.9898 | 0.7 | 30.65 ± 0.39/21.36 ± 0.28 | −0.98 | 1.8961 ± 0.0122 → 2.3168 ± 0.0142 |

The best pruned setting is selected by the lowest test RMSE among {0.0, 0.3, 0.5, 0.7, 0.9} after 5-epoch fine-tuning.
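A practical caveat behind both pruning tables: `torch.nn.utils.prune` implements sparsity by keeping a dense `weight_orig` tensor plus a binary mask and recomputing the effective weight on every forward pass, so unstructured pruning by itself neither shrinks the stored model nor accelerates dense CPU kernels, and the mask multiply can add latency. The sketch below, a generic illustration assuming the standard PyTorch pruning utilities, folds the mask in permanently and verifies the achieved sparsity.

```python
# Making pruning permanent and verifying sparsity. prune.remove() folds the
# mask into 'weight', eliminating the per-forward mask multiply, but the
# tensor remains dense, so the serialized model size is unchanged.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def finalize_and_report(model):
    total, zeros = 0, 0
    for m in model.modules():
        if isinstance(m, (nn.Conv1d, nn.Linear)):
            if prune.is_pruned(m):
                prune.remove(m, "weight")   # fold mask into the weight
            w = m.weight.detach()
            total += w.numel()
            zeros += int((w == 0).sum())
    print(f"global sparsity over conv/linear weights: {zeros / max(total, 1):.1%}")
    return model
```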
Table 4. Baseline FP32 vs. dynamic post-training quantization (INT8) results for CNN–GRU and CNN–TCN on NASA C-MAPSS datasets (FD001–FD004).

| Dataset | Model | Platform | Variant | RMSE | MAE | Latency (ms, Mean ± Std) | Model Size (MB) |
|---|---|---|---|---|---|---|---|
| FD001 | CNN–GRU | Intel i9 | FP32 | 18.2089 | 13.5509 | 0.696 ± 0.011 | 0.113 |
| FD001 | CNN–GRU | Intel i9 | DQ | 18.2918 | 13.6228 | 1.299 ± 0.008 | 0.0506 |
| FD001 | CNN–GRU | AWS ARM | FP32 | 17.9224 | 13.7004 | 1.872 ± 0.006 | 0.113 |
| FD001 | CNN–GRU | AWS ARM | DQ | 17.7591 | 13.5557 | 3.029 ± 0.051 | 0.0506 |
| FD001 | CNN–TCN | Intel i9 | FP32 | 25.8189 | 19.8687 | 0.609 ± 0.009 | 0.133 |
| FD001 | CNN–TCN | Intel i9 | DQ | 25.9588 | 19.9701 | 0.720 ± 0.013 | 0.129 |
| FD001 | CNN–TCN | AWS ARM | FP32 | 23.6640 | 18.2159 | 1.933 ± 0.011 | 0.133 |
| FD001 | CNN–TCN | AWS ARM | DQ | 23.4814 | 18.0743 | 2.250 ± 0.036 | 0.129 |
| FD002 | CNN–GRU | Intel i9 | FP32 | 29.3623 | 20.6280 | 0.692 ± 0.010 | 0.113 |
| FD002 | CNN–GRU | Intel i9 | DQ | 29.3411 | 20.6608 | 1.323 ± 0.022 | 0.0506 |
| FD002 | CNN–GRU | AWS ARM | FP32 | 30.1973 | 21.1033 | 1.949 ± 0.062 | 0.113 |
| FD002 | CNN–GRU | AWS ARM | DQ | 30.3749 | 21.1801 | 2.940 ± 0.040 | 0.0506 |
| FD002 | CNN–TCN | Intel i9 | FP32 | 30.9515 | 20.7091 | 0.619 ± 0.015 | 0.133 |
| FD002 | CNN–TCN | Intel i9 | DQ | 31.0844 | 20.7995 | 0.734 ± 0.008 | 0.129 |
| FD002 | CNN–TCN | AWS ARM | FP32 | 31.0999 | 20.9767 | 1.919 ± 0.004 | 0.133 |
| FD002 | CNN–TCN | AWS ARM | DQ | 31.2126 | 21.0167 | 2.196 ± 0.018 | 0.129 |
| FD003 | CNN–GRU | Intel i9 | FP32 | 16.1901 | 11.8834 | 0.703 ± 0.011 | 0.113 |
| FD003 | CNN–GRU | Intel i9 | DQ | 16.2135 | 12.0223 | 1.303 ± 0.013 | 0.0506 |
| FD003 | CNN–GRU | AWS ARM | FP32 | 17.0541 | 12.7604 | 1.856 ± 0.012 | 0.113 |
| FD003 | CNN–GRU | AWS ARM | DQ | 17.1435 | 12.9021 | 2.866 ± 0.028 | 0.0506 |
| FD003 | CNN–TCN | Intel i9 | FP32 | 23.2321 | 17.3320 | 0.624 ± 0.014 | 0.133 |
| FD003 | CNN–TCN | Intel i9 | DQ | 23.0435 | 17.1187 | 0.708 ± 0.004 | 0.129 |
| FD003 | CNN–TCN | AWS ARM | FP32 | 23.3896 | 17.7252 | 1.899 ± 0.022 | 0.133 |
| FD003 | CNN–TCN | AWS ARM | DQ | 23.3319 | 17.6808 | 2.225 ± 0.037 | 0.129 |
| FD004 | CNN–GRU | Intel i9 | FP32 | 31.3279 | 23.5949 | 0.698 ± 0.013 | 0.113 |
| FD004 | CNN–GRU | Intel i9 | DQ | 31.2726 | 23.6367 | 1.301 ± 0.004 | 0.0506 |
| FD004 | CNN–GRU | AWS ARM | FP32 | 31.7335 | 23.5803 | 1.893 ± 0.020 | 0.113 |
| FD004 | CNN–GRU | AWS ARM | DQ | 31.5361 | 23.5741 | 3.051 ± 0.075 | 0.0506 |
| FD004 | CNN–TCN | Intel i9 | FP32 | 31.6151 | 23.7240 | 0.619 ± 0.008 | 0.133 |
| FD004 | CNN–TCN | Intel i9 | DQ | 31.3849 | 23.5850 | 0.711 ± 0.007 | 0.129 |
| FD004 | CNN–TCN | AWS ARM | FP32 | 31.6296 | 22.9898 | 1.970 ± 0.022 | 0.133 |
| FD004 | CNN–TCN | AWS ARM | DQ | 31.5220 | 22.9775 | 2.261 ± 0.027 | 0.129 |
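The DQ rows in Table 4 follow the standard dynamic post-training quantization pattern: weights of recurrent and linear modules are stored in INT8, while activations remain FP32 and are quantized on the fly at each call. This shrinks the parameter-heavy GRU substantially but adds per-inference conversion work, consistent with the size and latency trends above. A minimal sketch, assuming a recent PyTorch where the API lives under `torch.ao.quantization`:

```python
# Dynamic post-training INT8 quantization sketch. Only nn.GRU and nn.Linear
# are converted (dynamic quantization does not cover Conv1d), which is why
# CNN-GRU shrinks far more than CNN-TCN in Table 4.
import os
import torch
import torch.nn as nn

def quantize_dynamic_int8(model):
    return torch.ao.quantization.quantize_dynamic(
        model.eval(), {nn.GRU, nn.Linear}, dtype=torch.qint8)

def size_on_disk_mb(model, path="tmp_model.pt"):
    torch.save(model.state_dict(), path)   # serialize to measure file size
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

# Example (using the CNNGRU sketch above):
# fp32 = CNNGRU()
# dq = quantize_dynamic_int8(fp32)
# print(size_on_disk_mb(fp32), size_on_disk_mb(dq))  # roughly 0.11 -> 0.05 MB
```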
Table 5. Edge deployment suitability across degradation complexity levels based on baseline, pruning, and quantization results.

| Dataset | Operating Conditions/Fault Modes | Best Baseline Accuracy | Latency-Efficient Architecture | Most Effective Compression Strategy | Accuracy Impact | Latency Impact | Edge Deployment Recommendation |
|---|---|---|---|---|---|---|---|
| FD001 | One condition, one fault | CNN–GRU (lowest RMSE) | CNN–TCN | Pruning (0.7–0.9 sparsity) | Moderate RMSE improvement | Slight latency increase | CNN–TCN baseline for real-time edge deployment; CNN–GRU + pruning when accuracy is prioritized |
| FD002 | Multiple conditions, one fault | CNN–GRU | CNN–TCN | Pruning (~0.3–0.7 sparsity) | Small accuracy improvement | Minimal latency change | CNN–TCN baseline preferred for stable runtime performance |
| FD003 | One condition, two faults | CNN–GRU | CNN–TCN | Pruning (~0.7 sparsity) | Consistent RMSE reduction | Slight latency increase | CNN–GRU + pruning offers the best accuracy–size trade-off |
| FD004 | Multiple conditions, two faults | Comparable (CNN–GRU ≈ CNN–TCN) | CNN–TCN | Moderate pruning (0.5–0.7) | Small RMSE reduction | Minor latency increase | CNN–TCN baseline recommended for predictable runtime |